├── .gitignore ├── LICENSE ├── README.md ├── examples └── neo4j_weaviate_combined.ipynb ├── pyproject.toml ├── src └── ms_graphrag_neo4j │ ├── __init__.py │ ├── cypher_queries.py │ ├── ms_graphrag.py │ ├── prompts.py │ └── utils.py └── tests └── dev.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .ipynb_checkpoints 3 | .DS_Store 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Tomaz Bratanic 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MsGraphRAG-Neo4j 2 | 3 | A Neo4j implementation of Microsoft's GraphRAG approach for knowledge graph-based retrieval augmented generation based on QFS. 4 | Query-focused summarization (QFS) is a summarization task where the objective is to produce a concise and relevant summary of a given document or set of documents specifically tailored to answer or address a particular query or question provided by a user. 5 | 6 | Learn more about [GraphRAG](https://graphrag.com/). 7 | 8 | ## Overview 9 | 10 | MsGraphRAG-Neo4j is a Python library that implements Microsoft's GraphRAG methodology with Neo4j as the graph database backend. This library provides a seamless way to: 11 | 12 | 1. Extract entities and relationships from unstructured text 13 | 2. Build a knowledge graph in Neo4j 14 | 3. Generate summaries for nodes and relationships 15 | 4. Detect and summarize communities within the graph 16 | 5. Leverage this graph structure for enhanced RAG 17 | 18 | The implementation uses OpenAI's models for text processing and Neo4j's powerful graph capabilities including the Graph Data Science (GDS) library. 19 | 20 | > **⚠️ IMPORTANT NOTE**: This repository is experimental and provided as-is. The current implementation lacks optimizations for larger graphs, which may lead to exceptions or performance issues when processing substantial amounts of data. Use with caution in production environments and consider implementing additional error handling and optimization for large-scale deployments. 
21 | 22 | 23 | ## Features 24 | 25 | - **Entity and Relationship Extraction**: Extract structured information from unstructured text using LLMs 26 | - **Graph Construction**: Automatically build a knowledge graph in Neo4j 27 | - **Node and Relationship Summarization**: Generate concise summaries to improve retrieval 28 | - **Community Detection**: Use Neo4j GDS to identify clusters of related information 29 | - **Community Summarization**: Provide high-level descriptions of concept clusters 30 | - **Neo4j Integration**: Seamless integration with Neo4j database for persistent storage 31 | 32 | ## Installation 33 | 34 | ```bash 35 | pip install -e . 36 | ``` 37 | 38 | ## Requirements 39 | 40 | - Neo4j database (5.26+) 41 | - APOC plugin installed in Neo4j 42 | - Graph Data Science (GDS) library installed in Neo4j 43 | - OpenAI API key 44 | 45 | ## Quick Start 46 | 47 | ```python 48 | import asyncio 49 | import os 50 | from ms_graphrag_neo4j import MsGraphRAG 51 | from neo4j import GraphDatabase 52 | 53 | # Set your environment variables 54 | os.environ["OPENAI_API_KEY"] = "your-openai-api-key" 55 | os.environ["NEO4J_URI"] = "bolt://localhost:7687" 56 | os.environ["NEO4J_USERNAME"] = "neo4j" 57 | os.environ["NEO4J_PASSWORD"] = "password" 58 | 59 | # Connect to Neo4j 60 | driver = GraphDatabase.driver( 61 | os.environ["NEO4J_URI"], 62 | auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]) 63 | ) 64 | 65 | # Initialize MsGraphRAG 66 | ms_graph = MsGraphRAG(driver=driver, model='gpt-4o') 67 | 68 | # Define example texts and entity types 69 | example_texts = [ 70 | "Tomaz works for Neo4j", 71 | "Tomaz lives in Grosuplje", 72 | "Tomaz went to school in Grosuplje" 73 | ] 74 | allowed_entities = ["Person", "Organization", "Location"] 75 | 76 | # Extract entities and relationships (the extraction and summarization methods are async) 77 | result = asyncio.run(ms_graph.extract_nodes_and_rels(example_texts, allowed_entities)) 78 | print(result) 79 | 80 | # Generate summaries for nodes and relationships 81 | result = asyncio.run(ms_graph.summarize_nodes_and_rels()) 82 | print(result) 83 | 84 | # Identify and summarize communities 85 | result = asyncio.run(ms_graph.summarize_communities()) 86 | print(result) 87 | 88 | # Close the connection 89 | ms_graph.close() 90 | ``` 91 | 92 | ## How It Works 93 | 94 | 1. **Extract Nodes and Relationships**: The library uses OpenAI's models to extract entities and relationships from your text data, creating a structured graph. 95 | 96 | 2. **Summarize Nodes and Relationships**: Each entity and relationship is summarized to capture its essence across all mentions in the source documents. 97 | 98 | 3. **Community Detection**: The Leiden algorithm is applied to identify communities of related entities. 99 | 100 | 4. **Community Summarization**: Each community is summarized to provide a high-level understanding of the concepts it contains. A minimal example of querying the resulting graph is shown below.
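## Querying the Resulting Graph

Once the pipeline has run, everything is stored in Neo4j and can be inspected with plain Cypher through the `query()` helper. The snippet below is a minimal sketch of two common retrieval patterns: top-rated community summaries for broad ("global") questions, and the summarized neighborhood of a single entity for narrow ("local") questions. It assumes only the schema this library creates (`__Entity__`, `__Community__`, `IN_COMMUNITY`, `SUMMARIZED_RELATIONSHIP`); the entity name used in the lookup is illustrative and depends on what the extraction step actually produced in your graph.

```python
# Sketch: inspect the graph built by MsGraphRAG with plain Cypher.
# Run this while the connection from the Quick Start is still open (before ms_graph.close()).
# "TOMAZ" is only an example name -- check your graph for the entity names the LLM produced.

# Broad ("global") context: highest-rated community summaries
communities = ms_graph.query(
    """
    MATCH (c:__Community__)
    WHERE c.summary IS NOT NULL
    RETURN c.title AS title, c.rating AS rating, c.summary AS summary
    ORDER BY c.rating DESC
    LIMIT 3
    """
)

# Narrow ("local") context: one entity plus its summarized relationships
entity_context = ms_graph.query(
    """
    MATCH (e:__Entity__ {name: $name})
    OPTIONAL MATCH (e)-[r:SUMMARIZED_RELATIONSHIP]-(other:__Entity__)
    RETURN e.summary AS entity_summary,
           collect({neighbor: other.name, relationship: r.summary}) AS relationships
    """,
    params={"name": "TOMAZ"},
)

print(communities)
print(entity_context)
```

The example notebook in `examples/` combines the same idea with a Weaviate vector index: entities are first matched by vector similarity and then expanded with their source chunks, summarized relationships, and community reports before the context is passed to the LLM.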
101 | -------------------------------------------------------------------------------- /examples/neo4j_weaviate_combined.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "source": [ 31 | ">[Naive RAG vs GraphRAG with Neo4J & Weaviate](#scrollTo=n3QFDMrgAkCo)\n", 32 | "\n", 33 | ">>[Install Dependencies](#scrollTo=n3QFDMrgAkCo)\n", 34 | "\n", 35 | ">>[Write Documents to Weaviate Cloud](#scrollTo=nqwuGr0Xhgtm)\n", 36 | "\n", 37 | ">>[Classic RAG with OpenAI](#scrollTo=-uAAWPQXBUdX)\n", 38 | "\n", 39 | ">>[Graph RAG](#scrollTo=zzBnUF4bBYKG)\n", 40 | "\n", 41 | ">>>[Build a Graph with Neo4J](#scrollTo=zzBnUF4bBYKG)\n", 42 | "\n", 43 | ">>>[Extract Relevant Entities](#scrollTo=FVzpKJViBiJT)\n", 44 | "\n", 45 | ">>>[Summarize Nodes and Communities](#scrollTo=j1wAsUfIBrGc)\n", 46 | "\n", 47 | ">>>[Write the Entities to Weaviate](#scrollTo=n105cc-_B9bN)\n", 48 | "\n" 49 | ], 50 | "metadata": { 51 | "colab_type": "toc", 52 | "id": "GXFxB5p3CPfX" 53 | } 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "source": [ 58 | "# Naive RAG vs GraphRAG with Neo4J & Weaviate\n", 59 | "\n", 60 | "In this recipe, we will be walking through two ways of doing RAG:\n", 61 | "\n", 62 | "1. Classic RAG, where we do a simple vector search followed by answer generation based on this context\n", 63 | "2. Graph RAG, making use of vector search combined with a graph representation of our dataset, including community and node summaries\n", 64 | "\n", 65 | "For this example, we will be using a generated dataset called \"Financial Contracts\" that lists (fake) contracts signed between individuals and companies.\n", 66 | "\n", 67 | "## Install Dependencies" 68 | ], 69 | "metadata": { 70 | "id": "n3QFDMrgAkCo" 71 | } 72 | }, 73 | { 74 | "cell_type": "code", 75 | "source": [ 76 | "!pip install --quiet --upgrade git+https://github.com/neo4j-contrib/ms-graphrag-neo4j.git datasets weaviate-client neo4j-graphrag" 77 | ], 78 | "metadata": { 79 | "id": "Wahz0DAFT4p7" 80 | }, 81 | "execution_count": null, 82 | "outputs": [] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "source": [ 87 | "## Write Documents to Weaviate Cloud\n", 88 | "\n", 89 | "To get started, you can use a free Weaviate Sandbox.\n", 90 | "\n", 91 | "1. Create a cluster\n", 92 | "2. Take note of the cluster URL and API key\n", 93 | "3. Go to 'Embeddings' and turn it on."
94 | ], 95 | "metadata": { 96 | "id": "nqwuGr0Xhgtm" 97 | } 98 | }, 99 | { 100 | "cell_type": "code", 101 | "source": [ 102 | "import os\n", 103 | "from getpass import getpass\n", 104 | "\n", 105 | "if \"WEAVIATE_API_KEY\" not in os.environ:\n", 106 | " os.environ[\"WEAVIATE_API_KEY\"] = getpass(\"Weaviate API Key\")\n", 107 | "if \"WEAVIATE_URL\" not in os.environ:\n", 108 | " os.environ[\"WEAVIATE_URL\"] = getpass(\"Weaviate URL\")" 109 | ], 110 | "metadata": { 111 | "id": "chKCzqNS8CvX" 112 | }, 113 | "execution_count": null, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "source": [ 119 | "import weaviate\n", 120 | "from weaviate.auth import Auth\n", 121 | "\n", 122 | "client = weaviate.connect_to_weaviate_cloud(\n", 123 | " cluster_url=os.environ.get(\"WEAVIATE_URL\"),\n", 124 | " auth_credentials=Auth.api_key(os.environ.get(\"WEAVIATE_API_KEY\")),\n", 125 | ")" 126 | ], 127 | "metadata": { 128 | "id": "o5opqllkhzm6" 129 | }, 130 | "execution_count": 4, 131 | "outputs": [] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "source": [ 136 | "from weaviate.classes.config import Configure\n", 137 | "\n", 138 | "#client.collections.delete(\"Financial_contracts\")\n", 139 | "client.collections.create(\n", 140 | " \"Financial_contracts\",\n", 141 | " description=\"A dataset of financial contracts between indivicuals and/or companies, as well as information on the type of contract and who has authored them.\",\n", 142 | " vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),\n", 143 | ")" 144 | ], 145 | "metadata": { 146 | "id": "0ZzBFSLah2QQ" 147 | }, 148 | "execution_count": null, 149 | "outputs": [] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "source": [ 154 | "from datasets import load_dataset\n", 155 | "\n", 156 | "financial_dataset = load_dataset(\"weaviate/agents\", \"query-agent-financial-contracts\", split=\"train\", streaming=True)" 157 | ], 158 | "metadata": { 159 | "id": "KChqfZu8pTZP" 160 | }, 161 | "execution_count": null, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "source": [ 167 | "financial_collection = client.collections.get(\"Financial_contracts\")\n", 168 | "\n", 169 | "with financial_collection.batch.dynamic() as batch:\n", 170 | " for item in financial_dataset:\n", 171 | " batch.add_object(properties=item[\"properties\"])" 172 | ], 173 | "metadata": { 174 | "id": "sigF9C03iBki" 175 | }, 176 | "execution_count": 7, 177 | "outputs": [] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "source": [ 182 | "## Classic RAG with OpenAI" 183 | ], 184 | "metadata": { 185 | "id": "-uAAWPQXBUdX" 186 | } 187 | }, 188 | { 189 | "cell_type": "code", 190 | "source": [ 191 | "os.environ[\"OPENAI_API_KEY\"]= getpass(\"Openai API Key:\")" 192 | ], 193 | "metadata": { 194 | "id": "XOdJPeg9U_sl" 195 | }, 196 | "execution_count": null, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "source": [ 202 | "from openai import AsyncOpenAI\n", 203 | "\n", 204 | "openai_client = AsyncOpenAI()\n", 205 | "\n", 206 | "async def achat(messages, model=\"gpt-4o\", temperature=0, config={}):\n", 207 | " response = await openai_client.chat.completions.create(\n", 208 | " model=model,\n", 209 | " temperature=temperature,\n", 210 | " messages=messages,\n", 211 | " **config,\n", 212 | " )\n", 213 | " return response.choices[0].message.content" 214 | ], 215 | "metadata": { 216 | "id": "WtoZ2oCIwpRL" 217 | }, 218 | "execution_count": 10, 219 | "outputs": [] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "source": [ 224 | 
"async def classic_rag(input: str) -> str:\n", 225 | " context = [str(obj.properties) for obj in financial_collection.query.near_text(query = input, limit=3).objects]\n", 226 | " messages = [\n", 227 | " {\n", 228 | " \"role\": \"user\",\n", 229 | " \"content\": \"Based on the given context: {context} \\n\\n Answer the following question: {question}\".format(context=context, question=input)\n", 230 | " },\n", 231 | " ]\n", 232 | " output = await achat(messages, model=\"gpt-4o\")\n", 233 | " return output" 234 | ], 235 | "metadata": { 236 | "id": "S3oJUcHF7IhA" 237 | }, 238 | "execution_count": 11, 239 | "outputs": [] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "source": [ 244 | "response = await classic_rag(\"What do you know about Weaviate\")\n", 245 | "print(response)" 246 | ], 247 | "metadata": { 248 | "colab": { 249 | "base_uri": "https://localhost:8080/" 250 | }, 251 | "id": "6Cv88Dv883ST", 252 | "outputId": "a8617e0e-29b9-4dce-ef01-dd45de748b55" 253 | }, 254 | "execution_count": 12, 255 | "outputs": [ 256 | { 257 | "output_type": "stream", 258 | "name": "stdout", 259 | "text": [ 260 | "Based on the provided context, Weaviate is a corporation involved in multiple partnership agreements with OpenAI. Here are some details about Weaviate from the context:\n", 261 | "\n", 262 | "1. **Location and Legal Organization**:\n", 263 | " - In the first agreement dated March 15, 2022, Weaviate is described as a corporation organized under the laws of the State of Delaware, with its principal place of business at 123 Innovation Drive, Wilmington, DE.\n", 264 | " - In the second agreement dated April 5, 2023, Weaviate is located at 123 Innovation Drive, Tech City.\n", 265 | " - In the third agreement dated November 15, 2023, Weaviate is described as a corporation organized under the laws of the state of California, with its principal office located at 123 Innovation Drive, San Francisco, CA.\n", 266 | "\n", 267 | "2. **Partnerships with OpenAI**:\n", 268 | " - Weaviate has entered into multiple partnership agreements with OpenAI to collaborate on various projects, particularly in the field of artificial intelligence.\n", 269 | " - The agreements outline financial contributions, responsibilities, and profit-sharing arrangements between Weaviate and OpenAI.\n", 270 | "\n", 271 | "3. **Responsibilities and Contributions**:\n", 272 | " - In the agreements, Weaviate is responsible for project management, client relations, marketing, and promotion of joint projects.\n", 273 | " - Weaviate has made financial contributions to the partnerships, with specific amounts mentioned in each agreement.\n", 274 | "\n", 275 | "4. **Duration and Termination**:\n", 276 | " - The agreements have specified durations, typically two to three years, with provisions for early termination upon written notice.\n", 277 | "\n", 278 | "5. 
**Representation**:\n", 279 | " - Jane Doe and Johnathan Smith are mentioned as representatives of Weaviate in different agreements.\n", 280 | "\n", 281 | "Overall, Weaviate appears to be an active participant in collaborative projects with OpenAI, focusing on AI-driven initiatives and sharing responsibilities and profits with its partner.\n" 282 | ] 283 | } 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "source": [ 289 | "## Graph RAG\n", 290 | "\n", 291 | "### Build a Graph with Neo4J\n" 292 | ], 293 | "metadata": { 294 | "id": "zzBnUF4bBYKG" 295 | } 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 13, 300 | "metadata": { 301 | "id": "TKTTyhgzT16d" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "import os\n", 306 | "from getpass import getpass\n", 307 | "\n", 308 | "from ms_graphrag_neo4j import MsGraphRAG\n", 309 | "from neo4j import GraphDatabase\n", 310 | "import pandas as pd\n", 311 | "\n", 312 | "# Use Neo4j Sandbox - Blank Project https://sandbox.neo4j.com/\n", 313 | "\n", 314 | "os.environ[\"NEO4J_URI\"]=\"bolt://52.207.220.65:7687\"\n", 315 | "os.environ[\"NEO4J_USERNAME\"]=\"neo4j\"\n", 316 | "os.environ[\"NEO4J_PASSWORD\"]=\"dives-platform-eligibility\"" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "source": [ 322 | "driver = GraphDatabase.driver(\n", 323 | " os.environ[\"NEO4J_URI\"],\n", 324 | " auth=(os.environ[\"NEO4J_USERNAME\"], os.environ[\"NEO4J_PASSWORD\"]),\n", 325 | " #notifications_min_severity=\"OFF\",\n", 326 | ")\n", 327 | "ms_graph = MsGraphRAG(driver=driver, model=\"gpt-4o\", max_workers=10)" 328 | ], 329 | "metadata": { 330 | "id": "kYI-du6JUI00" 331 | }, 332 | "execution_count": 14, 333 | "outputs": [] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "source": [ 338 | "import pandas as pd\n", 339 | "\n", 340 | "# Login using e.g. `huggingface-cli login` to access this dataset\n", 341 | "df = pd.read_parquet(\"hf://datasets/weaviate/agents/query-agent/financial-contracts/0001.parquet\")\n", 342 | "df.head()" 343 | ], 344 | "metadata": { 345 | "id": "IkgiLkpyua2D" 346 | }, 347 | "execution_count": null, 348 | "outputs": [] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "source": [ 353 | "texts = [el['contract_text'] for el in df['properties']]\n", 354 | "texts[:2]" 355 | ], 356 | "metadata": { 357 | "colab": { 358 | "base_uri": "https://localhost:8080/" 359 | }, 360 | "id": "sB0MwT0X5cIJ", 361 | "outputId": "7af04b30-34a5-475f-9033-41408f04b724" 362 | }, 363 | "execution_count": 16, 364 | "outputs": [ 365 | { 366 | "output_type": "execute_result", 367 | "data": { 368 | "text/plain": [ 369 | "['PARTNERSHIP AGREEMENT\\n\\nThis Partnership Agreement (\"Agreement\") is made and entered into as of the 15th day of March, 2023, by and between Weaviate, a company registered in the State of California, and OpenAI, a research organization based in San Francisco, California.\\n\\n1. Purpose\\nThe parties agree to establish a partnership to collaborate on artificial intelligence research and development, sharing resources and expertise.\\n\\n2. Contributions\\nWeaviate shall contribute technology resources valued at $112.85 and staff time equivalent to a monetary value of $550.09. OpenAI shall contribute its research expertise and a project management team valued at $98.14.\\n\\n3. Profit Sharing\\nThe net profits generated from joint projects shall be distributed as follows: Weaviate shall receive 60% and OpenAI shall receive 40%.\\n\\n4. 
Duration\\nThis Agreement shall commence on the date hereof and shall continue in effect for a period of three (3) years, unless terminated earlier in accordance with the provisions herein.\\n\\n5. Termination\\nEither party may terminate this Agreement with a written notice of thirty (30) days. Upon termination, the parties shall settle any outstanding financial obligations.\\n\\nIN WITNESS WHEREOF, the parties hereto have executed this Partnership Agreement as of the date first above written.\\n\\nWeaviate\\n___________________________\\nMark Robson, CEO\\n\\nOpenAI\\n___________________________\\nDanny Williams, Director',\n", 370 | " 'LOAN AGREEMENT\\n\\nThis Loan Agreement (\"Agreement\") is made and entered into as of the 15th day of March, 2023, by and between Weaviate, a corporation organized and existing under the laws of the State, with a principal office located at 123 Innovation Drive, Tech City, and Mark Robson, an individual residing at 456 Creativity Lane, Idea Town.\\n\\n1. LOAN AMOUNT: Weaviate agrees to loan Mark Robson the total amount of $342.00 (Three Hundred Forty-Two Dollars).\\n\\n2. INTEREST RATE: The loan shall bear an interest rate of 2.68% per annum, calculated on the outstanding balance.\\n\\n3. REPAYMENT: The total amount, including interest, shall be repaid in full within a period of three years from the date of this Agreement. Payments shall be made monthly, with the first payment due on April 15, 2023.\\n\\n4. DEFAULT: In the event of default, the borrower agrees that Weaviate may demand immediate repayment of the entire unpaid balance and accrued interest.\\n\\n5. GOVERNING LAW: This Agreement shall be governed by and construed in accordance with the laws of the State.\\n\\nIN WITNESS WHEREOF, the parties hereto have executed this Loan Agreement as of the date first above written.\\n\\nWeaviate: ____________________ \\nAuthorized Signature\\n\\nMark Robson: __________________ \\nBorrower\\'s Signature']" 371 | ] 372 | }, 373 | "metadata": {}, 374 | "execution_count": 16 375 | } 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "source": [ 381 | "### Extract Relevant Entities\n", 382 | "\n", 383 | "Next, we will start extracting relevant entities and relations between these entities that we might be interested in." 
384 | ], 385 | "metadata": { 386 | "id": "FVzpKJViBiJT" 387 | } 388 | }, 389 | { 390 | "cell_type": "code", 391 | "source": [ 392 | "allowed_entities = [\"Person\", \"Organization\", \"Location\"]\n", 393 | "\n", 394 | "await ms_graph.extract_nodes_and_rels(texts, allowed_entities)" 395 | ], 396 | "metadata": { 397 | "colab": { 398 | "base_uri": "https://localhost:8080/", 399 | "height": 53 400 | }, 401 | "id": "7TuxXPv251Ia", 402 | "outputId": "0b9ac964-586a-4cc4-da15-59d971454188" 403 | }, 404 | "execution_count": 17, 405 | "outputs": [ 406 | { 407 | "output_type": "stream", 408 | "name": "stderr", 409 | "text": [ 410 | "Extracting nodes & relationships: 100%|██████████| 100/100 [00:34<00:00, 2.87it/s]\n" 411 | ] 412 | }, 413 | { 414 | "output_type": "execute_result", 415 | "data": { 416 | "text/plain": [ 417 | "'Successfuly extracted and imported 274 relationships'" 418 | ], 419 | "application/vnd.google.colaboratory.intrinsic+json": { 420 | "type": "string" 421 | } 422 | }, 423 | "metadata": {}, 424 | "execution_count": 17 425 | } 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "source": [ 431 | "### Summarize Nodes and Communities" 432 | ], 433 | "metadata": { 434 | "id": "j1wAsUfIBrGc" 435 | } 436 | }, 437 | { 438 | "cell_type": "code", 439 | "source": [ 440 | "await ms_graph.summarize_nodes_and_rels()" 441 | ], 442 | "metadata": { 443 | "colab": { 444 | "base_uri": "https://localhost:8080/", 445 | "height": 70 446 | }, 447 | "id": "_ZlhpN9E6cNO", 448 | "outputId": "85d52a58-cb30-43d0-aa77-1523fb79b32b" 449 | }, 450 | "execution_count": 18, 451 | "outputs": [ 452 | { 453 | "output_type": "stream", 454 | "name": "stderr", 455 | "text": [ 456 | "Summarizing nodes: 100%|██████████| 33/33 [00:13<00:00, 2.42it/s]\n", 457 | "Summarizing relationships: 100%|██████████| 33/33 [00:08<00:00, 3.87it/s]\n" 458 | ] 459 | }, 460 | { 461 | "output_type": "execute_result", 462 | "data": { 463 | "text/plain": [ 464 | "'Successfuly summarized nodes and relationships'" 465 | ], 466 | "application/vnd.google.colaboratory.intrinsic+json": { 467 | "type": "string" 468 | } 469 | }, 470 | "metadata": {}, 471 | "execution_count": 18 472 | } 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "source": [ 478 | "await ms_graph.summarize_communities()" 479 | ], 480 | "metadata": { 481 | "colab": { 482 | "base_uri": "https://localhost:8080/", 483 | "height": 209 484 | }, 485 | "id": "L1do4-EX7AAL", 486 | "outputId": "707af976-9fc7-4799-84e2-b51d7cbc7477" 487 | }, 488 | "execution_count": 21, 489 | "outputs": [ 490 | { 491 | "output_type": "stream", 492 | "name": "stdout", 493 | "text": [ 494 | "Leiden algorithm identified 1 community levels with 3 communities on the last level.\n" 495 | ] 496 | }, 497 | { 498 | "output_type": "stream", 499 | "name": "stderr", 500 | "text": [ 501 | "\n", 502 | "\n", 503 | "Summarizing communities: 0%| | 0/3 [00:00" 605 | ] 606 | }, 607 | "metadata": {}, 608 | "execution_count": 25 609 | } 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "source": [ 615 | "from datasets import IterableDataset\n", 616 | "\n", 617 | "# Define a simple generator\n", 618 | "def list_generator(data):\n", 619 | " for item in data:\n", 620 | " yield item\n", 621 | "\n", 622 | "# Create the IterableDataset\n", 623 | "entities_dataset = IterableDataset.from_generator(list_generator, gen_kwargs={\"data\": entities})" 624 | ], 625 | "metadata": { 626 | "id": "4TG2awNzqYpA" 627 | }, 628 | "execution_count": 26, 629 | "outputs": [] 630 | }, 631 | { 632 | "cell_type": "code", 633 | 
"source": [ 634 | "entities_collection = client.collections.get(\"Entities\")\n", 635 | "\n", 636 | "with entities_collection.batch.dynamic() as batch:\n", 637 | " for item in entities_dataset:\n", 638 | " batch.add_object(properties=item)" 639 | ], 640 | "metadata": { 641 | "id": "fCDnGfGXrKUd" 642 | }, 643 | "execution_count": 28, 644 | "outputs": [] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "source": [ 649 | "from neo4j_graphrag.retrievers import WeaviateNeo4jRetriever\n", 650 | "\n", 651 | "retrieval_query = \"\"\"\n", 652 | " WITH collect(node) as nodes\n", 653 | "WITH collect {\n", 654 | " UNWIND nodes as n\n", 655 | " MATCH (n)<-[:MENTIONS]->(c:__Chunk__)\n", 656 | " WITH c, count(distinct n) as freq\n", 657 | " RETURN c.text AS chunkText\n", 658 | " ORDER BY freq DESC\n", 659 | " LIMIT 3\n", 660 | "} AS text_mapping,\n", 661 | "collect {\n", 662 | " UNWIND nodes as n\n", 663 | " MATCH (n)-[:IN_COMMUNITY*]->(c:__Community__)\n", 664 | " WHERE c.summary IS NOT NULL\n", 665 | " WITH c, c.rating as rank\n", 666 | " RETURN c.summary\n", 667 | " ORDER BY rank DESC\n", 668 | " LIMIT 3\n", 669 | "} AS report_mapping,\n", 670 | "collect {\n", 671 | " UNWIND nodes as n\n", 672 | " MATCH (n)-[r:SUMMARIZED_RELATIONSHIP]-(m)\n", 673 | " WHERE m IN nodes\n", 674 | " RETURN r.summary AS descriptionText\n", 675 | " LIMIT 3\n", 676 | "} as insideRels,\n", 677 | "collect {\n", 678 | " UNWIND nodes as n\n", 679 | " RETURN n.summary AS descriptionText\n", 680 | "} as entities\n", 681 | "RETURN {Chunks: text_mapping, Reports: report_mapping,\n", 682 | " Relationships: insideRels,\n", 683 | " Entities: entities} AS output\n", 684 | " \"\"\"\n", 685 | "\n", 686 | "retriever = WeaviateNeo4jRetriever(\n", 687 | " driver=driver,\n", 688 | " client=client,\n", 689 | " collection=\"Entities\",\n", 690 | " id_property_external=\"entity_id\",\n", 691 | " id_property_neo4j=\"name\",\n", 692 | " retrieval_query=retrieval_query\n", 693 | ")" 694 | ], 695 | "metadata": { 696 | "id": "tEmPiLVdr8ao" 697 | }, 698 | "execution_count": 29, 699 | "outputs": [] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "source": [ 704 | "async def hybrid_local_search_rag(input: str) -> str:\n", 705 | " context = [str(el[1]) for el in retriever.search(query_text=input, top_k=3)]\n", 706 | " messages = [\n", 707 | " {\n", 708 | " \"role\": \"user\",\n", 709 | " \"content\": \"Based on the given context: {context} \\n\\n Answer the following question: {question}\".format(context=context, question=input)\n", 710 | " },\n", 711 | " ]\n", 712 | " output = await achat(messages, model=\"gpt-4o\")\n", 713 | " return output" 714 | ], 715 | "metadata": { 716 | "id": "VszQsBQ3v00Z" 717 | }, 718 | "execution_count": 30, 719 | "outputs": [] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "source": [ 724 | "response = await hybrid_local_search_rag(input=\"What do you know about Weaviate\")\n", 725 | "print(response)" 726 | ], 727 | "metadata": { 728 | "colab": { 729 | "base_uri": "https://localhost:8080/" 730 | }, 731 | "id": "EmgyW2XuwC0C", 732 | "outputId": "8c46af34-9c68-4707-f7e0-d88f45adbd34" 733 | }, 734 | "execution_count": 31, 735 | "outputs": [ 736 | { 737 | "output_type": "stream", 738 | "name": "stdout", 739 | "text": [ 740 | "Weaviate is a corporation organized under the laws of both the State of California and the State of Delaware. Its principal place of business is primarily located in San Francisco, CA, with additional offices at 123 Innovation Drive, Tech City, CA, and 123 Tech Lane, Silicon Valley, CA. 
The company is involved in a wide range of activities, including consulting, software development, data analysis, cloud storage, technical support, and project management services. Weaviate is actively engaged in partnerships to develop innovative AI solutions and advanced data processing technologies, contributing resources and expertise to these collaborations.\n", 741 | "\n", 742 | "The organization acts as both a lessor and a lessee in various lease agreements and is involved in multiple business relationships under Non-Disclosure Agreements. Weaviate also participates in sales and purchase order agreements, acting as both a buyer and a seller, and is involved in loan agreements as a lender. The company is responsible for marketing, promotion, and project management in its partnerships and provides services to clients such as OpenAI.\n", 743 | "\n", 744 | "Weaviate's involvement in these diverse activities highlights its role as a key player in the technology and AI sectors, leveraging its expertise to drive innovation and support its clients' needs. Additionally, Weaviate's principal office is located in Wilmington, DE.\n" 745 | ] 746 | } 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "source": [], 752 | "metadata": { 753 | "id": "Mf-S6xR19t78" 754 | }, 755 | "execution_count": null, 756 | "outputs": [] 757 | } 758 | ] 759 | } -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["hatchling"] 3 | build-backend = "hatchling.build" 4 | 5 | [project] 6 | name = "ms_graphrag_neo4j" 7 | version = "0.1.0" 8 | description = "A Neo4j implementation of Microsoft's GraphRAG approach for knowledge graph-based retrieval augmented generation" 9 | readme = "README.md" 10 | authors = [ 11 | {name = "Tomaz Bratanic", email = "tomaz.bratanic@neo4j.com"}, 12 | ] 13 | license = {text = "MIT"} 14 | classifiers = [ 15 | "Programming Language :: Python :: 3", 16 | "License :: OSI Approved :: MIT License", 17 | "Operating System :: OS Independent", 18 | ] 19 | requires-python = ">=3.8" 20 | dependencies = [ 21 | "neo4j>=5.28.1", 22 | "openai>=1.69.0" 23 | ] 24 | 25 | [project.urls] 26 | "Homepage" = "https://github.com/neo4j-contrib/ms-graphrag-neo4j" 27 | "Bug Tracker" = "https://github.com/neo4j-contrib/ms-graphrag-neo4j/issues" 28 | 29 | [tool.hatch.build.targets.wheel] 30 | packages = ["src/ms_graphrag_neo4j"] -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/__init__.py: -------------------------------------------------------------------------------- 1 | from ms_graphrag_neo4j.ms_graphrag import MsGraphRAG 2 | 3 | __all__ = ["MsGraphRAG"] 4 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/cypher_queries.py: -------------------------------------------------------------------------------- 1 | import_nodes_query = """ 2 | CREATE (c:__Chunk__ {id: $chunk_id}) 3 | SET c.text = $text 4 | WITH c 5 | UNWIND $data AS row 6 | MERGE (n:__Entity__ {name: row.entity_name}) 7 | SET n:$(row.entity_type), 8 | n.description = coalesce(n.description, []) + [row.entity_description] 9 | MERGE (n)<-[:MENTIONS]-(c) 10 | """ 11 | 12 | import_relationships_query = """ 13 | UNWIND $data AS row 14 | MERGE (s:__Entity__ {name: row.source_entity}) 15 | MERGE (t:__Entity__ {name: row.target_entity}) 16 | CREATE (s)-[r:RELATIONSHIP {description: row.relationship_description, strength: row.relationship_strength}]->(t) 17 | """ 18 | 19 | import_community_query
= """ 20 | UNWIND $data AS row 21 | MERGE (c:__Community__ {communityId: row.communityId}) 22 | SET c.title = row.community.title, 23 | c.summary = row.community.summary, 24 | c.rating = row.community.rating, 25 | c.rating_explanation = row.community.rating_explanation 26 | WITH c, row 27 | UNWIND row.nodes AS node 28 | MERGE (n:__Entity__ {name: node}) 29 | MERGE (n)-[:IN_COMMUNITY]->(c) 30 | """ 31 | 32 | candidate_nodes_summarization = """ 33 | MATCH (e:__Entity__) WHERE size(e.description) > 1 34 | RETURN e.name AS entity_name, e.description AS description_list 35 | """ 36 | 37 | candidate_rels_summarization = """ 38 | MATCH (s:__Entity__)-[r:RELATIONSHIP]-(t:__Entity__) 39 | WHERE elementId(s) < elementId(t) 40 | WITH s.name AS source, t.name AS target, 41 | collect(r.description) AS description_list, 42 | count(*) AS count 43 | WHERE count > 1 44 | RETURN source, target, description_list 45 | """ 46 | 47 | import_entity_summary = """ 48 | UNWIND $data AS row 49 | MATCH (e:__Entity__ {name: row.entity}) 50 | SET e.summary = row.summary 51 | """ 52 | 53 | import_entity_summary = """ 54 | UNWIND $data AS row 55 | MATCH (e:__Entity__ {name: row.entity}) 56 | SET e.summary = row.summary 57 | """ 58 | 59 | import_entity_summary_single = """ 60 | MATCH (e:__Entity__) 61 | WHERE size(e.description) = 1 62 | SET e.summary = e.description[0] 63 | """ 64 | 65 | import_rel_summary = """ 66 | UNWIND $data AS row 67 | MATCH (s:__Entity__ {name: row.source}), (t:__Entity__ {name: row.target}) 68 | MERGE (s)-[r:SUMMARIZED_RELATIONSHIP]-(t) 69 | SET r.summary = row.summary 70 | """ 71 | 72 | import_rel_summary_single = """ 73 | MATCH (s:__Entity__)-[e:RELATIONSHIP]-(t:__Entity__) 74 | WHERE NOT (s)-[:SUMMARIZED_RELATIONSHIP]-(t) 75 | MERGE (s)-[r:SUMMARIZED_RELATIONSHIP]-(t) 76 | SET r.summary = e.description 77 | """ 78 | 79 | drop_gds_graph_query = "CALL gds.graph.drop('entity', False) YIELD graphName" 80 | 81 | create_gds_graph_query = """ 82 | MATCH (source:__Entity__)-[r:RELATIONSHIP]->(target:__Entity__) 83 | WITH gds.graph.project('entity', source, target, {}, {undirectedRelationshipTypes: ['*']}) AS g 84 | RETURN 85 | g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels 86 | """ 87 | 88 | leiden_query = """ 89 | CALL gds.leiden.write("entity", {writeProperty:"communities", includeIntermediateCommunities: True}) 90 | """ 91 | 92 | community_hierarchy_query = """ 93 | MATCH (e:`__Entity__`) 94 | WHERE e.communities IS NOT NULL 95 | UNWIND range(0, size(e.communities) - 1 , 1) AS index 96 | CALL (e, index) { 97 | WITH e, index 98 | WHERE index = 0 99 | MERGE (c:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])}) 100 | ON CREATE SET c.level = index 101 | MERGE (e)-[:IN_COMMUNITY]->(c) 102 | RETURN count(*) AS count_0 103 | } 104 | CALL (e, index) { 105 | WITH e, index 106 | WHERE index > 0 107 | MERGE (current:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])}) 108 | ON CREATE SET current.level = index 109 | MERGE (previous:`__Community__` {id: toString(index - 1) + '-' + toString(e.communities[index - 1])}) 110 | ON CREATE SET previous.level = index - 1 111 | MERGE (previous)-[:IN_COMMUNITY]->(current) 112 | RETURN count(*) AS count_1 113 | } 114 | RETURN count(*) 115 | """ 116 | 117 | community_info_query = """ 118 | MATCH (c:`__Community__`)<-[:IN_COMMUNITY*]-(e:__Entity__) 119 | WHERE c.level IN $levels 120 | WITH c, collect(e ) AS nodes 121 | WHERE size(nodes) > 1 122 | CALL apoc.path.subgraphAll(nodes[0], { 123 | 
whitelistNodes:nodes, 124 | relationshipFilter: "SUMMARIZED_RELATIONSHIP" 125 | }) 126 | YIELD relationships 127 | RETURN c.id AS communityId, 128 | [n in nodes | {id: n.id, description: n.summary, type: [el in labels(n) WHERE el <> '__Entity__'][0]}] AS nodes, 129 | [r in relationships | {start: startNode(r).id, type: type(r), end: endNode(r).id, description: r.summary}] AS rels 130 | """ 131 | 132 | import_community_summary = """ 133 | UNWIND $data AS row 134 | MERGE (c:__Community__ {id: row.communityId}) 135 | SET c.title = row.community.title, 136 | c.summary = row.community.summary, 137 | c.rating = row.community.rating, 138 | c.rating_explanation = row.community.rating_explanation 139 | """ 140 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/ms_graphrag.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Any, Dict, List, Optional, Type 3 | from neo4j import Driver 4 | from openai import AsyncOpenAI 5 | import asyncio 6 | 7 | 8 | from tqdm.asyncio import tqdm, tqdm_asyncio 9 | 10 | 11 | from ms_graphrag_neo4j.cypher_queries import * 12 | from ms_graphrag_neo4j.utils import * 13 | from ms_graphrag_neo4j.prompts import * 14 | 15 | 16 | class MsGraphRAG: 17 | """ 18 | MsGraphRAG: Microsoft GraphRAG Implementation for Neo4j 19 | 20 | A class for implementing the Microsoft GraphRAG approach with Neo4j graph database. 21 | GraphRAG enhances retrieval-augmented generation by leveraging graph structures 22 | to provide context-aware information for LLM responses. 23 | 24 | This implementation features: 25 | - Entity and relationship extraction from unstructured text 26 | - Node and relationship summarization for improved retrieval 27 | - Community detection and summarization for concept clustering 28 | - Integration with OpenAI models for generation 29 | 30 | The class connects to Neo4j for graph storage and uses OpenAI for content generation 31 | and extraction, providing a seamless way to build knowledge graphs from text 32 | and perform graph-based retrieval. 33 | 34 | Requirements: 35 | - Neo4j database with APOC and GDS plugins installed 36 | - OpenAI API key for LLM interactions 37 | 38 | Example: 39 | ``` 40 | from ms_graphrag_neo4j import MsGraphRAG 41 | import os 42 | 43 | os.environ["OPENAI_API_KEY"]= "sk-proj-" 44 | os.environ["NEO4J_URI"]="bolt://localhost:7687" 45 | os.environ["NEO4J_USERNAME"]="neo4j" 46 | os.environ["NEO4J_PASSWORD"]="password" 47 | 48 | from neo4j import GraphDatabase 49 | driver = GraphDatabase.driver(os.environ["NEO4J_URI"], auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])) 50 | ms_graph = MsGraphRAG(driver=driver, model='gpt-4o') 51 | 52 | example_texts = ["Tomaz works for Neo4j", "Tomaz lives in Grosuplje", "Tomaz went to school in Grosuplje"] 53 | allowed_entities = ["Person", "Organization", "Location"] 54 | 55 | await ms_graph.extract_nodes_and_rels(example_texts, allowed_entities) 56 | 57 | await ms_graph.summarize_nodes_and_rels() 58 | 59 | await ms_graph.summarize_communities() 60 | ``` 61 | 62 | References: 63 | - Microsoft GraphRAG: https://github.com/microsoft/graphrag 64 | """ 65 | 66 | def __init__( 67 | self, 68 | driver: Driver, 69 | model: str = "gpt-4o", 70 | database: str = "neo4j", 71 | max_workers: int = 10, 72 | create_constraints: bool = True, 73 | ) -> None: 74 | """ 75 | Initialize MsGraphRAG with Neo4j driver and LLM. 
76 | 77 | Args: 78 | driver (Driver): Neo4j driver instance 79 | model (str, optional): The language model to use. Defaults to "gpt-4o". 80 | database (str, optional): Neo4j database name. Defaults to "neo4j". 81 | max_workers (int, optional): Maximum number of concurrent workers. Defaults to 10. 82 | create_constraints (bool, optional): Whether to create database constraints. Defaults to True. 83 | """ 84 | if not os.environ.get("OPENAI_API_KEY"): 85 | raise ValueError( 86 | "You need to define the `OPENAI_API_KEY` environment variable" 87 | ) 88 | 89 | self._driver = driver 90 | self.model = model 91 | self.max_workers = max_workers 92 | self._database = database 93 | self._openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY")) 94 | # Test for APOC 95 | try: 96 | self.query("CALL apoc.help('test')") 97 | except: 98 | raise ValueError("You need to install and allow APOC functions") 99 | # Test for GDS 100 | try: 101 | self.query("CALL gds.list('test')") 102 | except: 103 | raise ValueError("You need to install and allow GDS functions") 104 | if create_constraints: 105 | self.query( 106 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Chunk__) REQUIRE e.id IS UNIQUE;" 107 | ) 108 | self.query( 109 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Entity__) REQUIRE e.name IS UNIQUE;" 110 | ) 111 | self.query( 112 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Community__) REQUIRE e.id IS UNIQUE;" 113 | ) 114 | 115 | async def extract_nodes_and_rels( 116 | self, input_texts: list, allowed_entities: list 117 | ) -> str: 118 | """ 119 | Extract nodes and relationships from input texts using LLM and store them in Neo4j. 120 | 121 | Args: 122 | input_texts (list): List of text documents to process and extract entities from 123 | allowed_entities (list): List of entity types to extract from the texts 124 | 125 | Returns: 126 | str: Success message with count of extracted relationships 127 | 128 | Notes: 129 | - Uses parallel processing with tqdm progress tracking 130 | - Extracted entities and relationships are stored directly in Neo4j 131 | - Each text document is processed independently by the LLM 132 | """ 133 | 134 | async def process_text(input_text): 135 | prompt = GRAPH_EXTRACTION_PROMPT.format( 136 | entity_types=allowed_entities, 137 | input_text=input_text, 138 | tuple_delimiter=";", 139 | record_delimiter="|", 140 | completion_delimiter="\n\n", 141 | ) 142 | messages = [ 143 | {"role": "user", "content": prompt}, 144 | ] 145 | # Make the LLM call 146 | output = await self.achat(messages, model=self.model) 147 | # Construct JSON from output 148 | return parse_extraction_output(output.content) 149 | 150 | # Create tasks for all input texts 151 | tasks = [process_text(text) for text in input_texts] 152 | 153 | # Process tasks with tqdm progress bar 154 | # Use semaphore to limit concurrent tasks if max_workers is specified 155 | if self.max_workers: 156 | semaphore = asyncio.Semaphore(self.max_workers) 157 | 158 | async def process_with_semaphore(task): 159 | async with semaphore: 160 | return await task 161 | 162 | results = [] 163 | for task in tqdm.as_completed( 164 | [process_with_semaphore(task) for task in tasks], 165 | total=len(tasks), 166 | desc="Extracting nodes & relationships", 167 | ): 168 | results.append(await task) 169 | else: 170 | results = [] 171 | for task in tqdm.as_completed( 172 | tasks, total=len(tasks), desc="Extracting nodes & relationships" 173 | ): 174 | results.append(await task) 175 | 176 | total_relationships = 0 177 | # Import nodes and relationships 178 | 
for text, output in zip(input_texts, results): 179 | nodes, relationships = output 180 | total_relationships += len(relationships) 181 | # Import nodes 182 | self.query( 183 | import_nodes_query, 184 | params={"text": text, "chunk_id": get_hash(text), "data": nodes}, 185 | ) 186 | # Import relationships 187 | self.query(import_relationships_query, params={"data": relationships}) 188 | 189 | return f"Successfuly extracted and imported {total_relationships} relationships" 190 | 191 | async def summarize_nodes_and_rels(self) -> str: 192 | """ 193 | Generate summaries for all nodes and relationships in the graph. 194 | 195 | Returns: 196 | str: Success message indicating completion of summarization 197 | 198 | Notes: 199 | - Retrieves candidate nodes and relationships from Neo4j 200 | - Uses LLM to generate concise summaries for each entity and relationship 201 | - Stores summarized properties in the graph 202 | """ 203 | # Summarize nodes 204 | nodes = self.query(candidate_nodes_summarization) 205 | 206 | async def process_node(node): 207 | messages = [ 208 | { 209 | "role": "user", 210 | "content": SUMMARIZE_PROMPT.format( 211 | entity_name=node["entity_name"], 212 | description_list=node["description_list"], 213 | ), 214 | }, 215 | ] 216 | summary = await self.achat(messages, model=self.model) 217 | return {"entity": node["entity_name"], "summary": summary.content} 218 | 219 | # Create a progress bar for node processing with max_workers limit 220 | if self.max_workers: 221 | semaphore = asyncio.Semaphore(self.max_workers) 222 | 223 | async def process_with_semaphore(node): 224 | async with semaphore: 225 | return await process_node(node) 226 | 227 | summaries = await tqdm_asyncio.gather( 228 | *[process_with_semaphore(node) for node in nodes], 229 | desc="Summarizing nodes", 230 | ) 231 | else: 232 | summaries = await tqdm_asyncio.gather( 233 | *[process_node(node) for node in nodes], desc="Summarizing nodes" 234 | ) 235 | 236 | # Summarize relationships 237 | rels = self.query(candidate_rels_summarization) 238 | 239 | async def process_rel(rel): 240 | entity_name = f"{rel['source']} relationship to {rel['target']}" 241 | messages = [ 242 | { 243 | "role": "user", 244 | "content": SUMMARIZE_PROMPT.format( 245 | entity_name=entity_name, 246 | description_list=rel["description_list"], 247 | ), 248 | }, 249 | ] 250 | summary = await self.achat(messages, model=self.model) 251 | return { 252 | "source": rel["source"], 253 | "target": rel["target"], 254 | "summary": summary.content, 255 | } 256 | 257 | # Create a progress bar for relationship processing with max_workers limit 258 | if self.max_workers: 259 | semaphore = asyncio.Semaphore(self.max_workers) 260 | 261 | async def process_rel_with_semaphore(rel): 262 | async with semaphore: 263 | return await process_rel(rel) 264 | 265 | rel_summaries = await tqdm_asyncio.gather( 266 | *[process_rel_with_semaphore(rel) for rel in rels], 267 | desc="Summarizing relationships", 268 | ) 269 | else: 270 | rel_summaries = await tqdm_asyncio.gather( 271 | *[process_rel(rel) for rel in rels], desc="Summarizing relationships" 272 | ) 273 | 274 | # Import nodes 275 | self.query(import_entity_summary, params={"data": summaries}) 276 | self.query(import_entity_summary_single) 277 | 278 | # Import relationships 279 | self.query(import_rel_summary, params={"data": rel_summaries}) 280 | self.query(import_rel_summary_single) 281 | 282 | return "Successfuly summarized nodes and relationships" 283 | 284 | async def summarize_communities(self, summarize_all_levels: 
bool = False) -> str: 285 | """ 286 | Detect and summarize communities within the graph using the Leiden algorithm. 287 | 288 | Args: 289 | summarize_all_levels (bool, optional): Whether to summarize all community levels 290 | or just the final level. Defaults to False. 291 | 292 | Returns: 293 | str: Success message with count of generated community summaries 294 | 295 | Notes: 296 | - Uses Neo4j GDS library to run Leiden community detection algorithm 297 | - Generates hierarchical community structures in the graph 298 | - Uses LLM to create descriptive summaries of each community 299 | - The community summaries include key entities, relationships, and themes 300 | """ 301 | # Calculate communities 302 | self.query(drop_gds_graph_query) 303 | self.query(create_gds_graph_query) 304 | community_summary = self.query(leiden_query) 305 | community_levels = community_summary[0]["ranLevels"] 306 | print( 307 | f"Leiden algorithm identified {community_levels} community levels " 308 | f"with {community_summary[0]['communityCount']} communities on the last level." 309 | ) 310 | self.query(community_hierarchy_query) 311 | 312 | # Community summarization 313 | if summarize_all_levels: 314 | levels = list(range(community_levels)) 315 | else: 316 | levels = [community_levels - 1] 317 | communities = self.query(community_info_query, params={"levels": levels}) 318 | 319 | # Define async function for processing a single community 320 | async def process_community(community): 321 | input_text = f"""Entities: 322 | {community['nodes']} 323 | 324 | Relationships: 325 | {community['rels']}""" 326 | 327 | messages = [ 328 | { 329 | "role": "user", 330 | "content": COMMUNITY_REPORT_PROMPT.format(input_text=input_text), 331 | }, 332 | ] 333 | summary = await self.achat(messages, model=self.model) 334 | return { 335 | "community": extract_json(summary.content), 336 | "communityId": community["communityId"], 337 | } 338 | 339 | # Process all communities concurrently with tqdm progress bar and max_workers limit 340 | if self.max_workers: 341 | semaphore = asyncio.Semaphore(self.max_workers) 342 | 343 | async def process_community_with_semaphore(community): 344 | async with semaphore: 345 | return await process_community(community) 346 | 347 | community_summary = await tqdm_asyncio.gather( 348 | *( 349 | process_community_with_semaphore(community) 350 | for community in communities 351 | ), 352 | desc="Summarizing communities", 353 | total=len(communities), 354 | ) 355 | else: 356 | community_summary = await tqdm_asyncio.gather( 357 | *(process_community(community) for community in communities), 358 | desc="Summarizing communities", 359 | total=len(communities), 360 | ) 361 | 362 | self.query(import_community_summary, params={"data": community_summary}) 363 | return f"Generated {len(community_summary)} community summaries" 364 | 365 | def _check_driver_state(self) -> None: 366 | """ 367 | Check if the Neo4j driver is still available. 368 | 369 | Raises: 370 | RuntimeError: If the Neo4j driver has been closed. 371 | """ 372 | if not hasattr(self, "_driver"): 373 | raise RuntimeError( 374 | "This MsGraphRAG instance has been closed, and cannot be used anymore." 375 | ) 376 | 377 | def query( 378 | self, 379 | query: str, 380 | params: dict = {}, 381 | session_params: dict = {}, 382 | ) -> List[Dict[str, Any]]: 383 | """Query Neo4j database. 384 | 385 | Args: 386 | query (str): The Cypher query to execute. 387 | params (dict): The parameters to pass to the query. 
388 | session_params (dict): Parameters to pass to the session used for executing 389 | the query. 390 | 391 | Returns: 392 | List[Dict[str, Any]]: The list of dictionaries containing the query results. 393 | 394 | Raises: 395 | RuntimeError: If the connection has been closed. 396 | """ 397 | self._check_driver_state() 398 | from neo4j import Query 399 | from neo4j.exceptions import Neo4jError 400 | 401 | if not session_params: 402 | try: 403 | data, _, _ = self._driver.execute_query( 404 | Query(text=query), 405 | database_=self._database, 406 | parameters_=params, 407 | ) 408 | return [r.data() for r in data] 409 | except Neo4jError as e: 410 | if not ( 411 | ( 412 | ( # isCallInTransactionError 413 | e.code == "Neo.DatabaseError.Statement.ExecutionFailed" 414 | or e.code 415 | == "Neo.DatabaseError.Transaction.TransactionStartFailed" 416 | ) 417 | and e.message is not None 418 | and "in an implicit transaction" in e.message 419 | ) 420 | or ( # isPeriodicCommitError 421 | e.code == "Neo.ClientError.Statement.SemanticError" 422 | and e.message is not None 423 | and ( 424 | "in an open transaction is not possible" in e.message 425 | or "tried to execute in an explicit transaction" 426 | in e.message 427 | ) 428 | ) 429 | ): 430 | raise 431 | # fallback to allow implicit transactions 432 | session_params.setdefault("database", self._database) 433 | with self._driver.session(**session_params) as session: 434 | result = session.run(Query(text=query, timeout=self.timeout), params) 435 | return [r.data() for r in result] 436 | 437 | async def achat(self, messages, model="gpt-4o", temperature=0, config={}): 438 | response = await self._openai_client.chat.completions.create( 439 | model=model, 440 | temperature=temperature, 441 | messages=messages, 442 | **config, 443 | ) 444 | return response.choices[0].message 445 | 446 | def close(self) -> None: 447 | """ 448 | Explicitly close the Neo4j driver connection. 449 | 450 | Delegates connection management to the Neo4j driver. 451 | """ 452 | if hasattr(self, "_driver"): 453 | self._driver.close() 454 | # Remove the driver attribute to indicate closure 455 | delattr(self, "_driver") 456 | 457 | def __enter__(self) -> "MsGraphRAG": 458 | """ 459 | Enter the runtime context for the Neo4j graph connection. 460 | 461 | Enables use of the graph connection with the 'with' statement. 462 | This method allows for automatic resource management and ensures 463 | that the connection is properly handled. 464 | 465 | Returns: 466 | MsGraphRAG: The current graph connection instance 467 | 468 | Example: 469 | with MsGraphRAG(...) as graph: 470 | graph.query(...) # Connection automatically managed 471 | """ 472 | return self 473 | 474 | def __exit__( 475 | self, 476 | exc_type: Optional[Type[BaseException]], 477 | exc_val: Optional[BaseException], 478 | exc_tb: Optional[Any], 479 | ) -> None: 480 | """ 481 | Exit the runtime context for the Neo4j graph connection. 482 | 483 | This method is automatically called when exiting a 'with' statement. 484 | It ensures that the database connection is closed, regardless of 485 | whether an exception occurred during the context's execution. 
486 | 487 | Args: 488 | exc_type: The type of exception that caused the context to exit 489 | (None if no exception occurred) 490 | exc_val: The exception instance that caused the context to exit 491 | (None if no exception occurred) 492 | exc_tb: The traceback for the exception (None if no exception occurred) 493 | 494 | Note: 495 | Any exception is re-raised after the connection is closed. 496 | """ 497 | self.close() 498 | 499 | def __del__(self) -> None: 500 | """ 501 | Destructor for the Neo4j graph connection. 502 | 503 | This method is called during garbage collection to ensure that 504 | database resources are released if not explicitly closed. 505 | 506 | Caution: 507 | - Do not rely on this method for deterministic resource cleanup 508 | - Always prefer explicit .close() or context manager 509 | 510 | Best practices: 511 | 1. Use context manager: 512 | with MsGraphRAG(...) as graph: 513 | ... 514 | 2. Explicitly close: 515 | graph = MsGraphRAG(...) 516 | try: 517 | ... 518 | finally: 519 | graph.close() 520 | """ 521 | try: 522 | self.close() 523 | except Exception: 524 | # Suppress any exceptions during garbage collection 525 | pass 526 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/prompts.py: -------------------------------------------------------------------------------- 1 | GRAPH_EXTRACTION_PROMPT = """-Goal- 2 | Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities. 3 | 4 | -Steps- 5 | 1. Identify all entities. For each identified entity, extract the following information: 6 | - entity_name: Name of the entity, capitalized 7 | - entity_type: One of the following types: [{entity_types}] 8 | - entity_description: Comprehensive description of the entity's attributes and activities 9 | Format each entity as ("entity"{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}) 10 | 11 | 2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other. 12 | For each pair of related entities, extract the following information: 13 | - source_entity: name of the source entity, as identified in step 1 14 | - target_entity: name of the target entity, as identified in step 1 15 | - relationship_description: explanation as to why you think the source entity and the target entity are related to each other 16 | - relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity 17 | Format each relationship as ("relationship"{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}) 18 | 19 | 3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter. 20 | 21 | 4. When finished, output {completion_delimiter} 22 | 23 | ###################### 24 | -Examples- 25 | ###################### 26 | Example 1: 27 | Entity_types: ORGANIZATION,PERSON 28 | Text: 29 | The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%. 
30 | ###################### 31 | Output: 32 | ("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday) 33 | {record_delimiter} 34 | ("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution) 35 | {record_delimiter} 36 | ("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply) 37 | {record_delimiter} 38 | ("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9) 39 | {completion_delimiter} 40 | 41 | ###################### 42 | Example 2: 43 | Entity_types: ORGANIZATION 44 | Text: 45 | TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform. 46 | 47 | TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones. 48 | ###################### 49 | Output: 50 | ("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones) 51 | {record_delimiter} 52 | ("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal) 53 | {record_delimiter} 54 | ("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5) 55 | {completion_delimiter} 56 | 57 | ###################### 58 | Example 3: 59 | Entity_types: ORGANIZATION,GEO,PERSON 60 | Text: 61 | Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia. 62 | 63 | The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara. 64 | 65 | The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara. 66 | 67 | They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion. 68 | 69 | The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality. 
70 | ###################### 71 | Output: 72 | ("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages) 73 | {record_delimiter} 74 | ("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages) 75 | {record_delimiter} 76 | ("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages) 77 | {record_delimiter} 78 | {record_delimiter} 79 | ("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held) 80 | {record_delimiter} 81 | ("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara) 82 | {record_delimiter} 83 | ("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia) 84 | {record_delimiter} 85 | ("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison) 86 | {record_delimiter} 87 | ("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia) 88 | {record_delimiter} 89 | ("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage) 90 | {record_delimiter} 91 | ("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage) 92 | {record_delimiter} 93 | ("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2) 94 | {record_delimiter} 95 | ("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2) 96 | {record_delimiter} 97 | ("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2) 98 | {record_delimiter} 99 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8) 100 | {record_delimiter} 101 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2) 102 | {record_delimiter} 103 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2) 104 | {record_delimiter} 105 | ("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2) 106 | {record_delimiter} 107 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2) 108 | {record_delimiter} 109 | ("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2) 110 | {record_delimiter} 111 | ("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2) 112 | {completion_delimiter} 113 | 114 | ###################### 115 | -Real Data- 116 | 
###################### 117 | Entity_types: {entity_types} 118 | Text: {input_text} 119 | ###################### 120 | Output:""" 121 | 122 | SUMMARIZE_PROMPT = """ 123 | You are a helpful assistant responsible for generating a comprehensive summary of the data provided below. 124 | Given one or two entities, and a list of descriptions, all related to the same entity or group of entities. 125 | Please concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions. 126 | If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary. 127 | Make sure it is written in third person, and include the entity names so we have the full context. 128 | 129 | ####### 130 | -Data- 131 | Entities: {entity_name} 132 | Description List: {description_list} 133 | ####### 134 | Output: 135 | """ 136 | 137 | COMMUNITY_REPORT_PROMPT = """ 138 | You are an AI assistant that helps a human analyst to perform general information discovery. Information discovery is the process of identifying and assessing relevant information associated with certain entities (e.g., organizations and individuals) within a network. 139 | 140 | # Goal 141 | Write a comprehensive report of a community, given a list of entities that belong to the community as well as their relationships and optional associated claims. The report will be used to inform decision-makers about information associated with the community and their potential impact. The content of this report includes an overview of the community's key entities, their legal compliance, technical capabilities, reputation, and noteworthy claims. 142 | 143 | # Report Structure 144 | 145 | The report should include the following sections: 146 | 147 | - TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title. 148 | - SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities. 149 | - IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community. 150 | - RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating. 151 | - DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive. 152 | 153 | Return output as a well-formed JSON-formatted string with the following format: 154 | {{ 155 | "title": <report_title>, 156 | "summary": <executive_summary>, 157 | "rating": <impact_severity_rating>, 158 | "rating_explanation": <rating_explanation>, 159 | "findings": [ 160 | {{ 161 | "summary": <insight_1_summary>, 162 | "explanation": <insight_1_explanation> 163 | }}, 164 | {{ 165 | "summary": <insight_2_summary>, 166 | "explanation": <insight_2_explanation> 167 | }} 168 | ] 169 | }} 170 | 171 | # Grounding Rules 172 | 173 | Points supported by data should list their data references as follows: 174 | 175 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 176 | 177 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
178 | 179 | For example: 180 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]." 181 | 182 | where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record. 183 | 184 | Do not include information where the supporting evidence for it is not provided. 185 | 186 | 187 | # Example Input 188 | ----------- 189 | Text: 190 | 191 | Entities 192 | 193 | id,entity,description 194 | 5,VERDANT OASIS PLAZA,Verdant Oasis Plaza is the location of the Unity March 195 | 6,HARMONY ASSEMBLY,Harmony Assembly is an organization that is holding a march at Verdant Oasis Plaza 196 | 197 | Relationships 198 | 199 | id,source,target,description 200 | 37,VERDANT OASIS PLAZA,UNITY MARCH,Verdant Oasis Plaza is the location of the Unity March 201 | 38,VERDANT OASIS PLAZA,HARMONY ASSEMBLY,Harmony Assembly is holding a march at Verdant Oasis Plaza 202 | 39,VERDANT OASIS PLAZA,UNITY MARCH,The Unity March is taking place at Verdant Oasis Plaza 203 | 40,VERDANT OASIS PLAZA,TRIBUNE SPOTLIGHT,Tribune Spotlight is reporting on the Unity march taking place at Verdant Oasis Plaza 204 | 41,VERDANT OASIS PLAZA,BAILEY ASADI,Bailey Asadi is speaking at Verdant Oasis Plaza about the march 205 | 43,HARMONY ASSEMBLY,UNITY MARCH,Harmony Assembly is organizing the Unity March 206 | 207 | Output: 208 | {{ 209 | "title": "Verdant Oasis Plaza and Unity March", 210 | "summary": "The community revolves around the Verdant Oasis Plaza, which is the location of the Unity March. The plaza has relationships with the Harmony Assembly, Unity March, and Tribune Spotlight, all of which are associated with the march event.", 211 | "rating": 5.0, 212 | "rating_explanation": "The impact severity rating is moderate due to the potential for unrest or conflict during the Unity March.", 213 | "findings": [ 214 | {{ 215 | "summary": "Verdant Oasis Plaza as the central location", 216 | "explanation": "Verdant Oasis Plaza is the central entity in this community, serving as the location for the Unity March. This plaza is the common link between all other entities, suggesting its significance in the community. The plaza's association with the march could potentially lead to issues such as public disorder or conflict, depending on the nature of the march and the reactions it provokes. [Data: Entities (5), Relationships (37, 38, 39, 40, 41,+more)]" 217 | }}, 218 | {{ 219 | "summary": "Harmony Assembly's role in the community", 220 | "explanation": "Harmony Assembly is another key entity in this community, being the organizer of the march at Verdant Oasis Plaza. The nature of Harmony Assembly and its march could be a potential source of threat, depending on their objectives and the reactions they provoke. The relationship between Harmony Assembly and the plaza is crucial in understanding the dynamics of this community. [Data: Entities(6), Relationships (38, 43)]" 221 | }}, 222 | {{ 223 | "summary": "Unity March as a significant event", 224 | "explanation": "The Unity March is a significant event taking place at Verdant Oasis Plaza. This event is a key factor in the community's dynamics and could be a potential source of threat, depending on the nature of the march and the reactions it provokes. The relationship between the march and the plaza is crucial in understanding the dynamics of this community. 
[Data: Relationships (39)]" 225 | }}, 226 | {{ 227 | "summary": "Role of Tribune Spotlight", 228 | "explanation": "Tribune Spotlight is reporting on the Unity March taking place in Verdant Oasis Plaza. This suggests that the event has attracted media attention, which could amplify its impact on the community. The role of Tribune Spotlight could be significant in shaping public perception of the event and the entities involved. [Data: Relationships (40)]" 229 | }} 230 | ] 231 | }} 232 | 233 | 234 | # Real Data 235 | 236 | Use the following text for your answer. Do not make anything up in your answer. 237 | 238 | Text: 239 | {input_text} 240 | 241 | The report should include the following sections: 242 | 243 | - TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title. 244 | - SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities. 245 | - IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community. 246 | - RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating. 247 | - DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive. 248 | 249 | Return output as a well-formed JSON-formatted string with the following format: 250 | {{ 251 | "title": , 252 | "summary": , 253 | "rating": , 254 | "rating_explanation": , 255 | "findings": [ 256 | {{ 257 | "summary":, 258 | "explanation": 259 | }}, 260 | {{ 261 | "summary":, 262 | "explanation": 263 | }} 264 | ] 265 | }} 266 | 267 | # Grounding Rules 268 | 269 | Points supported by data should list their data references as follows: 270 | 271 | "This is an example sentence supported by multiple data references [Data: (record ids); (record ids)]." 272 | 273 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 274 | 275 | For example: 276 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]." 277 | 278 | where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record. 279 | 280 | Do not include information where the supporting evidence for it is not provided. 281 | 282 | Output:""" 283 | 284 | MAP_SYSTEM_PROMPT = """ 285 | ---Role--- 286 | 287 | You are a helpful assistant responding to questions about data in the tables provided. 288 | 289 | 290 | ---Goal--- 291 | 292 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 293 | 294 | You should use the data provided in the data tables below as the primary context for generating the response. 295 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 296 | 297 | Each key point in the response should have the following element: 298 | - Description: A comprehensive description of the point. 
299 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 300 | 301 | The response should be JSON formatted as follows: 302 | {{ 303 | "points": [ 304 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 305 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 306 | ] 307 | }} 308 | 309 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 310 | 311 | Points supported by data should list the relevant reports as references as follows: 312 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 313 | 314 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 315 | 316 | For example: 317 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 318 | 319 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 320 | 321 | Do not include information where the supporting evidence for it is not provided. 322 | 323 | 324 | ---Data tables--- 325 | 326 | {context_data} 327 | 328 | ---Goal--- 329 | 330 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 331 | 332 | You should use the data provided in the data tables below as the primary context for generating the response. 333 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 334 | 335 | Each key point in the response should have the following element: 336 | - Description: A comprehensive description of the point. 337 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 338 | 339 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 340 | 341 | Points supported by data should list the relevant reports as references as follows: 342 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 343 | 344 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 345 | 346 | For example: 347 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 348 | 349 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 350 | 351 | Do not include information where the supporting evidence for it is not provided. 
352 | 353 | The response should be JSON formatted as follows: 354 | {{ 355 | "points": [ 356 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 357 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 358 | ] 359 | }} 360 | """ 361 | 362 | REDUCE_SYSTEM_PROMPT = """ 363 | ---Role--- 364 | 365 | You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts. 366 | 367 | 368 | ---Goal--- 369 | 370 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 371 | 372 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 373 | 374 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 375 | 376 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 377 | 378 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 379 | 380 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 381 | 382 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 383 | 384 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 385 | 386 | For example: 387 | 388 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 389 | 390 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 391 | 392 | Do not include information where the supporting evidence for it is not provided. 393 | 394 | 395 | ---Target response length and format--- 396 | 397 | {response_type} 398 | 399 | 400 | ---Analyst Reports--- 401 | 402 | {report_data} 403 | 404 | 405 | ---Goal--- 406 | 407 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 408 | 409 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 410 | 411 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 412 | 413 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 414 | 415 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 
416 | 417 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 418 | 419 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 420 | 421 | For example: 422 | 423 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 424 | 425 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 426 | 427 | Do not include information where the supporting evidence for it is not provided. 428 | 429 | 430 | ---Target response length and format--- 431 | 432 | {response_type} 433 | 434 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 435 | """ 436 | 437 | LOCAL_SEARCH_SYSTEM_PROMPT = """ 438 | ---Role--- 439 | 440 | You are a helpful assistant responding to questions about data in the tables provided. 441 | 442 | 443 | ---Goal--- 444 | 445 | Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. 446 | 447 | If you don't know the answer, just say so. Do not make anything up. 448 | 449 | Points supported by data should list their data references as follows: 450 | 451 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 452 | 453 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 454 | 455 | For example: 456 | 457 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]." 458 | 459 | where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 460 | 461 | Do not include information where the supporting evidence for it is not provided. 462 | 463 | 464 | ---Target response length and format--- 465 | 466 | {response_type} 467 | 468 | 469 | ---Data tables--- 470 | 471 | {context_data} 472 | 473 | 474 | ---Goal--- 475 | 476 | Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. 477 | 478 | If you don't know the answer, just say so. Do not make anything up. 479 | 480 | Points supported by data should list their data references as follows: 481 | 482 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 483 | 484 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 485 | 486 | For example: 487 | 488 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
489 | 490 | where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 491 | 492 | Do not include information where the supporting evidence for it is not provided. 493 | 494 | 495 | ---Target response length and format--- 496 | 497 | {response_type} 498 | 499 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 500 | """ 501 | 502 | MAP_SYSTEM_PROMPT = """ 503 | ---Role--- 504 | 505 | You are a helpful assistant responding to questions about data in the tables provided. 506 | 507 | 508 | ---Goal--- 509 | 510 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 511 | 512 | You should use the data provided in the data tables below as the primary context for generating the response. 513 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 514 | 515 | Each key point in the response should have the following element: 516 | - Description: A comprehensive description of the point. 517 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 518 | 519 | The response should be JSON formatted as follows: 520 | {{ 521 | "points": [ 522 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 523 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 524 | ] 525 | }} 526 | 527 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 528 | 529 | Points supported by data should list the relevant reports as references as follows: 530 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 531 | 532 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 533 | 534 | For example: 535 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 536 | 537 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 538 | 539 | Do not include information where the supporting evidence for it is not provided. 540 | 541 | 542 | ---Data tables--- 543 | 544 | {context_data} 545 | 546 | ---Goal--- 547 | 548 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 549 | 550 | You should use the data provided in the data tables below as the primary context for generating the response. 551 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 552 | 553 | Each key point in the response should have the following element: 554 | - Description: A comprehensive description of the point. 555 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 
556 | 557 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 558 | 559 | Points supported by data should list the relevant reports as references as follows: 560 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 561 | 562 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 563 | 564 | For example: 565 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 566 | 567 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 568 | 569 | Do not include information where the supporting evidence for it is not provided. 570 | 571 | The response should be JSON formatted as follows: 572 | {{ 573 | "points": [ 574 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 575 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 576 | ] 577 | }} 578 | """ 579 | 580 | REDUCE_SYSTEM_PROMPT = """ 581 | ---Role--- 582 | 583 | You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts. 584 | 585 | 586 | ---Goal--- 587 | 588 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 589 | 590 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 591 | 592 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 593 | 594 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 595 | 596 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 597 | 598 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 599 | 600 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 601 | 602 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 603 | 604 | For example: 605 | 606 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 607 | 608 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 609 | 610 | Do not include information where the supporting evidence for it is not provided. 
611 | 612 | 613 | ---Target response length and format--- 614 | 615 | {response_type} 616 | 617 | 618 | ---Analyst Reports--- 619 | 620 | {report_data} 621 | 622 | 623 | ---Goal--- 624 | 625 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 626 | 627 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 628 | 629 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 630 | 631 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 632 | 633 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 634 | 635 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 636 | 637 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 638 | 639 | For example: 640 | 641 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 642 | 643 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 644 | 645 | Do not include information where the supporting evidence for it is not provided. 646 | 647 | 648 | ---Target response length and format--- 649 | 650 | {response_type} 651 | 652 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 653 | """ 654 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/utils.py: -------------------------------------------------------------------------------- 1 | import hashlib 2 | import json 3 | 4 | 5 | def get_hash(text): 6 | # Convert text to bytes if it's a string 7 | if isinstance(text, str): 8 | text = text.encode("utf-8") 9 | 10 | # MD5 is one of the fastest algorithms 11 | return hashlib.md5(text).hexdigest() 12 | 13 | 14 | def parse_extraction_output(output_str, record_delimiter=None, tuple_delimiter=None): 15 | """ 16 | Parse a structured output string containing "entity" and "relationship" records into lists of entity and relationship dictionaries. 17 | 18 | The expected format for each record is: 19 | 20 | ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>) 21 | 22 | or 23 | 24 | ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>) 25 | 26 | Records are separated by a record delimiter. The output string may end with a completion marker 27 | (for example, "{completion_delimiter}") which will be removed. 28 | 29 | If not provided, this function attempts to auto-detect: 30 | - record_delimiter: looks for "{record_delimiter}" then "|" then falls back to newlines. 31 | - tuple_delimiter: looks for "{tuple_delimiter}" then ";" then falls back to a tab. 32 | 33 | Parameters: 34 | output_str (str): The complete string output. 35 | record_delimiter (str, optional): The delimiter that separates records.
36 | tuple_delimiter (str, optional): The delimiter that separates fields within a record. 37 | 38 | Returns: 39 | Tuple[List[dict], List[dict]]: Two lists: the first contains entity dictionaries, the second contains relationship dictionaries. 40 | 41 | For an "entity", the dictionary has the keys: 42 | - record_type (always "entity") 43 | - entity_name 44 | - entity_type 45 | - entity_description 46 | 47 | For a "relationship", the dictionary has the keys: 48 | - record_type (always "relationship") 49 | - source_entity 50 | - target_entity 51 | - relationship_description 52 | - relationship_strength (as an int or float) 53 | """ 54 | # Remove the completion delimiter if present. 55 | completion_marker = "{completion_delimiter}" 56 | if completion_marker in output_str: 57 | output_str = output_str.replace(completion_marker, "") 58 | output_str = output_str.strip() 59 | 60 | # Determine the record delimiter if not provided. 61 | if record_delimiter is None: 62 | if "{record_delimiter}" in output_str: 63 | record_delimiter = "{record_delimiter}" 64 | elif "|" in output_str: 65 | record_delimiter = "|" 66 | else: 67 | # Fallback: split on newlines 68 | record_delimiter = "\n" 69 | 70 | # Determine the tuple delimiter if not provided. 71 | if tuple_delimiter is None: 72 | if "{tuple_delimiter}" in output_str: 73 | tuple_delimiter = "{tuple_delimiter}" 74 | elif ";" in output_str: 75 | tuple_delimiter = ";" 76 | else: 77 | tuple_delimiter = "\t" 78 | 79 | # Split the output into individual record strings. 80 | raw_records = [r.strip() for r in output_str.split(record_delimiter)] 81 | 82 | parsed_records = [] 83 | for rec in raw_records: 84 | if not rec: 85 | continue # skip empty strings 86 | 87 | # Remove leading/trailing parentheses if present. 88 | if rec.startswith("(") and rec.endswith(")"): 89 | rec = rec[1:-1] 90 | rec = rec.strip() 91 | 92 | # Split the record into tokens using the tuple delimiter. 93 | tokens = [token.strip() for token in rec.split(tuple_delimiter)] 94 | if not tokens: 95 | continue 96 | 97 | # The first token should be either "entity" or "relationship". 98 | rec_type = tokens[0].strip(" \"'").lower() 99 | 100 | if rec_type == "entity": 101 | if len(tokens) != 4: 102 | # Optionally log or raise an error for malformed records. 103 | continue 104 | record = { 105 | "record_type": "entity", 106 | "entity_name": tokens[1], 107 | "entity_type": tokens[2], 108 | "entity_description": tokens[3], 109 | } 110 | parsed_records.append(record) 111 | elif rec_type == "relationship": 112 | if len(tokens) != 5: 113 | continue 114 | # Attempt to convert relationship_strength to a number. 115 | try: 116 | strength = float(tokens[4]) 117 | # Convert to int if it has no fractional part. 118 | if strength.is_integer(): 119 | strength = int(strength) 120 | except ValueError: 121 | strength = tokens[4] 122 | record = { 123 | "record_type": "relationship", 124 | "source_entity": tokens[1], 125 | "target_entity": tokens[2], 126 | "relationship_description": tokens[3], 127 | "relationship_strength": strength, 128 | } 129 | parsed_records.append(record) 130 | else: 131 | # Unknown record type; skip it or handle accordingly.
132 | continue 133 | nodes = [el for el in parsed_records if el.get("record_type") == "entity"] 134 | relationships = [ 135 | el for el in parsed_records if el.get("record_type") == "relationship" 136 | ] 137 | return nodes, relationships 138 | 139 | 140 | def extract_json(input: str): 141 | return json.loads(input.removeprefix("```json").removesuffix("```").strip()) 142 | -------------------------------------------------------------------------------- /tests/dev.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "e57d11d7-c3b8-4c3f-87d8-2a34397952d1", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from ms_graphrag_neo4j import MsGraphRAG" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "874a900b-b1fa-4e90-91b5-fc50af8053c1", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import os\n", 21 | "\n", 22 | "os.environ[\"OPENAI_API_KEY\"]= \"sk-proj-\"\n", 23 | "os.environ[\"NEO4J_URI\"]=\"bolt://localhost:7687\"\n", 24 | "os.environ[\"NEO4J_USERNAME\"]=\"neo4j\"\n", 25 | "os.environ[\"NEO4J_PASSWORD\"]=\"password\"" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "id": "5e39e816-fb7e-4aa2-9943-f50af5a516a7", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "from neo4j import GraphDatabase" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "id": "0c5e397b-5ebe-4671-b64b-ddb3cf5440dc", 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "driver = GraphDatabase.driver(os.environ[\"NEO4J_URI\"], auth=(os.environ[\"NEO4J_USERNAME\"], os.environ[\"NEO4J_PASSWORD\"]))" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 5, 51 | "id": "cad48c77-34c2-4879-b19a-3a8cfceef316", 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "ms_graph = MsGraphRAG(driver=driver, model='gpt-4o')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 6, 61 | "id": "88583a09-0465-4017-a967-1a1feee1cb8f", 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stderr", 66 | "output_type": "stream", 67 | "text": [ 68 | "Extracting nodes & relationships: 100%|██████████| 3/3 [00:31<00:00, 10.42s/it]\n" 69 | ] 70 | }, 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "'Successfuly extracted and imported 2 relationships'" 75 | ] 76 | }, 77 | "execution_count": 6, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "await ms_graph.extract_nodes_and_rels([\"Tomaz works for Neo4j\", \"Tomaz lives in Grosuplje\", \"Tomaz went to school in Grosuplje\"], [\"Person\", \"Organization\", \"Location\"])" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 7, 89 | "id": "7a6aeb25-70a8-4132-8556-e7b7e36de199", 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "name": "stderr", 94 | "output_type": "stream", 95 | "text": [ 96 | "Summarizing nodes: 100%|██████████| 2/2 [00:00<00:00, 2.47it/s]\n", 97 | "Summarizing relationships: 0it [00:00, ?it/s]\n" 98 | ] 99 | }, 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "'Successfuly summarized nodes and relationships'" 104 | ] 105 | }, 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "await ms_graph.summarize_nodes_and_rels()" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 8, 118 | "id": 
"4769d5b0-7d27-4b91-8fee-49de179103c7", 119 | "metadata": { 120 | "scrolled": true 121 | }, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "Leiden algorithm identified 1 community levels with 1 communities on the last level.\n" 128 | ] 129 | }, 130 | { 131 | "name": "stderr", 132 | "output_type": "stream", 133 | "text": [ 134 | "Summarizing communities: 100%|██████████| 1/1 [00:05<00:00, 5.59s/it]\n" 135 | ] 136 | }, 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "'Generated 1 community summaries'" 141 | ] 142 | }, 143 | "execution_count": 8, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "await ms_graph.summarize_communities()" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "id": "80e35fc6-85ef-4c82-9b31-add6b2592756", 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3 (ipykernel)", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.11.5" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 5 182 | } 183 | --------------------------------------------------------------------------------