├── .gitignore ├── LICENSE ├── README.md ├── examples └── neo4j_weaviate_combined.ipynb ├── pyproject.toml ├── src └── ms_graphrag_neo4j │ ├── __init__.py │ ├── cypher_queries.py │ ├── ms_graphrag.py │ ├── prompts.py │ └── utils.py └── tests └── dev.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .ipynb_checkpoints 3 | .DS_Store 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Tomaz Bratanic 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MsGraphRAG-Neo4j 2 | 3 | A Neo4j implementation of Microsoft's GraphRAG approach for knowledge graph-based retrieval augmented generation based on QFS. 4 | Query-focused summarization (QFS) is a summarization task where the objective is to produce a concise and relevant summary of a given document or set of documents specifically tailored to answer or address a particular query or question provided by a user. 5 | 6 | Learn more about [GraphRAG](https://graphrag.com/). 7 | 8 | ## Overview 9 | 10 | MsGraphRAG-Neo4j is a Python library that implements Microsoft's GraphRAG methodology with Neo4j as the graph database backend. This library provides a seamless way to: 11 | 12 | 1. Extract entities and relationships from unstructured text 13 | 2. Build a knowledge graph in Neo4j 14 | 3. Generate summaries for nodes and relationships 15 | 4. Detect and summarize communities within the graph 16 | 5. Leverage this graph structure for enhanced RAG 17 | 18 | The implementation uses OpenAI's models for text processing and Neo4j's powerful graph capabilities including the Graph Data Science (GDS) library. 19 | 20 | > **⚠️ IMPORTANT NOTE**: This repository is experimental and provided as-is. The current implementation lacks optimizations for larger graphs, which may lead to exceptions or performance issues when processing substantial amounts of data. Use with caution in production environments and consider implementing additional error handling and optimization for large-scale deployments. 
21 | 22 | 23 | ## Features 24 | 25 | - **Entity and Relationship Extraction**: Extract structured information from unstructured text using LLMs 26 | - **Graph Construction**: Automatically build a knowledge graph in Neo4j 27 | - **Node and Relationship Summarization**: Generate concise summaries to improve retrieval 28 | - **Community Detection**: Use Neo4j GDS to identify clusters of related information 29 | - **Community Summarization**: Provide high-level descriptions of concept clusters 30 | - **Neo4j Integration**: Seamless integration with Neo4j database for persistent storage 31 | 32 | ## Installation 33 | 34 | ```bash 35 | pip install -e . 36 | ``` 37 | 38 | ## Requirements 39 | 40 | - Neo4j database (5.26+) 41 | - APOC plugin installed in Neo4j 42 | - Graph Data Science (GDS) library installed in Neo4j 43 | - OpenAI API key 44 | 45 | ## Quick Start 46 | 47 | ```python 48 | import asyncio 49 | import os 50 | from ms_graphrag_neo4j import MsGraphRAG 51 | from neo4j import GraphDatabase 52 | 53 | # Set your environment variables 54 | os.environ["OPENAI_API_KEY"] = "your-openai-api-key" 55 | os.environ["NEO4J_URI"] = "bolt://localhost:7687" 56 | os.environ["NEO4J_USERNAME"] = "neo4j" 57 | os.environ["NEO4J_PASSWORD"] = "password" 58 | 59 | # Connect to Neo4j 60 | driver = GraphDatabase.driver( 61 | os.environ["NEO4J_URI"], 62 | auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]) 63 | ) 64 | 65 | # Initialize MsGraphRAG 66 | ms_graph = MsGraphRAG(driver=driver, model='gpt-4o') 67 | 68 | # Define example texts and entity types 69 | example_texts = [ 70 | "Tomaz works for Neo4j", 71 | "Tomaz lives in Grosuplje", 72 | "Tomaz went to school in Grosuplje" 73 | ] 74 | allowed_entities = ["Person", "Organization", "Location"] 75 | 76 | # Extract entities and relationships (the extraction and summarization methods are async) 77 | result = asyncio.run(ms_graph.extract_nodes_and_rels(example_texts, allowed_entities)) 78 | print(result) 79 | 80 | # Generate summaries for nodes and relationships 81 | result = asyncio.run(ms_graph.summarize_nodes_and_rels()) 82 | print(result) 83 | 84 | # Identify and summarize communities 85 | result = asyncio.run(ms_graph.summarize_communities()) 86 | print(result) 87 | 88 | # Close the connection 89 | ms_graph.close() 90 | ``` 91 | 92 | ## How It Works 93 | 94 | 1. **Extract Nodes and Relationships**: The library uses OpenAI's models to extract entities and relationships from your text data, creating a structured graph. 95 | 96 | 2. **Summarize Nodes and Relationships**: Each entity and relationship is summarized to capture its essence across all mentions in the source documents. 97 | 98 | 3. **Community Detection**: The Leiden algorithm is applied to identify communities of related entities. 99 | 100 | 4. **Community Summarization**: Each community is summarized to provide a high-level understanding of the concepts it contains. A minimal example of querying the resulting graph is shown below.
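## Querying the Resulting Graph

Once the pipeline has run, everything is stored in Neo4j and can be inspected with plain Cypher through the `query()` helper. The snippet below is a minimal sketch of two common retrieval patterns: top-rated community summaries for broad ("global") questions, and the summarized neighborhood of a single entity for narrow ("local") questions. It assumes only the schema this library creates (`__Entity__`, `__Community__`, `IN_COMMUNITY`, `SUMMARIZED_RELATIONSHIP`); the entity name used in the lookup is illustrative and depends on what the extraction step actually produced in your graph.

```python
# Sketch: inspect the graph built by MsGraphRAG with plain Cypher.
# Run this while the connection from the Quick Start is still open (before ms_graph.close()).
# "TOMAZ" is only an example name -- check your graph for the entity names the LLM produced.

# Broad ("global") context: highest-rated community summaries
communities = ms_graph.query(
    """
    MATCH (c:__Community__)
    WHERE c.summary IS NOT NULL
    RETURN c.title AS title, c.rating AS rating, c.summary AS summary
    ORDER BY c.rating DESC
    LIMIT 3
    """
)

# Narrow ("local") context: one entity plus its summarized relationships
entity_context = ms_graph.query(
    """
    MATCH (e:__Entity__ {name: $name})
    OPTIONAL MATCH (e)-[r:SUMMARIZED_RELATIONSHIP]-(other:__Entity__)
    RETURN e.summary AS entity_summary,
           collect({neighbor: other.name, relationship: r.summary}) AS relationships
    """,
    params={"name": "TOMAZ"},
)

print(communities)
print(entity_context)
```

The example notebook in `examples/` combines the same idea with a Weaviate vector index: entities are first matched by vector similarity and then expanded with their source chunks, summarized relationships, and community reports before the context is passed to the LLM.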
101 | -------------------------------------------------------------------------------- /examples/neo4j_weaviate_combined.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "name": "python3", 11 | "display_name": "Python 3" 12 | }, 13 | "language_info": { 14 | "name": "python" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "source": [ 31 | ">[Naive RAG vs GraphRAG with Neo4J & Weaviate](#scrollTo=n3QFDMrgAkCo)\n", 32 | "\n", 33 | ">>[Install Dependencies](#scrollTo=n3QFDMrgAkCo)\n", 34 | "\n", 35 | ">>[Write Documents to Weaviate Cloud](#scrollTo=nqwuGr0Xhgtm)\n", 36 | "\n", 37 | ">>[Classic RAG with OpenAI](#scrollTo=-uAAWPQXBUdX)\n", 38 | "\n", 39 | ">>[Graph RAG](#scrollTo=zzBnUF4bBYKG)\n", 40 | "\n", 41 | ">>>[Build a Graph with Neo4J](#scrollTo=zzBnUF4bBYKG)\n", 42 | "\n", 43 | ">>>[Extract Relevant Entities](#scrollTo=FVzpKJViBiJT)\n", 44 | "\n", 45 | ">>>[Summarize Nodes and Communities](#scrollTo=j1wAsUfIBrGc)\n", 46 | "\n", 47 | ">>>[Write the Entities to Weaviate](#scrollTo=n105cc-_B9bN)\n", 48 | "\n" 49 | ], 50 | "metadata": { 51 | "colab_type": "toc", 52 | "id": "GXFxB5p3CPfX" 53 | } 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "source": [ 58 | "# Naive RAG vs GraphRAG with Neo4J & Weaviate\n", 59 | "\n", 60 | "In this recipe, we will be walking through two ways of doing RAG:\n", 61 | "\n", 62 | "1. Classic RAG, where we do a simple vector search followed by answer generation based on this context\n", 63 | "2. Graph RAG, making use of vector search combined with a graph representation of our dataset, including community and node summaries\n", 64 | "\n", 65 | "For this example, we will be using a generated dataset called \"Financial Contracts\" that lists (fake) contracts signed between individuals and companies.\n", 66 | "\n", 67 | "## Install Dependencies" 68 | ], 69 | "metadata": { 70 | "id": "n3QFDMrgAkCo" 71 | } 72 | }, 73 | { 74 | "cell_type": "code", 75 | "source": [ 76 | "!pip install --quiet --upgrade git+https://github.com/neo4j-contrib/ms-graphrag-neo4j.git datasets weaviate-client neo4j-graphrag" 77 | ], 78 | "metadata": { 79 | "id": "Wahz0DAFT4p7" 80 | }, 81 | "execution_count": null, 82 | "outputs": [] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "source": [ 87 | "## Write Documents to Weaviate Cloud\n", 88 | "\n", 89 | "To get started, you can use a free Weaviate Sandbox.\n", 90 | "\n", 91 | "1. Create a cluster\n", 92 | "2. Take note of the cluster URL and API key\n", 93 | "3. Go to 'Embeddings' and turn it on."
94 | ], 95 | "metadata": { 96 | "id": "nqwuGr0Xhgtm" 97 | } 98 | }, 99 | { 100 | "cell_type": "code", 101 | "source": [ 102 | "import os\n", 103 | "from getpass import getpass\n", 104 | "\n", 105 | "if \"WEAVIATE_API_KEY\" not in os.environ:\n", 106 | " os.environ[\"WEAVIATE_API_KEY\"] = getpass(\"Weaviate API Key\")\n", 107 | "if \"WEAVIATE_URL\" not in os.environ:\n", 108 | " os.environ[\"WEAVIATE_URL\"] = getpass(\"Weaviate URL\")" 109 | ], 110 | "metadata": { 111 | "id": "chKCzqNS8CvX" 112 | }, 113 | "execution_count": null, 114 | "outputs": [] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "source": [ 119 | "import weaviate\n", 120 | "from weaviate.auth import Auth\n", 121 | "\n", 122 | "client = weaviate.connect_to_weaviate_cloud(\n", 123 | " cluster_url=os.environ.get(\"WEAVIATE_URL\"),\n", 124 | " auth_credentials=Auth.api_key(os.environ.get(\"WEAVIATE_API_KEY\")),\n", 125 | ")" 126 | ], 127 | "metadata": { 128 | "id": "o5opqllkhzm6" 129 | }, 130 | "execution_count": 4, 131 | "outputs": [] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "source": [ 136 | "from weaviate.classes.config import Configure\n", 137 | "\n", 138 | "#client.collections.delete(\"Financial_contracts\")\n", 139 | "client.collections.create(\n", 140 | " \"Financial_contracts\",\n", 141 | " description=\"A dataset of financial contracts between indivicuals and/or companies, as well as information on the type of contract and who has authored them.\",\n", 142 | " vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),\n", 143 | ")" 144 | ], 145 | "metadata": { 146 | "id": "0ZzBFSLah2QQ" 147 | }, 148 | "execution_count": null, 149 | "outputs": [] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "source": [ 154 | "from datasets import load_dataset\n", 155 | "\n", 156 | "financial_dataset = load_dataset(\"weaviate/agents\", \"query-agent-financial-contracts\", split=\"train\", streaming=True)" 157 | ], 158 | "metadata": { 159 | "id": "KChqfZu8pTZP" 160 | }, 161 | "execution_count": null, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "source": [ 167 | "financial_collection = client.collections.get(\"Financial_contracts\")\n", 168 | "\n", 169 | "with financial_collection.batch.dynamic() as batch:\n", 170 | " for item in financial_dataset:\n", 171 | " batch.add_object(properties=item[\"properties\"])" 172 | ], 173 | "metadata": { 174 | "id": "sigF9C03iBki" 175 | }, 176 | "execution_count": 7, 177 | "outputs": [] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "source": [ 182 | "## Classic RAG with OpenAI" 183 | ], 184 | "metadata": { 185 | "id": "-uAAWPQXBUdX" 186 | } 187 | }, 188 | { 189 | "cell_type": "code", 190 | "source": [ 191 | "os.environ[\"OPENAI_API_KEY\"]= getpass(\"Openai API Key:\")" 192 | ], 193 | "metadata": { 194 | "id": "XOdJPeg9U_sl" 195 | }, 196 | "execution_count": null, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "source": [ 202 | "from openai import AsyncOpenAI\n", 203 | "\n", 204 | "openai_client = AsyncOpenAI()\n", 205 | "\n", 206 | "async def achat(messages, model=\"gpt-4o\", temperature=0, config={}):\n", 207 | " response = await openai_client.chat.completions.create(\n", 208 | " model=model,\n", 209 | " temperature=temperature,\n", 210 | " messages=messages,\n", 211 | " **config,\n", 212 | " )\n", 213 | " return response.choices[0].message.content" 214 | ], 215 | "metadata": { 216 | "id": "WtoZ2oCIwpRL" 217 | }, 218 | "execution_count": 10, 219 | "outputs": [] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "source": [ 224 | 
"async def classic_rag(input: str) -> str:\n", 225 | " context = [str(obj.properties) for obj in financial_collection.query.near_text(query = input, limit=3).objects]\n", 226 | " messages = [\n", 227 | " {\n", 228 | " \"role\": \"user\",\n", 229 | " \"content\": \"Based on the given context: {context} \\n\\n Answer the following question: {question}\".format(context=context, question=input)\n", 230 | " },\n", 231 | " ]\n", 232 | " output = await achat(messages, model=\"gpt-4o\")\n", 233 | " return output" 234 | ], 235 | "metadata": { 236 | "id": "S3oJUcHF7IhA" 237 | }, 238 | "execution_count": 11, 239 | "outputs": [] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "source": [ 244 | "response = await classic_rag(\"What do you know about Weaviate\")\n", 245 | "print(response)" 246 | ], 247 | "metadata": { 248 | "colab": { 249 | "base_uri": "https://localhost:8080/" 250 | }, 251 | "id": "6Cv88Dv883ST", 252 | "outputId": "a8617e0e-29b9-4dce-ef01-dd45de748b55" 253 | }, 254 | "execution_count": 12, 255 | "outputs": [ 256 | { 257 | "output_type": "stream", 258 | "name": "stdout", 259 | "text": [ 260 | "Based on the provided context, Weaviate is a corporation involved in multiple partnership agreements with OpenAI. Here are some details about Weaviate from the context:\n", 261 | "\n", 262 | "1. **Location and Legal Organization**:\n", 263 | " - In the first agreement dated March 15, 2022, Weaviate is described as a corporation organized under the laws of the State of Delaware, with its principal place of business at 123 Innovation Drive, Wilmington, DE.\n", 264 | " - In the second agreement dated April 5, 2023, Weaviate is located at 123 Innovation Drive, Tech City.\n", 265 | " - In the third agreement dated November 15, 2023, Weaviate is described as a corporation organized under the laws of the state of California, with its principal office located at 123 Innovation Drive, San Francisco, CA.\n", 266 | "\n", 267 | "2. **Partnerships with OpenAI**:\n", 268 | " - Weaviate has entered into multiple partnership agreements with OpenAI to collaborate on various projects, particularly in the field of artificial intelligence.\n", 269 | " - The agreements outline financial contributions, responsibilities, and profit-sharing arrangements between Weaviate and OpenAI.\n", 270 | "\n", 271 | "3. **Responsibilities and Contributions**:\n", 272 | " - In the agreements, Weaviate is responsible for project management, client relations, marketing, and promotion of joint projects.\n", 273 | " - Weaviate has made financial contributions to the partnerships, with specific amounts mentioned in each agreement.\n", 274 | "\n", 275 | "4. **Duration and Termination**:\n", 276 | " - The agreements have specified durations, typically two to three years, with provisions for early termination upon written notice.\n", 277 | "\n", 278 | "5. 
**Representation**:\n", 279 | " - Jane Doe and Johnathan Smith are mentioned as representatives of Weaviate in different agreements.\n", 280 | "\n", 281 | "Overall, Weaviate appears to be an active participant in collaborative projects with OpenAI, focusing on AI-driven initiatives and sharing responsibilities and profits with its partner.\n" 282 | ] 283 | } 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "source": [ 289 | "## Graph RAG\n", 290 | "\n", 291 | "### Build a Graph with Neo4J\n" 292 | ], 293 | "metadata": { 294 | "id": "zzBnUF4bBYKG" 295 | } 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 13, 300 | "metadata": { 301 | "id": "TKTTyhgzT16d" 302 | }, 303 | "outputs": [], 304 | "source": [ 305 | "import os\n", 306 | "from getpass import getpass\n", 307 | "\n", 308 | "from ms_graphrag_neo4j import MsGraphRAG\n", 309 | "from neo4j import GraphDatabase\n", 310 | "import pandas as pd\n", 311 | "\n", 312 | "# Use Neo4j Sandbox - Blank Project https://sandbox.neo4j.com/\n", 313 | "\n", 314 | "os.environ[\"NEO4J_URI\"]=\"bolt://52.207.220.65:7687\"\n", 315 | "os.environ[\"NEO4J_USERNAME\"]=\"neo4j\"\n", 316 | "os.environ[\"NEO4J_PASSWORD\"]=\"dives-platform-eligibility\"" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "source": [ 322 | "driver = GraphDatabase.driver(\n", 323 | " os.environ[\"NEO4J_URI\"],\n", 324 | " auth=(os.environ[\"NEO4J_USERNAME\"], os.environ[\"NEO4J_PASSWORD\"]),\n", 325 | " #notifications_min_severity=\"OFF\",\n", 326 | ")\n", 327 | "ms_graph = MsGraphRAG(driver=driver, model=\"gpt-4o\", max_workers=10)" 328 | ], 329 | "metadata": { 330 | "id": "kYI-du6JUI00" 331 | }, 332 | "execution_count": 14, 333 | "outputs": [] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "source": [ 338 | "import pandas as pd\n", 339 | "\n", 340 | "# Login using e.g. `huggingface-cli login` to access this dataset\n", 341 | "df = pd.read_parquet(\"hf://datasets/weaviate/agents/query-agent/financial-contracts/0001.parquet\")\n", 342 | "df.head()" 343 | ], 344 | "metadata": { 345 | "id": "IkgiLkpyua2D" 346 | }, 347 | "execution_count": null, 348 | "outputs": [] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "source": [ 353 | "texts = [el['contract_text'] for el in df['properties']]\n", 354 | "texts[:2]" 355 | ], 356 | "metadata": { 357 | "colab": { 358 | "base_uri": "https://localhost:8080/" 359 | }, 360 | "id": "sB0MwT0X5cIJ", 361 | "outputId": "7af04b30-34a5-475f-9033-41408f04b724" 362 | }, 363 | "execution_count": 16, 364 | "outputs": [ 365 | { 366 | "output_type": "execute_result", 367 | "data": { 368 | "text/plain": [ 369 | "['PARTNERSHIP AGREEMENT\\n\\nThis Partnership Agreement (\"Agreement\") is made and entered into as of the 15th day of March, 2023, by and between Weaviate, a company registered in the State of California, and OpenAI, a research organization based in San Francisco, California.\\n\\n1. Purpose\\nThe parties agree to establish a partnership to collaborate on artificial intelligence research and development, sharing resources and expertise.\\n\\n2. Contributions\\nWeaviate shall contribute technology resources valued at $112.85 and staff time equivalent to a monetary value of $550.09. OpenAI shall contribute its research expertise and a project management team valued at $98.14.\\n\\n3. Profit Sharing\\nThe net profits generated from joint projects shall be distributed as follows: Weaviate shall receive 60% and OpenAI shall receive 40%.\\n\\n4. 
Duration\\nThis Agreement shall commence on the date hereof and shall continue in effect for a period of three (3) years, unless terminated earlier in accordance with the provisions herein.\\n\\n5. Termination\\nEither party may terminate this Agreement with a written notice of thirty (30) days. Upon termination, the parties shall settle any outstanding financial obligations.\\n\\nIN WITNESS WHEREOF, the parties hereto have executed this Partnership Agreement as of the date first above written.\\n\\nWeaviate\\n___________________________\\nMark Robson, CEO\\n\\nOpenAI\\n___________________________\\nDanny Williams, Director',\n", 370 | " 'LOAN AGREEMENT\\n\\nThis Loan Agreement (\"Agreement\") is made and entered into as of the 15th day of March, 2023, by and between Weaviate, a corporation organized and existing under the laws of the State, with a principal office located at 123 Innovation Drive, Tech City, and Mark Robson, an individual residing at 456 Creativity Lane, Idea Town.\\n\\n1. LOAN AMOUNT: Weaviate agrees to loan Mark Robson the total amount of $342.00 (Three Hundred Forty-Two Dollars).\\n\\n2. INTEREST RATE: The loan shall bear an interest rate of 2.68% per annum, calculated on the outstanding balance.\\n\\n3. REPAYMENT: The total amount, including interest, shall be repaid in full within a period of three years from the date of this Agreement. Payments shall be made monthly, with the first payment due on April 15, 2023.\\n\\n4. DEFAULT: In the event of default, the borrower agrees that Weaviate may demand immediate repayment of the entire unpaid balance and accrued interest.\\n\\n5. GOVERNING LAW: This Agreement shall be governed by and construed in accordance with the laws of the State.\\n\\nIN WITNESS WHEREOF, the parties hereto have executed this Loan Agreement as of the date first above written.\\n\\nWeaviate: ____________________ \\nAuthorized Signature\\n\\nMark Robson: __________________ \\nBorrower\\'s Signature']" 371 | ] 372 | }, 373 | "metadata": {}, 374 | "execution_count": 16 375 | } 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "source": [ 381 | "### Extract Relevant Entities\n", 382 | "\n", 383 | "Next, we will start extracting relevant entities and relations between these entities that we might be interested in." 
384 | ], 385 | "metadata": { 386 | "id": "FVzpKJViBiJT" 387 | } 388 | }, 389 | { 390 | "cell_type": "code", 391 | "source": [ 392 | "allowed_entities = [\"Person\", \"Organization\", \"Location\"]\n", 393 | "\n", 394 | "await ms_graph.extract_nodes_and_rels(texts, allowed_entities)" 395 | ], 396 | "metadata": { 397 | "colab": { 398 | "base_uri": "https://localhost:8080/", 399 | "height": 53 400 | }, 401 | "id": "7TuxXPv251Ia", 402 | "outputId": "0b9ac964-586a-4cc4-da15-59d971454188" 403 | }, 404 | "execution_count": 17, 405 | "outputs": [ 406 | { 407 | "output_type": "stream", 408 | "name": "stderr", 409 | "text": [ 410 | "Extracting nodes & relationships: 100%|██████████| 100/100 [00:34<00:00, 2.87it/s]\n" 411 | ] 412 | }, 413 | { 414 | "output_type": "execute_result", 415 | "data": { 416 | "text/plain": [ 417 | "'Successfuly extracted and imported 274 relationships'" 418 | ], 419 | "application/vnd.google.colaboratory.intrinsic+json": { 420 | "type": "string" 421 | } 422 | }, 423 | "metadata": {}, 424 | "execution_count": 17 425 | } 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "source": [ 431 | "### Summarize Nodes and Communities" 432 | ], 433 | "metadata": { 434 | "id": "j1wAsUfIBrGc" 435 | } 436 | }, 437 | { 438 | "cell_type": "code", 439 | "source": [ 440 | "await ms_graph.summarize_nodes_and_rels()" 441 | ], 442 | "metadata": { 443 | "colab": { 444 | "base_uri": "https://localhost:8080/", 445 | "height": 70 446 | }, 447 | "id": "_ZlhpN9E6cNO", 448 | "outputId": "85d52a58-cb30-43d0-aa77-1523fb79b32b" 449 | }, 450 | "execution_count": 18, 451 | "outputs": [ 452 | { 453 | "output_type": "stream", 454 | "name": "stderr", 455 | "text": [ 456 | "Summarizing nodes: 100%|██████████| 33/33 [00:13<00:00, 2.42it/s]\n", 457 | "Summarizing relationships: 100%|██████████| 33/33 [00:08<00:00, 3.87it/s]\n" 458 | ] 459 | }, 460 | { 461 | "output_type": "execute_result", 462 | "data": { 463 | "text/plain": [ 464 | "'Successfuly summarized nodes and relationships'" 465 | ], 466 | "application/vnd.google.colaboratory.intrinsic+json": { 467 | "type": "string" 468 | } 469 | }, 470 | "metadata": {}, 471 | "execution_count": 18 472 | } 473 | ] 474 | }, 475 | { 476 | "cell_type": "code", 477 | "source": [ 478 | "await ms_graph.summarize_communities()" 479 | ], 480 | "metadata": { 481 | "colab": { 482 | "base_uri": "https://localhost:8080/", 483 | "height": 209 484 | }, 485 | "id": "L1do4-EX7AAL", 486 | "outputId": "707af976-9fc7-4799-84e2-b51d7cbc7477" 487 | }, 488 | "execution_count": 21, 489 | "outputs": [ 490 | { 491 | "output_type": "stream", 492 | "name": "stdout", 493 | "text": [ 494 | "Leiden algorithm identified 1 community levels with 3 communities on the last level.\n" 495 | ] 496 | }, 497 | { 498 | "output_type": "stream", 499 | "name": "stderr", 500 | "text": [ 501 | "\n", 502 | "\n", 503 | "Summarizing communities: 0%| | 0/3 [00:00" 605 | ] 606 | }, 607 | "metadata": {}, 608 | "execution_count": 25 609 | } 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "source": [ 615 | "from datasets import IterableDataset\n", 616 | "\n", 617 | "# Define a simple generator\n", 618 | "def list_generator(data):\n", 619 | " for item in data:\n", 620 | " yield item\n", 621 | "\n", 622 | "# Create the IterableDataset\n", 623 | "entities_dataset = IterableDataset.from_generator(list_generator, gen_kwargs={\"data\": entities})" 624 | ], 625 | "metadata": { 626 | "id": "4TG2awNzqYpA" 627 | }, 628 | "execution_count": 26, 629 | "outputs": [] 630 | }, 631 | { 632 | "cell_type": "code", 633 | 
"source": [ 634 | "entities_collection = client.collections.get(\"Entities\")\n", 635 | "\n", 636 | "with entities_collection.batch.dynamic() as batch:\n", 637 | " for item in entities_dataset:\n", 638 | " batch.add_object(properties=item)" 639 | ], 640 | "metadata": { 641 | "id": "fCDnGfGXrKUd" 642 | }, 643 | "execution_count": 28, 644 | "outputs": [] 645 | }, 646 | { 647 | "cell_type": "code", 648 | "source": [ 649 | "from neo4j_graphrag.retrievers import WeaviateNeo4jRetriever\n", 650 | "\n", 651 | "retrieval_query = \"\"\"\n", 652 | " WITH collect(node) as nodes\n", 653 | "WITH collect {\n", 654 | " UNWIND nodes as n\n", 655 | " MATCH (n)<-[:MENTIONS]->(c:__Chunk__)\n", 656 | " WITH c, count(distinct n) as freq\n", 657 | " RETURN c.text AS chunkText\n", 658 | " ORDER BY freq DESC\n", 659 | " LIMIT 3\n", 660 | "} AS text_mapping,\n", 661 | "collect {\n", 662 | " UNWIND nodes as n\n", 663 | " MATCH (n)-[:IN_COMMUNITY*]->(c:__Community__)\n", 664 | " WHERE c.summary IS NOT NULL\n", 665 | " WITH c, c.rating as rank\n", 666 | " RETURN c.summary\n", 667 | " ORDER BY rank DESC\n", 668 | " LIMIT 3\n", 669 | "} AS report_mapping,\n", 670 | "collect {\n", 671 | " UNWIND nodes as n\n", 672 | " MATCH (n)-[r:SUMMARIZED_RELATIONSHIP]-(m)\n", 673 | " WHERE m IN nodes\n", 674 | " RETURN r.summary AS descriptionText\n", 675 | " LIMIT 3\n", 676 | "} as insideRels,\n", 677 | "collect {\n", 678 | " UNWIND nodes as n\n", 679 | " RETURN n.summary AS descriptionText\n", 680 | "} as entities\n", 681 | "RETURN {Chunks: text_mapping, Reports: report_mapping,\n", 682 | " Relationships: insideRels,\n", 683 | " Entities: entities} AS output\n", 684 | " \"\"\"\n", 685 | "\n", 686 | "retriever = WeaviateNeo4jRetriever(\n", 687 | " driver=driver,\n", 688 | " client=client,\n", 689 | " collection=\"Entities\",\n", 690 | " id_property_external=\"entity_id\",\n", 691 | " id_property_neo4j=\"name\",\n", 692 | " retrieval_query=retrieval_query\n", 693 | ")" 694 | ], 695 | "metadata": { 696 | "id": "tEmPiLVdr8ao" 697 | }, 698 | "execution_count": 29, 699 | "outputs": [] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "source": [ 704 | "async def hybrid_local_search_rag(input: str) -> str:\n", 705 | " context = [str(el[1]) for el in retriever.search(query_text=input, top_k=3)]\n", 706 | " messages = [\n", 707 | " {\n", 708 | " \"role\": \"user\",\n", 709 | " \"content\": \"Based on the given context: {context} \\n\\n Answer the following question: {question}\".format(context=context, question=input)\n", 710 | " },\n", 711 | " ]\n", 712 | " output = await achat(messages, model=\"gpt-4o\")\n", 713 | " return output" 714 | ], 715 | "metadata": { 716 | "id": "VszQsBQ3v00Z" 717 | }, 718 | "execution_count": 30, 719 | "outputs": [] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "source": [ 724 | "response = await hybrid_local_search_rag(input=\"What do you know about Weaviate\")\n", 725 | "print(response)" 726 | ], 727 | "metadata": { 728 | "colab": { 729 | "base_uri": "https://localhost:8080/" 730 | }, 731 | "id": "EmgyW2XuwC0C", 732 | "outputId": "8c46af34-9c68-4707-f7e0-d88f45adbd34" 733 | }, 734 | "execution_count": 31, 735 | "outputs": [ 736 | { 737 | "output_type": "stream", 738 | "name": "stdout", 739 | "text": [ 740 | "Weaviate is a corporation organized under the laws of both the State of California and the State of Delaware. Its principal place of business is primarily located in San Francisco, CA, with additional offices at 123 Innovation Drive, Tech City, CA, and 123 Tech Lane, Silicon Valley, CA. 
The company is involved in a wide range of activities, including consulting, software development, data analysis, cloud storage, technical support, and project management services. Weaviate is actively engaged in partnerships to develop innovative AI solutions and advanced data processing technologies, contributing resources and expertise to these collaborations.\n", 741 | "\n", 742 | "The organization acts as both a lessor and a lessee in various lease agreements and is involved in multiple business relationships under Non-Disclosure Agreements. Weaviate also participates in sales and purchase order agreements, acting as both a buyer and a seller, and is involved in loan agreements as a lender. The company is responsible for marketing, promotion, and project management in its partnerships and provides services to clients such as OpenAI.\n", 743 | "\n", 744 | "Weaviate's involvement in these diverse activities highlights its role as a key player in the technology and AI sectors, leveraging its expertise to drive innovation and support its clients' needs. Additionally, Weaviate's principal office is located in Wilmington, DE.\n" 745 | ] 746 | } 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "source": [], 752 | "metadata": { 753 | "id": "Mf-S6xR19t78" 754 | }, 755 | "execution_count": null, 756 | "outputs": [] 757 | } 758 | ] 759 | } -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["hatchling"] 3 | build-backend = "hatchling.build" 4 | 5 | [project] 6 | name = "ms_graphrag_neo4j" 7 | version = "0.1.0" 8 | description = "A Neo4j implementation of Microsoft's GraphRAG approach for knowledge graph-based retrieval augmented generation" 9 | readme = "README.md" 10 | authors = [ 11 | {name = "Tomaz Bratanic", email = "tomaz.bratanic@neo4j.com"}, 12 | ] 13 | license = {text = "MIT"} 14 | classifiers = [ 15 | "Programming Language :: Python :: 3", 16 | "License :: OSI Approved :: MIT License", 17 | "Operating System :: OS Independent", 18 | ] 19 | requires-python = ">=3.8" 20 | dependencies = [ 21 | "neo4j>=5.28.1", 22 | "openai>=1.69.0" 23 | ] 24 | 25 | [project.urls] 26 | "Homepage" = "https://github.com/neo4j-contrib/ms-graphrag-neo4j" 27 | "Bug Tracker" = "https://github.com/neo4j-contrib/ms-graphrag-neo4j/issues" 28 | 29 | [tool.hatch.build.targets.wheel] 30 | packages = ["src/ms_graphrag_neo4j"] -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/__init__.py: -------------------------------------------------------------------------------- 1 | from ms_graphrag_neo4j.ms_graphrag import MsGraphRAG 2 | 3 | __all__ = ["MsGraphRAG"] 4 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/cypher_queries.py: -------------------------------------------------------------------------------- 1 | import_nodes_query = """ 2 | CREATE (c:__Chunk__ {id: $chunk_id}) 3 | SET c.text = $text 4 | WITH c 5 | UNWIND $data AS row 6 | MERGE (n:__Entity__ {name: row.entity_name}) 7 | SET n:$(row.entity_type), 8 | n.description = coalesce(n.description, []) + [row.entity_description] 9 | MERGE (n)<-[:MENTIONS]-(c) 10 | """ 11 | 12 | import_relationships_query = """ 13 | UNWIND $data AS row 14 | MERGE (s:__Entity__ {name: row.source_entity}) 15 | MERGE (t:__Entity__ {name: row.target_entity}) 16 | CREATE (s)-[r:RELATIONSHIP {description: row.relationship_description, strength: row.relationship_strength}]->(t) 17 | """ 18 | 19 | import_community_query
= """ 20 | UNWIND $data AS row 21 | MERGE (c:__Community__ {communityId: row.communityId}) 22 | SET c.title = row.community.title, 23 | c.summary = row.community.summary, 24 | c.rating = row.community.rating, 25 | c.rating_explanation = row.community.rating_explanation 26 | WITH c, row 27 | UNWIND row.nodes AS node 28 | MERGE (n:__Entity__ {name: node}) 29 | MERGE (n)-[:IN_COMMUNITY]->(c) 30 | """ 31 | 32 | candidate_nodes_summarization = """ 33 | MATCH (e:__Entity__) WHERE size(e.description) > 1 34 | RETURN e.name AS entity_name, e.description AS description_list 35 | """ 36 | 37 | candidate_rels_summarization = """ 38 | MATCH (s:__Entity__)-[r:RELATIONSHIP]-(t:__Entity__) 39 | WHERE elementId(s) < elementId(t) 40 | WITH s.name AS source, t.name AS target, 41 | collect(r.description) AS description_list, 42 | count(*) AS count 43 | WHERE count > 1 44 | RETURN source, target, description_list 45 | """ 46 | 47 | import_entity_summary = """ 48 | UNWIND $data AS row 49 | MATCH (e:__Entity__ {name: row.entity}) 50 | SET e.summary = row.summary 51 | """ 52 | 53 | import_entity_summary = """ 54 | UNWIND $data AS row 55 | MATCH (e:__Entity__ {name: row.entity}) 56 | SET e.summary = row.summary 57 | """ 58 | 59 | import_entity_summary_single = """ 60 | MATCH (e:__Entity__) 61 | WHERE size(e.description) = 1 62 | SET e.summary = e.description[0] 63 | """ 64 | 65 | import_rel_summary = """ 66 | UNWIND $data AS row 67 | MATCH (s:__Entity__ {name: row.source}), (t:__Entity__ {name: row.target}) 68 | MERGE (s)-[r:SUMMARIZED_RELATIONSHIP]-(t) 69 | SET r.summary = row.summary 70 | """ 71 | 72 | import_rel_summary_single = """ 73 | MATCH (s:__Entity__)-[e:RELATIONSHIP]-(t:__Entity__) 74 | WHERE NOT (s)-[:SUMMARIZED_RELATIONSHIP]-(t) 75 | MERGE (s)-[r:SUMMARIZED_RELATIONSHIP]-(t) 76 | SET r.summary = e.description 77 | """ 78 | 79 | drop_gds_graph_query = "CALL gds.graph.drop('entity', False) YIELD graphName" 80 | 81 | create_gds_graph_query = """ 82 | MATCH (source:__Entity__)-[r:RELATIONSHIP]->(target:__Entity__) 83 | WITH gds.graph.project('entity', source, target, {}, {undirectedRelationshipTypes: ['*']}) AS g 84 | RETURN 85 | g.graphName AS graph, g.nodeCount AS nodes, g.relationshipCount AS rels 86 | """ 87 | 88 | leiden_query = """ 89 | CALL gds.leiden.write("entity", {writeProperty:"communities", includeIntermediateCommunities: True}) 90 | """ 91 | 92 | community_hierarchy_query = """ 93 | MATCH (e:`__Entity__`) 94 | WHERE e.communities IS NOT NULL 95 | UNWIND range(0, size(e.communities) - 1 , 1) AS index 96 | CALL (e, index) { 97 | WITH e, index 98 | WHERE index = 0 99 | MERGE (c:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])}) 100 | ON CREATE SET c.level = index 101 | MERGE (e)-[:IN_COMMUNITY]->(c) 102 | RETURN count(*) AS count_0 103 | } 104 | CALL (e, index) { 105 | WITH e, index 106 | WHERE index > 0 107 | MERGE (current:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])}) 108 | ON CREATE SET current.level = index 109 | MERGE (previous:`__Community__` {id: toString(index - 1) + '-' + toString(e.communities[index - 1])}) 110 | ON CREATE SET previous.level = index - 1 111 | MERGE (previous)-[:IN_COMMUNITY]->(current) 112 | RETURN count(*) AS count_1 113 | } 114 | RETURN count(*) 115 | """ 116 | 117 | community_info_query = """ 118 | MATCH (c:`__Community__`)<-[:IN_COMMUNITY*]-(e:__Entity__) 119 | WHERE c.level IN $levels 120 | WITH c, collect(e ) AS nodes 121 | WHERE size(nodes) > 1 122 | CALL apoc.path.subgraphAll(nodes[0], { 123 | 
whitelistNodes:nodes, 124 | relationshipFilter: "SUMMARIZED_RELATIONSHIP" 125 | }) 126 | YIELD relationships 127 | RETURN c.id AS communityId, 128 | [n in nodes | {id: n.id, description: n.summary, type: [el in labels(n) WHERE el <> '__Entity__'][0]}] AS nodes, 129 | [r in relationships | {start: startNode(r).id, type: type(r), end: endNode(r).id, description: r.summary}] AS rels 130 | """ 131 | 132 | import_community_summary = """ 133 | UNWIND $data AS row 134 | MERGE (c:__Community__ {id: row.communityId}) 135 | SET c.title = row.community.title, 136 | c.summary = row.community.summary, 137 | c.rating = row.community.rating, 138 | c.rating_explanation = row.community.rating_explanation 139 | """ 140 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/ms_graphrag.py: -------------------------------------------------------------------------------- 1 | import os 2 | from typing import Any, Dict, List, Optional, Type 3 | from neo4j import Driver 4 | from openai import AsyncOpenAI 5 | import asyncio 6 | 7 | 8 | from tqdm.asyncio import tqdm, tqdm_asyncio 9 | 10 | 11 | from ms_graphrag_neo4j.cypher_queries import * 12 | from ms_graphrag_neo4j.utils import * 13 | from ms_graphrag_neo4j.prompts import * 14 | 15 | 16 | class MsGraphRAG: 17 | """ 18 | MsGraphRAG: Microsoft GraphRAG Implementation for Neo4j 19 | 20 | A class for implementing the Microsoft GraphRAG approach with Neo4j graph database. 21 | GraphRAG enhances retrieval-augmented generation by leveraging graph structures 22 | to provide context-aware information for LLM responses. 23 | 24 | This implementation features: 25 | - Entity and relationship extraction from unstructured text 26 | - Node and relationship summarization for improved retrieval 27 | - Community detection and summarization for concept clustering 28 | - Integration with OpenAI models for generation 29 | 30 | The class connects to Neo4j for graph storage and uses OpenAI for content generation 31 | and extraction, providing a seamless way to build knowledge graphs from text 32 | and perform graph-based retrieval. 33 | 34 | Requirements: 35 | - Neo4j database with APOC and GDS plugins installed 36 | - OpenAI API key for LLM interactions 37 | 38 | Example: 39 | ``` 40 | from ms_graphrag_neo4j import MsGraphRAG 41 | import os 42 | 43 | os.environ["OPENAI_API_KEY"]= "sk-proj-" 44 | os.environ["NEO4J_URI"]="bolt://localhost:7687" 45 | os.environ["NEO4J_USERNAME"]="neo4j" 46 | os.environ["NEO4J_PASSWORD"]="password" 47 | 48 | from neo4j import GraphDatabase 49 | driver = GraphDatabase.driver(os.environ["NEO4J_URI"], auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])) 50 | ms_graph = MsGraphRAG(driver=driver, model='gpt-4o') 51 | 52 | example_texts = ["Tomaz works for Neo4j", "Tomaz lives in Grosuplje", "Tomaz went to school in Grosuplje"] 53 | allowed_entities = ["Person", "Organization", "Location"] 54 | 55 | await ms_graph.extract_nodes_and_rels(example_texts, allowed_entities) 56 | 57 | await ms_graph.summarize_nodes_and_rels() 58 | 59 | await ms_graph.summarize_communities() 60 | ``` 61 | 62 | References: 63 | - Microsoft GraphRAG: https://github.com/microsoft/graphrag 64 | """ 65 | 66 | def __init__( 67 | self, 68 | driver: Driver, 69 | model: str = "gpt-4o", 70 | database: str = "neo4j", 71 | max_workers: int = 10, 72 | create_constraints: bool = True, 73 | ) -> None: 74 | """ 75 | Initialize MsGraphRAG with Neo4j driver and LLM. 
76 | 77 | Args: 78 | driver (Driver): Neo4j driver instance 79 | model (str, optional): The language model to use. Defaults to "gpt-4o". 80 | database (str, optional): Neo4j database name. Defaults to "neo4j". 81 | max_workers (int, optional): Maximum number of concurrent workers. Defaults to 10. 82 | create_constraints (bool, optional): Whether to create database constraints. Defaults to True. 83 | """ 84 | if not os.environ.get("OPENAI_API_KEY"): 85 | raise ValueError( 86 | "You need to define the `OPENAI_API_KEY` environment variable" 87 | ) 88 | 89 | self._driver = driver 90 | self.model = model 91 | self.max_workers = max_workers 92 | self._database = database 93 | self._openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY")) 94 | # Test for APOC 95 | try: 96 | self.query("CALL apoc.help('test')") 97 | except: 98 | raise ValueError("You need to install and allow APOC functions") 99 | # Test for GDS 100 | try: 101 | self.query("CALL gds.list('test')") 102 | except: 103 | raise ValueError("You need to install and allow GDS functions") 104 | if create_constraints: 105 | self.query( 106 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Chunk__) REQUIRE e.id IS UNIQUE;" 107 | ) 108 | self.query( 109 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Entity__) REQUIRE e.name IS UNIQUE;" 110 | ) 111 | self.query( 112 | "CREATE CONSTRAINT IF NOT EXISTS FOR (e:__Community__) REQUIRE e.id IS UNIQUE;" 113 | ) 114 | 115 | async def extract_nodes_and_rels( 116 | self, input_texts: list, allowed_entities: list 117 | ) -> str: 118 | """ 119 | Extract nodes and relationships from input texts using LLM and store them in Neo4j. 120 | 121 | Args: 122 | input_texts (list): List of text documents to process and extract entities from 123 | allowed_entities (list): List of entity types to extract from the texts 124 | 125 | Returns: 126 | str: Success message with count of extracted relationships 127 | 128 | Notes: 129 | - Uses parallel processing with tqdm progress tracking 130 | - Extracted entities and relationships are stored directly in Neo4j 131 | - Each text document is processed independently by the LLM 132 | """ 133 | 134 | async def process_text(input_text): 135 | prompt = GRAPH_EXTRACTION_PROMPT.format( 136 | entity_types=allowed_entities, 137 | input_text=input_text, 138 | tuple_delimiter=";", 139 | record_delimiter="|", 140 | completion_delimiter="\n\n", 141 | ) 142 | messages = [ 143 | {"role": "user", "content": prompt}, 144 | ] 145 | # Make the LLM call 146 | output = await self.achat(messages, model=self.model) 147 | # Construct JSON from output 148 | return parse_extraction_output(output.content) 149 | 150 | # Create tasks for all input texts 151 | tasks = [process_text(text) for text in input_texts] 152 | 153 | # Process tasks with tqdm progress bar 154 | # Use semaphore to limit concurrent tasks if max_workers is specified 155 | if self.max_workers: 156 | semaphore = asyncio.Semaphore(self.max_workers) 157 | 158 | async def process_with_semaphore(task): 159 | async with semaphore: 160 | return await task 161 | 162 | results = [] 163 | for task in tqdm.as_completed( 164 | [process_with_semaphore(task) for task in tasks], 165 | total=len(tasks), 166 | desc="Extracting nodes & relationships", 167 | ): 168 | results.append(await task) 169 | else: 170 | results = [] 171 | for task in tqdm.as_completed( 172 | tasks, total=len(tasks), desc="Extracting nodes & relationships" 173 | ): 174 | results.append(await task) 175 | 176 | total_relationships = 0 177 | # Import nodes and relationships 178 | 
for text, output in zip(input_texts, results): 179 | nodes, relationships = output 180 | total_relationships += len(relationships) 181 | # Import nodes 182 | self.query( 183 | import_nodes_query, 184 | params={"text": text, "chunk_id": get_hash(text), "data": nodes}, 185 | ) 186 | # Import relationships 187 | self.query(import_relationships_query, params={"data": relationships}) 188 | 189 | return f"Successfuly extracted and imported {total_relationships} relationships" 190 | 191 | async def summarize_nodes_and_rels(self) -> str: 192 | """ 193 | Generate summaries for all nodes and relationships in the graph. 194 | 195 | Returns: 196 | str: Success message indicating completion of summarization 197 | 198 | Notes: 199 | - Retrieves candidate nodes and relationships from Neo4j 200 | - Uses LLM to generate concise summaries for each entity and relationship 201 | - Stores summarized properties in the graph 202 | """ 203 | # Summarize nodes 204 | nodes = self.query(candidate_nodes_summarization) 205 | 206 | async def process_node(node): 207 | messages = [ 208 | { 209 | "role": "user", 210 | "content": SUMMARIZE_PROMPT.format( 211 | entity_name=node["entity_name"], 212 | description_list=node["description_list"], 213 | ), 214 | }, 215 | ] 216 | summary = await self.achat(messages, model=self.model) 217 | return {"entity": node["entity_name"], "summary": summary.content} 218 | 219 | # Create a progress bar for node processing with max_workers limit 220 | if self.max_workers: 221 | semaphore = asyncio.Semaphore(self.max_workers) 222 | 223 | async def process_with_semaphore(node): 224 | async with semaphore: 225 | return await process_node(node) 226 | 227 | summaries = await tqdm_asyncio.gather( 228 | *[process_with_semaphore(node) for node in nodes], 229 | desc="Summarizing nodes", 230 | ) 231 | else: 232 | summaries = await tqdm_asyncio.gather( 233 | *[process_node(node) for node in nodes], desc="Summarizing nodes" 234 | ) 235 | 236 | # Summarize relationships 237 | rels = self.query(candidate_rels_summarization) 238 | 239 | async def process_rel(rel): 240 | entity_name = f"{rel['source']} relationship to {rel['target']}" 241 | messages = [ 242 | { 243 | "role": "user", 244 | "content": SUMMARIZE_PROMPT.format( 245 | entity_name=entity_name, 246 | description_list=rel["description_list"], 247 | ), 248 | }, 249 | ] 250 | summary = await self.achat(messages, model=self.model) 251 | return { 252 | "source": rel["source"], 253 | "target": rel["target"], 254 | "summary": summary.content, 255 | } 256 | 257 | # Create a progress bar for relationship processing with max_workers limit 258 | if self.max_workers: 259 | semaphore = asyncio.Semaphore(self.max_workers) 260 | 261 | async def process_rel_with_semaphore(rel): 262 | async with semaphore: 263 | return await process_rel(rel) 264 | 265 | rel_summaries = await tqdm_asyncio.gather( 266 | *[process_rel_with_semaphore(rel) for rel in rels], 267 | desc="Summarizing relationships", 268 | ) 269 | else: 270 | rel_summaries = await tqdm_asyncio.gather( 271 | *[process_rel(rel) for rel in rels], desc="Summarizing relationships" 272 | ) 273 | 274 | # Import nodes 275 | self.query(import_entity_summary, params={"data": summaries}) 276 | self.query(import_entity_summary_single) 277 | 278 | # Import relationships 279 | self.query(import_rel_summary, params={"data": rel_summaries}) 280 | self.query(import_rel_summary_single) 281 | 282 | return "Successfuly summarized nodes and relationships" 283 | 284 | async def summarize_communities(self, summarize_all_levels: 
bool = False) -> str: 285 | """ 286 | Detect and summarize communities within the graph using the Leiden algorithm. 287 | 288 | Args: 289 | summarize_all_levels (bool, optional): Whether to summarize all community levels 290 | or just the final level. Defaults to False. 291 | 292 | Returns: 293 | str: Success message with count of generated community summaries 294 | 295 | Notes: 296 | - Uses Neo4j GDS library to run Leiden community detection algorithm 297 | - Generates hierarchical community structures in the graph 298 | - Uses LLM to create descriptive summaries of each community 299 | - The community summaries include key entities, relationships, and themes 300 | """ 301 | # Calculate communities 302 | self.query(drop_gds_graph_query) 303 | self.query(create_gds_graph_query) 304 | community_summary = self.query(leiden_query) 305 | community_levels = community_summary[0]["ranLevels"] 306 | print( 307 | f"Leiden algorithm identified {community_levels} community levels " 308 | f"with {community_summary[0]['communityCount']} communities on the last level." 309 | ) 310 | self.query(community_hierarchy_query) 311 | 312 | # Community summarization 313 | if summarize_all_levels: 314 | levels = list(range(community_levels)) 315 | else: 316 | levels = [community_levels - 1] 317 | communities = self.query(community_info_query, params={"levels": levels}) 318 | 319 | # Define async function for processing a single community 320 | async def process_community(community): 321 | input_text = f"""Entities: 322 | {community['nodes']} 323 | 324 | Relationships: 325 | {community['rels']}""" 326 | 327 | messages = [ 328 | { 329 | "role": "user", 330 | "content": COMMUNITY_REPORT_PROMPT.format(input_text=input_text), 331 | }, 332 | ] 333 | summary = await self.achat(messages, model=self.model) 334 | return { 335 | "community": extract_json(summary.content), 336 | "communityId": community["communityId"], 337 | } 338 | 339 | # Process all communities concurrently with tqdm progress bar and max_workers limit 340 | if self.max_workers: 341 | semaphore = asyncio.Semaphore(self.max_workers) 342 | 343 | async def process_community_with_semaphore(community): 344 | async with semaphore: 345 | return await process_community(community) 346 | 347 | community_summary = await tqdm_asyncio.gather( 348 | *( 349 | process_community_with_semaphore(community) 350 | for community in communities 351 | ), 352 | desc="Summarizing communities", 353 | total=len(communities), 354 | ) 355 | else: 356 | community_summary = await tqdm_asyncio.gather( 357 | *(process_community(community) for community in communities), 358 | desc="Summarizing communities", 359 | total=len(communities), 360 | ) 361 | 362 | self.query(import_community_summary, params={"data": community_summary}) 363 | return f"Generated {len(community_summary)} community summaries" 364 | 365 | def _check_driver_state(self) -> None: 366 | """ 367 | Check if the Neo4j driver is still available. 368 | 369 | Raises: 370 | RuntimeError: If the Neo4j driver has been closed. 371 | """ 372 | if not hasattr(self, "_driver"): 373 | raise RuntimeError( 374 | "This MsGraphRAG instance has been closed, and cannot be used anymore." 375 | ) 376 | 377 | def query( 378 | self, 379 | query: str, 380 | params: dict = {}, 381 | session_params: dict = {}, 382 | ) -> List[Dict[str, Any]]: 383 | """Query Neo4j database. 384 | 385 | Args: 386 | query (str): The Cypher query to execute. 387 | params (dict): The parameters to pass to the query. 
388 | session_params (dict): Parameters to pass to the session used for executing 389 | the query. 390 | 391 | Returns: 392 | List[Dict[str, Any]]: The list of dictionaries containing the query results. 393 | 394 | Raises: 395 | RuntimeError: If the connection has been closed. 396 | """ 397 | self._check_driver_state() 398 | from neo4j import Query 399 | from neo4j.exceptions import Neo4jError 400 | 401 | if not session_params: 402 | try: 403 | data, _, _ = self._driver.execute_query( 404 | Query(text=query), 405 | database_=self._database, 406 | parameters_=params, 407 | ) 408 | return [r.data() for r in data] 409 | except Neo4jError as e: 410 | if not ( 411 | ( 412 | ( # isCallInTransactionError 413 | e.code == "Neo.DatabaseError.Statement.ExecutionFailed" 414 | or e.code 415 | == "Neo.DatabaseError.Transaction.TransactionStartFailed" 416 | ) 417 | and e.message is not None 418 | and "in an implicit transaction" in e.message 419 | ) 420 | or ( # isPeriodicCommitError 421 | e.code == "Neo.ClientError.Statement.SemanticError" 422 | and e.message is not None 423 | and ( 424 | "in an open transaction is not possible" in e.message 425 | or "tried to execute in an explicit transaction" 426 | in e.message 427 | ) 428 | ) 429 | ): 430 | raise 431 | # fallback to allow implicit transactions 432 | session_params.setdefault("database", self._database) 433 | with self._driver.session(**session_params) as session: 434 | result = session.run(Query(text=query, timeout=self.timeout), params) 435 | return [r.data() for r in result] 436 | 437 | async def achat(self, messages, model="gpt-4o", temperature=0, config={}): 438 | response = await self._openai_client.chat.completions.create( 439 | model=model, 440 | temperature=temperature, 441 | messages=messages, 442 | **config, 443 | ) 444 | return response.choices[0].message 445 | 446 | def close(self) -> None: 447 | """ 448 | Explicitly close the Neo4j driver connection. 449 | 450 | Delegates connection management to the Neo4j driver. 451 | """ 452 | if hasattr(self, "_driver"): 453 | self._driver.close() 454 | # Remove the driver attribute to indicate closure 455 | delattr(self, "_driver") 456 | 457 | def __enter__(self) -> "MsGraphRAG": 458 | """ 459 | Enter the runtime context for the Neo4j graph connection. 460 | 461 | Enables use of the graph connection with the 'with' statement. 462 | This method allows for automatic resource management and ensures 463 | that the connection is properly handled. 464 | 465 | Returns: 466 | MsGraphRAG: The current graph connection instance 467 | 468 | Example: 469 | with MsGraphRAG(...) as graph: 470 | graph.query(...) # Connection automatically managed 471 | """ 472 | return self 473 | 474 | def __exit__( 475 | self, 476 | exc_type: Optional[Type[BaseException]], 477 | exc_val: Optional[BaseException], 478 | exc_tb: Optional[Any], 479 | ) -> None: 480 | """ 481 | Exit the runtime context for the Neo4j graph connection. 482 | 483 | This method is automatically called when exiting a 'with' statement. 484 | It ensures that the database connection is closed, regardless of 485 | whether an exception occurred during the context's execution. 
486 | 487 | Args: 488 | exc_type: The type of exception that caused the context to exit 489 | (None if no exception occurred) 490 | exc_val: The exception instance that caused the context to exit 491 | (None if no exception occurred) 492 | exc_tb: The traceback for the exception (None if no exception occurred) 493 | 494 | Note: 495 | Any exception is re-raised after the connection is closed. 496 | """ 497 | self.close() 498 | 499 | def __del__(self) -> None: 500 | """ 501 | Destructor for the Neo4j graph connection. 502 | 503 | This method is called during garbage collection to ensure that 504 | database resources are released if not explicitly closed. 505 | 506 | Caution: 507 | - Do not rely on this method for deterministic resource cleanup 508 | - Always prefer explicit .close() or context manager 509 | 510 | Best practices: 511 | 1. Use context manager: 512 | with MsGraphRAG(...) as graph: 513 | ... 514 | 2. Explicitly close: 515 | graph = MsGraphRAG(...) 516 | try: 517 | ... 518 | finally: 519 | graph.close() 520 | """ 521 | try: 522 | self.close() 523 | except Exception: 524 | # Suppress any exceptions during garbage collection 525 | pass 526 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/prompts.py: -------------------------------------------------------------------------------- 1 | GRAPH_EXTRACTION_PROMPT = """-Goal- 2 | Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities. 3 | 4 | -Steps- 5 | 1. Identify all entities. For each identified entity, extract the following information: 6 | - entity_name: Name of the entity, capitalized 7 | - entity_type: One of the following types: [{entity_types}] 8 | - entity_description: Comprehensive description of the entity's attributes and activities 9 | Format each entity as ("entity"{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}) 10 | 11 | 2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other. 12 | For each pair of related entities, extract the following information: 13 | - source_entity: name of the source entity, as identified in step 1 14 | - target_entity: name of the target entity, as identified in step 1 15 | - relationship_description: explanation as to why you think the source entity and the target entity are related to each other 16 | - relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity 17 | Format each relationship as ("relationship"{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}{tuple_delimiter}) 18 | 19 | 3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter. 20 | 21 | 4. When finished, output {completion_delimiter} 22 | 23 | ###################### 24 | -Examples- 25 | ###################### 26 | Example 1: 27 | Entity_types: ORGANIZATION,PERSON 28 | Text: 29 | The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%. 
30 | ###################### 31 | Output: 32 | ("entity"{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday) 33 | {record_delimiter} 34 | ("entity"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}PERSON{tuple_delimiter}Martin Smith is the chair of the Central Institution) 35 | {record_delimiter} 36 | ("entity"{tuple_delimiter}MARKET STRATEGY COMMITTEE{tuple_delimiter}ORGANIZATION{tuple_delimiter}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply) 37 | {record_delimiter} 38 | ("relationship"{tuple_delimiter}MARTIN SMITH{tuple_delimiter}CENTRAL INSTITUTION{tuple_delimiter}Martin Smith is the Chair of the Central Institution and will answer questions at a press conference{tuple_delimiter}9) 39 | {completion_delimiter} 40 | 41 | ###################### 42 | Example 2: 43 | Entity_types: ORGANIZATION 44 | Text: 45 | TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform. 46 | 47 | TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones. 48 | ###################### 49 | Output: 50 | ("entity"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}ORGANIZATION{tuple_delimiter}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones) 51 | {record_delimiter} 52 | ("entity"{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}ORGANIZATION{tuple_delimiter}Vision Holdings is a firm that previously owned TechGlobal) 53 | {record_delimiter} 54 | ("relationship"{tuple_delimiter}TECHGLOBAL{tuple_delimiter}VISION HOLDINGS{tuple_delimiter}Vision Holdings formerly owned TechGlobal from 2014 until present{tuple_delimiter}5) 55 | {completion_delimiter} 56 | 57 | ###################### 58 | Example 3: 59 | Entity_types: ORGANIZATION,GEO,PERSON 60 | Text: 61 | Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia. 62 | 63 | The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara. 64 | 65 | The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara. 66 | 67 | They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion. 68 | 69 | The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality. 
70 | ###################### 71 | Output: 72 | ("entity"{tuple_delimiter}FIRUZABAD{tuple_delimiter}GEO{tuple_delimiter}Firuzabad held Aurelians as hostages) 73 | {record_delimiter} 74 | ("entity"{tuple_delimiter}AURELIA{tuple_delimiter}GEO{tuple_delimiter}Country seeking to release hostages) 75 | {record_delimiter} 76 | ("entity"{tuple_delimiter}QUINTARA{tuple_delimiter}GEO{tuple_delimiter}Country that negotiated a swap of money in exchange for hostages) 77 | {record_delimiter} 78 | {record_delimiter} 79 | ("entity"{tuple_delimiter}TIRUZIA{tuple_delimiter}GEO{tuple_delimiter}Capital of Firuzabad where the Aurelians were being held) 80 | {record_delimiter} 81 | ("entity"{tuple_delimiter}KROHAARA{tuple_delimiter}GEO{tuple_delimiter}Capital city in Quintara) 82 | {record_delimiter} 83 | ("entity"{tuple_delimiter}CASHION{tuple_delimiter}GEO{tuple_delimiter}Capital city in Aurelia) 84 | {record_delimiter} 85 | ("entity"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}PERSON{tuple_delimiter}Aurelian who spent time in Tiruzia's Alhamia Prison) 86 | {record_delimiter} 87 | ("entity"{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}GEO{tuple_delimiter}Prison in Tiruzia) 88 | {record_delimiter} 89 | ("entity"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}PERSON{tuple_delimiter}Aurelian journalist who was held hostage) 90 | {record_delimiter} 91 | ("entity"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}PERSON{tuple_delimiter}Bratinas national and environmentalist who was held hostage) 92 | {record_delimiter} 93 | ("relationship"{tuple_delimiter}FIRUZABAD{tuple_delimiter}AURELIA{tuple_delimiter}Firuzabad negotiated a hostage exchange with Aurelia{tuple_delimiter}2) 94 | {record_delimiter} 95 | ("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}AURELIA{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2) 96 | {record_delimiter} 97 | ("relationship"{tuple_delimiter}QUINTARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Quintara brokered the hostage exchange between Firuzabad and Aurelia{tuple_delimiter}2) 98 | {record_delimiter} 99 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}ALHAMIA PRISON{tuple_delimiter}Samuel Namara was a prisoner at Alhamia prison{tuple_delimiter}8) 100 | {record_delimiter} 101 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}Samuel Namara and Meggie Tazbah were exchanged in the same hostage release{tuple_delimiter}2) 102 | {record_delimiter} 103 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Samuel Namara and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2) 104 | {record_delimiter} 105 | ("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}Meggie Tazbah and Durke Bataglani were exchanged in the same hostage release{tuple_delimiter}2) 106 | {record_delimiter} 107 | ("relationship"{tuple_delimiter}SAMUEL NAMARA{tuple_delimiter}FIRUZABAD{tuple_delimiter}Samuel Namara was a hostage in Firuzabad{tuple_delimiter}2) 108 | {record_delimiter} 109 | ("relationship"{tuple_delimiter}MEGGIE TAZBAH{tuple_delimiter}FIRUZABAD{tuple_delimiter}Meggie Tazbah was a hostage in Firuzabad{tuple_delimiter}2) 110 | {record_delimiter} 111 | ("relationship"{tuple_delimiter}DURKE BATAGLANI{tuple_delimiter}FIRUZABAD{tuple_delimiter}Durke Bataglani was a hostage in Firuzabad{tuple_delimiter}2) 112 | {completion_delimiter} 113 | 114 | ###################### 115 | -Real Data- 116 | 
###################### 117 | Entity_types: {entity_types} 118 | Text: {input_text} 119 | ###################### 120 | Output:""" 121 | 122 | SUMMARIZE_PROMPT = """ 123 | You are a helpful assistant responsible for generating a comprehensive summary of the data provided below. 124 | Given one or two entities, and a list of descriptions, all related to the same entity or group of entities. 125 | Please concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions. 126 | If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary. 127 | Make sure it is written in third person, and include the entity names so we have the full context. 128 | 129 | ####### 130 | -Data- 131 | Entities: {entity_name} 132 | Description List: {description_list} 133 | ####### 134 | Output: 135 | """ 136 | 137 | COMMUNITY_REPORT_PROMPT = """ 138 | You are an AI assistant that helps a human analyst to perform general information discovery. Information discovery is the process of identifying and assessing relevant information associated with certain entities (e.g., organizations and individuals) within a network. 139 | 140 | # Goal 141 | Write a comprehensive report of a community, given a list of entities that belong to the community as well as their relationships and optional associated claims. The report will be used to inform decision-makers about information associated with the community and their potential impact. The content of this report includes an overview of the community's key entities, their legal compliance, technical capabilities, reputation, and noteworthy claims. 142 | 143 | # Report Structure 144 | 145 | The report should include the following sections: 146 | 147 | - TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title. 148 | - SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities. 149 | - IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community. 150 | - RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating. 151 | - DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive. 152 | 153 | Return output as a well-formed JSON-formatted string with the following format: 154 | {{ 155 | "title": <report_title>, 156 | "summary": <executive_summary>, 157 | "rating": <impact_severity_rating>, 158 | "rating_explanation": <rating_explanation>, 159 | "findings": [ 160 | {{ 161 | "summary": <insight_1_summary>, 162 | "explanation": <insight_1_explanation> 163 | }}, 164 | {{ 165 | "summary": <insight_2_summary>, 166 | "explanation": <insight_2_explanation> 167 | }} 168 | ] 169 | }} 170 | 171 | # Grounding Rules 172 | 173 | Points supported by data should list their data references as follows: 174 | 175 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 176 | 177 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more.
178 | 179 | For example: 180 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]." 181 | 182 | where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record. 183 | 184 | Do not include information where the supporting evidence for it is not provided. 185 | 186 | 187 | # Example Input 188 | ----------- 189 | Text: 190 | 191 | Entities 192 | 193 | id,entity,description 194 | 5,VERDANT OASIS PLAZA,Verdant Oasis Plaza is the location of the Unity March 195 | 6,HARMONY ASSEMBLY,Harmony Assembly is an organization that is holding a march at Verdant Oasis Plaza 196 | 197 | Relationships 198 | 199 | id,source,target,description 200 | 37,VERDANT OASIS PLAZA,UNITY MARCH,Verdant Oasis Plaza is the location of the Unity March 201 | 38,VERDANT OASIS PLAZA,HARMONY ASSEMBLY,Harmony Assembly is holding a march at Verdant Oasis Plaza 202 | 39,VERDANT OASIS PLAZA,UNITY MARCH,The Unity March is taking place at Verdant Oasis Plaza 203 | 40,VERDANT OASIS PLAZA,TRIBUNE SPOTLIGHT,Tribune Spotlight is reporting on the Unity march taking place at Verdant Oasis Plaza 204 | 41,VERDANT OASIS PLAZA,BAILEY ASADI,Bailey Asadi is speaking at Verdant Oasis Plaza about the march 205 | 43,HARMONY ASSEMBLY,UNITY MARCH,Harmony Assembly is organizing the Unity March 206 | 207 | Output: 208 | {{ 209 | "title": "Verdant Oasis Plaza and Unity March", 210 | "summary": "The community revolves around the Verdant Oasis Plaza, which is the location of the Unity March. The plaza has relationships with the Harmony Assembly, Unity March, and Tribune Spotlight, all of which are associated with the march event.", 211 | "rating": 5.0, 212 | "rating_explanation": "The impact severity rating is moderate due to the potential for unrest or conflict during the Unity March.", 213 | "findings": [ 214 | {{ 215 | "summary": "Verdant Oasis Plaza as the central location", 216 | "explanation": "Verdant Oasis Plaza is the central entity in this community, serving as the location for the Unity March. This plaza is the common link between all other entities, suggesting its significance in the community. The plaza's association with the march could potentially lead to issues such as public disorder or conflict, depending on the nature of the march and the reactions it provokes. [Data: Entities (5), Relationships (37, 38, 39, 40, 41,+more)]" 217 | }}, 218 | {{ 219 | "summary": "Harmony Assembly's role in the community", 220 | "explanation": "Harmony Assembly is another key entity in this community, being the organizer of the march at Verdant Oasis Plaza. The nature of Harmony Assembly and its march could be a potential source of threat, depending on their objectives and the reactions they provoke. The relationship between Harmony Assembly and the plaza is crucial in understanding the dynamics of this community. [Data: Entities(6), Relationships (38, 43)]" 221 | }}, 222 | {{ 223 | "summary": "Unity March as a significant event", 224 | "explanation": "The Unity March is a significant event taking place at Verdant Oasis Plaza. This event is a key factor in the community's dynamics and could be a potential source of threat, depending on the nature of the march and the reactions it provokes. The relationship between the march and the plaza is crucial in understanding the dynamics of this community. 
[Data: Relationships (39)]" 225 | }}, 226 | {{ 227 | "summary": "Role of Tribune Spotlight", 228 | "explanation": "Tribune Spotlight is reporting on the Unity March taking place in Verdant Oasis Plaza. This suggests that the event has attracted media attention, which could amplify its impact on the community. The role of Tribune Spotlight could be significant in shaping public perception of the event and the entities involved. [Data: Relationships (40)]" 229 | }} 230 | ] 231 | }} 232 | 233 | 234 | # Real Data 235 | 236 | Use the following text for your answer. Do not make anything up in your answer. 237 | 238 | Text: 239 | {input_text} 240 | 241 | The report should include the following sections: 242 | 243 | - TITLE: community's name that represents its key entities - title should be short but specific. When possible, include representative named entities in the title. 244 | - SUMMARY: An executive summary of the community's overall structure, how its entities are related to each other, and significant information associated with its entities. 245 | - IMPACT SEVERITY RATING: a float score between 0-10 that represents the severity of IMPACT posed by entities within the community. IMPACT is the scored importance of a community. 246 | - RATING EXPLANATION: Give a single sentence explanation of the IMPACT severity rating. 247 | - DETAILED FINDINGS: A list of 5-10 key insights about the community. Each insight should have a short summary followed by multiple paragraphs of explanatory text grounded according to the grounding rules below. Be comprehensive. 248 | 249 | Return output as a well-formed JSON-formatted string with the following format: 250 | {{ 251 | "title": , 252 | "summary": , 253 | "rating": , 254 | "rating_explanation": , 255 | "findings": [ 256 | {{ 257 | "summary":, 258 | "explanation": 259 | }}, 260 | {{ 261 | "summary":, 262 | "explanation": 263 | }} 264 | ] 265 | }} 266 | 267 | # Grounding Rules 268 | 269 | Points supported by data should list their data references as follows: 270 | 271 | "This is an example sentence supported by multiple data references [Data: (record ids); (record ids)]." 272 | 273 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 274 | 275 | For example: 276 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (1), Entities (5, 7); Relationships (23); Claims (7, 2, 34, 64, 46, +more)]." 277 | 278 | where 1, 5, 7, 23, 2, 34, 46, and 64 represent the id (not the index) of the relevant data record. 279 | 280 | Do not include information where the supporting evidence for it is not provided. 281 | 282 | Output:""" 283 | 284 | MAP_SYSTEM_PROMPT = """ 285 | ---Role--- 286 | 287 | You are a helpful assistant responding to questions about data in the tables provided. 288 | 289 | 290 | ---Goal--- 291 | 292 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 293 | 294 | You should use the data provided in the data tables below as the primary context for generating the response. 295 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 296 | 297 | Each key point in the response should have the following element: 298 | - Description: A comprehensive description of the point. 
299 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 300 | 301 | The response should be JSON formatted as follows: 302 | {{ 303 | "points": [ 304 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 305 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 306 | ] 307 | }} 308 | 309 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 310 | 311 | Points supported by data should list the relevant reports as references as follows: 312 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 313 | 314 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 315 | 316 | For example: 317 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 318 | 319 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 320 | 321 | Do not include information where the supporting evidence for it is not provided. 322 | 323 | 324 | ---Data tables--- 325 | 326 | {context_data} 327 | 328 | ---Goal--- 329 | 330 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 331 | 332 | You should use the data provided in the data tables below as the primary context for generating the response. 333 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 334 | 335 | Each key point in the response should have the following element: 336 | - Description: A comprehensive description of the point. 337 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 338 | 339 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 340 | 341 | Points supported by data should list the relevant reports as references as follows: 342 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 343 | 344 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 345 | 346 | For example: 347 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 348 | 349 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 350 | 351 | Do not include information where the supporting evidence for it is not provided. 
352 | 353 | The response should be JSON formatted as follows: 354 | {{ 355 | "points": [ 356 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 357 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 358 | ] 359 | }} 360 | """ 361 | 362 | REDUCE_SYSTEM_PROMPT = """ 363 | ---Role--- 364 | 365 | You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts. 366 | 367 | 368 | ---Goal--- 369 | 370 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 371 | 372 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 373 | 374 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 375 | 376 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 377 | 378 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 379 | 380 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 381 | 382 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 383 | 384 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 385 | 386 | For example: 387 | 388 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 389 | 390 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 391 | 392 | Do not include information where the supporting evidence for it is not provided. 393 | 394 | 395 | ---Target response length and format--- 396 | 397 | {response_type} 398 | 399 | 400 | ---Analyst Reports--- 401 | 402 | {report_data} 403 | 404 | 405 | ---Goal--- 406 | 407 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 408 | 409 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 410 | 411 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 412 | 413 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 414 | 415 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 
416 | 417 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 418 | 419 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 420 | 421 | For example: 422 | 423 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 424 | 425 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 426 | 427 | Do not include information where the supporting evidence for it is not provided. 428 | 429 | 430 | ---Target response length and format--- 431 | 432 | {response_type} 433 | 434 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 435 | """ 436 | 437 | LOCAL_SEARCH_SYSTEM_PROMPT = """ 438 | ---Role--- 439 | 440 | You are a helpful assistant responding to questions about data in the tables provided. 441 | 442 | 443 | ---Goal--- 444 | 445 | Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. 446 | 447 | If you don't know the answer, just say so. Do not make anything up. 448 | 449 | Points supported by data should list their data references as follows: 450 | 451 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 452 | 453 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 454 | 455 | For example: 456 | 457 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]." 458 | 459 | where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 460 | 461 | Do not include information where the supporting evidence for it is not provided. 462 | 463 | 464 | ---Target response length and format--- 465 | 466 | {response_type} 467 | 468 | 469 | ---Data tables--- 470 | 471 | {context_data} 472 | 473 | 474 | ---Goal--- 475 | 476 | Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. 477 | 478 | If you don't know the answer, just say so. Do not make anything up. 479 | 480 | Points supported by data should list their data references as follows: 481 | 482 | "This is an example sentence supported by multiple data references [Data: <dataset name> (record ids); <dataset name> (record ids)]." 483 | 484 | Do not list more than 5 record ids in a single reference. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 485 | 486 | For example: 487 | 488 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Sources (15, 16), Reports (1), Entities (5, 7); Relationships (23); Claims (2, 7, 34, 46, 64, +more)]."
489 | 490 | where 15, 16, 1, 5, 7, 23, 2, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 491 | 492 | Do not include information where the supporting evidence for it is not provided. 493 | 494 | 495 | ---Target response length and format--- 496 | 497 | {response_type} 498 | 499 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 500 | """ 501 | 502 | MAP_SYSTEM_PROMPT = """ 503 | ---Role--- 504 | 505 | You are a helpful assistant responding to questions about data in the tables provided. 506 | 507 | 508 | ---Goal--- 509 | 510 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 511 | 512 | You should use the data provided in the data tables below as the primary context for generating the response. 513 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 514 | 515 | Each key point in the response should have the following element: 516 | - Description: A comprehensive description of the point. 517 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 518 | 519 | The response should be JSON formatted as follows: 520 | {{ 521 | "points": [ 522 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 523 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 524 | ] 525 | }} 526 | 527 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 528 | 529 | Points supported by data should list the relevant reports as references as follows: 530 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 531 | 532 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 533 | 534 | For example: 535 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 536 | 537 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 538 | 539 | Do not include information where the supporting evidence for it is not provided. 540 | 541 | 542 | ---Data tables--- 543 | 544 | {context_data} 545 | 546 | ---Goal--- 547 | 548 | Generate a response consisting of a list of key points that responds to the user's question, summarizing all relevant information in the input data tables. 549 | 550 | You should use the data provided in the data tables below as the primary context for generating the response. 551 | If you don't know the answer or if the input data tables do not contain sufficient information to provide an answer, just say so. Do not make anything up. 552 | 553 | Each key point in the response should have the following element: 554 | - Description: A comprehensive description of the point. 555 | - Importance Score: An integer score between 0-100 that indicates how important the point is in answering the user's question. An 'I don't know' type of response should have a score of 0. 
556 | 557 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 558 | 559 | Points supported by data should list the relevant reports as references as follows: 560 | "This is an example sentence supported by data references [Data: Reports (report ids)]" 561 | 562 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 563 | 564 | For example: 565 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 64, 46, 34, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 566 | 567 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data report in the provided tables. 568 | 569 | Do not include information where the supporting evidence for it is not provided. 570 | 571 | The response should be JSON formatted as follows: 572 | {{ 573 | "points": [ 574 | {{"description": "Description of point 1 [Data: Reports (report ids)]", "score": score_value}}, 575 | {{"description": "Description of point 2 [Data: Reports (report ids)]", "score": score_value}} 576 | ] 577 | }} 578 | """ 579 | 580 | REDUCE_SYSTEM_PROMPT = """ 581 | ---Role--- 582 | 583 | You are a helpful assistant responding to questions about a dataset by synthesizing perspectives from multiple analysts. 584 | 585 | 586 | ---Goal--- 587 | 588 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 589 | 590 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 591 | 592 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 593 | 594 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 595 | 596 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 597 | 598 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 599 | 600 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 601 | 602 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 603 | 604 | For example: 605 | 606 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 607 | 608 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 609 | 610 | Do not include information where the supporting evidence for it is not provided. 
611 | 612 | 613 | ---Target response length and format--- 614 | 615 | {response_type} 616 | 617 | 618 | ---Analyst Reports--- 619 | 620 | {report_data} 621 | 622 | 623 | ---Goal--- 624 | 625 | Generate a response of the target length and format that responds to the user's question, summarize all the reports from multiple analysts who focused on different parts of the dataset. 626 | 627 | Note that the analysts' reports provided below are ranked in the **descending order of importance**. 628 | 629 | If you don't know the answer or if the provided reports do not contain sufficient information to provide an answer, just say so. Do not make anything up. 630 | 631 | The final response should remove all irrelevant information from the analysts' reports and merge the cleaned information into a comprehensive answer that provides explanations of all the key points and implications appropriate for the response length and format. 632 | 633 | The response shall preserve the original meaning and use of modal verbs such as "shall", "may" or "will". 634 | 635 | The response should also preserve all the data references previously included in the analysts' reports, but do not mention the roles of multiple analysts in the analysis process. 636 | 637 | **Do not list more than 5 record ids in a single reference**. Instead, list the top 5 most relevant record ids and add "+more" to indicate that there are more. 638 | 639 | For example: 640 | 641 | "Person X is the owner of Company Y and subject to many allegations of wrongdoing [Data: Reports (2, 7, 34, 46, 64, +more)]. He is also CEO of company X [Data: Reports (1, 3)]" 642 | 643 | where 1, 2, 3, 7, 34, 46, and 64 represent the id (not the index) of the relevant data record. 644 | 645 | Do not include information where the supporting evidence for it is not provided. 646 | 647 | 648 | ---Target response length and format--- 649 | 650 | {response_type} 651 | 652 | Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. 653 | """ 654 | -------------------------------------------------------------------------------- /src/ms_graphrag_neo4j/utils.py: -------------------------------------------------------------------------------- 1 | import hashlib 2 | import json 3 | 4 | 5 | def get_hash(text): 6 | # Convert text to bytes if it's a string 7 | if isinstance(text, str): 8 | text = text.encode("utf-8") 9 | 10 | # MD5 is one of the fastest algorithms 11 | return hashlib.md5(text).hexdigest() 12 | 13 | 14 | def parse_extraction_output(output_str, record_delimiter=None, tuple_delimiter=None): 15 | """ 16 | Parse a structured output string containing "entity" and "relationship" records into lists of entity and relationship dictionaries. 17 | 18 | The expected format for each record is: 19 | 20 | ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>) 21 | 22 | or 23 | 24 | ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>) 25 | 26 | Records are separated by a record delimiter. The output string may end with a completion marker 27 | (for example, "{completion_delimiter}") which will be removed. 28 | 29 | If not provided, this function attempts to auto-detect: 30 | - record_delimiter: looks for "{record_delimiter}" then "|" then falls back to newlines. 31 | - tuple_delimiter: looks for "{tuple_delimiter}" then ";" then falls back to a tab. 32 | 33 | Parameters: 34 | output_str (str): The complete string output. 35 | record_delimiter (str, optional): The delimiter that separates records.
36 | tuple_delimiter (str, optional): The delimiter that separates fields within a record. 37 | 38 | Returns: 39 | Tuple[List[dict], List[dict]]: Two lists: the first contains entity dictionaries, the second contains relationship dictionaries. 40 | 41 | For an "entity", the dictionary has the keys: 42 | - record_type (always "entity") 43 | - entity_name 44 | - entity_type 45 | - entity_description 46 | 47 | For a "relationship", the dictionary has the keys: 48 | - record_type (always "relationship") 49 | - source_entity 50 | - target_entity 51 | - relationship_description 52 | - relationship_strength (as an int or float) 53 | """ 54 | # Remove the completion delimiter if present. 55 | completion_marker = "{completion_delimiter}" 56 | if completion_marker in output_str: 57 | output_str = output_str.replace(completion_marker, "") 58 | output_str = output_str.strip() 59 | 60 | # Determine the record delimiter if not provided. 61 | if record_delimiter is None: 62 | if "{record_delimiter}" in output_str: 63 | record_delimiter = "{record_delimiter}" 64 | elif "|" in output_str: 65 | record_delimiter = "|" 66 | else: 67 | # Fallback: split on newlines 68 | record_delimiter = "\n" 69 | 70 | # Determine the tuple delimiter if not provided. 71 | if tuple_delimiter is None: 72 | if "{tuple_delimiter}" in output_str: 73 | tuple_delimiter = "{tuple_delimiter}" 74 | elif ";" in output_str: 75 | tuple_delimiter = ";" 76 | else: 77 | tuple_delimiter = "\t" 78 | 79 | # Split the output into individual record strings. 80 | raw_records = [r.strip() for r in output_str.split(record_delimiter)] 81 | 82 | parsed_records = [] 83 | for rec in raw_records: 84 | if not rec: 85 | continue # skip empty strings 86 | 87 | # Remove leading/trailing parentheses if present. 88 | if rec.startswith("(") and rec.endswith(")"): 89 | rec = rec[1:-1] 90 | rec = rec.strip() 91 | 92 | # Split the record into tokens using the tuple delimiter. 93 | tokens = [token.strip() for token in rec.split(tuple_delimiter)] 94 | if not tokens: 95 | continue 96 | 97 | # The first token should be either "entity" or "relationship". 98 | rec_type = tokens[0].strip(" \"'").lower() 99 | 100 | if rec_type == "entity": 101 | if len(tokens) != 4: 102 | # Optionally log or raise an error for malformed records. 103 | continue 104 | record = { 105 | "record_type": "entity", 106 | "entity_name": tokens[1], 107 | "entity_type": tokens[2], 108 | "entity_description": tokens[3], 109 | } 110 | parsed_records.append(record) 111 | elif rec_type == "relationship": 112 | if len(tokens) != 5: 113 | continue 114 | # Attempt to convert relationship_strength to a number. 115 | try: 116 | strength = float(tokens[4]) 117 | # Convert to int if it has no fractional part. 118 | if strength.is_integer(): 119 | strength = int(strength) 120 | except ValueError: 121 | strength = tokens[4] 122 | record = { 123 | "record_type": "relationship", 124 | "source_entity": tokens[1], 125 | "target_entity": tokens[2], 126 | "relationship_description": tokens[3], 127 | "relationship_strength": strength, 128 | } 129 | parsed_records.append(record) 130 | else: 131 | # Unknown record type; skip it or handle accordingly.
132 | continue 133 | nodes = [el for el in parsed_records if el.get("record_type") == "entity"] 134 | relationships = [ 135 | el for el in parsed_records if el.get("record_type") == "relationship" 136 | ] 137 | return nodes, relationships 138 | 139 | 140 | def extract_json(input: str): 141 | return json.loads(input.removeprefix("```json").removesuffix("```").strip()) 142 | -------------------------------------------------------------------------------- /tests/dev.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "e57d11d7-c3b8-4c3f-87d8-2a34397952d1", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from ms_graphrag_neo4j import MsGraphRAG" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "874a900b-b1fa-4e90-91b5-fc50af8053c1", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import os\n", 21 | "\n", 22 | "os.environ[\"OPENAI_API_KEY\"]= \"sk-proj-\"\n", 23 | "os.environ[\"NEO4J_URI\"]=\"bolt://localhost:7687\"\n", 24 | "os.environ[\"NEO4J_USERNAME\"]=\"neo4j\"\n", 25 | "os.environ[\"NEO4J_PASSWORD\"]=\"password\"" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 3, 31 | "id": "5e39e816-fb7e-4aa2-9943-f50af5a516a7", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "from neo4j import GraphDatabase" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 4, 41 | "id": "0c5e397b-5ebe-4671-b64b-ddb3cf5440dc", 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "driver = GraphDatabase.driver(os.environ[\"NEO4J_URI\"], auth=(os.environ[\"NEO4J_USERNAME\"], os.environ[\"NEO4J_PASSWORD\"]))" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 5, 51 | "id": "cad48c77-34c2-4879-b19a-3a8cfceef316", 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "ms_graph = MsGraphRAG(driver=driver, model='gpt-4o')" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 6, 61 | "id": "88583a09-0465-4017-a967-1a1feee1cb8f", 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stderr", 66 | "output_type": "stream", 67 | "text": [ 68 | "Extracting nodes & relationships: 100%|██████████| 3/3 [00:31<00:00, 10.42s/it]\n" 69 | ] 70 | }, 71 | { 72 | "data": { 73 | "text/plain": [ 74 | "'Successfuly extracted and imported 2 relationships'" 75 | ] 76 | }, 77 | "execution_count": 6, 78 | "metadata": {}, 79 | "output_type": "execute_result" 80 | } 81 | ], 82 | "source": [ 83 | "await ms_graph.extract_nodes_and_rels([\"Tomaz works for Neo4j\", \"Tomaz lives in Grosuplje\", \"Tomaz went to school in Grosuplje\"], [\"Person\", \"Organization\", \"Location\"])" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 7, 89 | "id": "7a6aeb25-70a8-4132-8556-e7b7e36de199", 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "name": "stderr", 94 | "output_type": "stream", 95 | "text": [ 96 | "Summarizing nodes: 100%|██████████| 2/2 [00:00<00:00, 2.47it/s]\n", 97 | "Summarizing relationships: 0it [00:00, ?it/s]\n" 98 | ] 99 | }, 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "'Successfuly summarized nodes and relationships'" 104 | ] 105 | }, 106 | "execution_count": 7, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "await ms_graph.summarize_nodes_and_rels()" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 8, 118 | "id": 
"4769d5b0-7d27-4b91-8fee-49de179103c7", 119 | "metadata": { 120 | "scrolled": true 121 | }, 122 | "outputs": [ 123 | { 124 | "name": "stdout", 125 | "output_type": "stream", 126 | "text": [ 127 | "Leiden algorithm identified 1 community levels with 1 communities on the last level.\n" 128 | ] 129 | }, 130 | { 131 | "name": "stderr", 132 | "output_type": "stream", 133 | "text": [ 134 | "Summarizing communities: 100%|██████████| 1/1 [00:05<00:00, 5.59s/it]\n" 135 | ] 136 | }, 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "'Generated 1 community summaries'" 141 | ] 142 | }, 143 | "execution_count": 8, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "await ms_graph.summarize_communities()" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "id": "80e35fc6-85ef-4c82-9b31-add6b2592756", 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3 (ipykernel)", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.11.5" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 5 182 | } 183 | --------------------------------------------------------------------------------