├── tests ├── unit │ ├── query │ │ ├── data │ │ │ ├── hidden │ │ │ │ └── output │ │ │ │ │ ├── .hidden │ │ │ │ │ ├── .another │ │ │ │ │ └── empty.txt │ │ │ │ │ ├── 20240812-120000 │ │ │ │ │ └── empty.txt │ │ │ │ │ └── 20240812-121000 │ │ │ │ │ └── empty.txt │ │ │ ├── empty │ │ │ │ └── something-else │ │ │ │ │ └── empty.txt │ │ │ ├── defaults │ │ │ │ └── output │ │ │ │ │ ├── 20240812-120000 │ │ │ │ │ └── empty.txt │ │ │ │ │ └── 20240812-121000 │ │ │ │ │ └── empty.txt │ │ │ └── non-numeric │ │ │ │ └── output │ │ │ │ ├── 20240812-120000 │ │ │ │ └── empty.txt │ │ │ │ ├── 20240812-121000 │ │ │ │ └── empty.txt │ │ │ │ └── something-else │ │ │ │ └── empty.txt │ │ ├── __init__.py │ │ ├── input │ │ │ ├── __init__.py │ │ │ └── retrieval │ │ │ │ └── __init__.py │ │ └── context_builder │ │ │ └── __init__.py │ ├── config │ │ ├── prompt-a.txt │ │ ├── prompt-b.txt │ │ ├── prompt-c.txt │ │ ├── prompt-d.txt │ │ ├── fixtures │ │ │ ├── timestamp_dirs │ │ │ │ └── 20240812-120000 │ │ │ │ │ └── empty.txt │ │ │ ├── minimal_config │ │ │ │ └── settings.yaml │ │ │ └── minimal_config_missing_env_var │ │ │ │ └── settings.yaml │ │ └── __init__.py │ ├── indexing │ │ ├── input │ │ │ ├── data │ │ │ │ ├── multiple-txts │ │ │ │ │ ├── input2.txt │ │ │ │ │ └── input1.txt │ │ │ │ ├── one-txt │ │ │ │ │ └── input.txt │ │ │ │ ├── multiple-csvs │ │ │ │ │ ├── input3.csv │ │ │ │ │ ├── input2.csv │ │ │ │ │ └── input1.csv │ │ │ │ ├── multiple-jsons │ │ │ │ │ ├── input2.json │ │ │ │ │ └── input1.json │ │ │ │ ├── one-csv │ │ │ │ │ └── input.csv │ │ │ │ ├── one-json-one-object │ │ │ │ │ └── input.json │ │ │ │ └── one-json-multiple-objects │ │ │ │ │ └── input.json │ │ │ └── __init__.py │ │ ├── __init__.py │ │ ├── cache │ │ │ └── __init__.py │ │ ├── graph │ │ │ ├── __init__.py │ │ │ ├── utils │ │ │ │ └── __init__.py │ │ │ └── extractors │ │ │ │ ├── __init__.py │ │ │ │ └── community_reports │ │ │ │ └── __init__.py │ │ ├── verbs │ │ │ ├── __init__.py │ │ │ ├── entities │ │ │ │ ├── __init__.py │ │ │ │ └── extraction │ 
│ │ │ │ ├── __init__.py │ │ │ │ │ └── strategies │ │ │ │ │ ├── __init__.py │ │ │ │ │ └── graph_intelligence │ │ │ │ │ └── __init__.py │ │ │ └── helpers │ │ │ │ ├── __init__.py │ │ │ │ └── mock_llm.py │ │ ├── operations │ │ │ ├── __init__.py │ │ │ └── chunk_text │ │ │ │ └── __init__.py │ │ ├── text_splitting │ │ │ └── __init__.py │ │ └── test_init_content.py │ ├── __init__.py │ ├── utils │ │ ├── __init__.py │ │ ├── test_encoding.py │ │ └── test_embeddings.py │ └── litellm_services │ │ ├── __init__.py │ │ └── utils.py ├── smoke │ └── __init__.py ├── verbs │ ├── __init__.py │ ├── data │ │ ├── documents.parquet │ │ ├── entities.parquet │ │ ├── communities.parquet │ │ ├── covariates.parquet │ │ ├── text_units.parquet │ │ ├── relationships.parquet │ │ ├── community_reports.parquet │ │ ├── text_units_metadata.parquet │ │ └── text_units_metadata_included_chunk.parquet │ ├── test_prune_graph.py │ ├── test_extract_graph_nlp.py │ ├── test_create_final_text_units.py │ └── test_create_communities.py ├── notebook │ ├── __init__.py │ └── test_notebooks.py ├── integration │ ├── __init__.py │ ├── cache │ │ └── __init__.py │ ├── storage │ │ └── __init__.py │ ├── language_model │ │ └── __init__.py │ ├── logging │ │ └── __init__.py │ └── vector_stores │ │ └── __init__.py ├── fixtures │ ├── azure │ │ ├── input │ │ │ └── ABOUT.md │ │ ├── config.json │ │ └── settings.yml │ ├── text │ │ ├── input │ │ │ └── ABOUT.md │ │ └── settings.yml │ └── min-csv │ │ ├── input │ │ └── ABOUT.md │ │ └── settings.yml ├── conftest.py └── __init__.py ├── .github ├── ISSUE_TEMPLATE │ └── config.yml ├── workflows │ ├── semver.yml │ ├── spellcheck.yml │ ├── issues-autoresolve.yml │ ├── gh-pages.yml │ └── python-publish.yml ├── dependabot.yml └── pull_request_template.md ├── scripts ├── spellcheck.sh ├── start-azurite.sh └── semver-check.sh ├── docs ├── img │ ├── GraphRag-Figure1.jpg │ ├── pipeline-running.png │ ├── auto-tune-diagram.png │ ├── drift-search-diagram.png │ └── viz_guide │ │ ├── 
gephi-layout-pane.png │ │ ├── gephi-appearance-pane.png │ │ ├── gephi-initial-graph-example.png │ │ ├── gephi-layout-forceatlas2-pane.png │ │ └── gephi-network-overview-settings.png ├── data │ └── operation_dulce │ │ ├── dataset.zip │ │ └── ABOUT.md ├── examples_notebooks │ └── inputs │ │ └── operation dulce │ │ ├── documents.parquet │ │ ├── entities.parquet │ │ ├── communities.parquet │ │ ├── covariates.parquet │ │ ├── text_units.parquet │ │ ├── relationships.parquet │ │ ├── community_reports.parquet │ │ ├── ABOUT.md │ │ └── lancedb │ │ ├── default-text_unit-text.lance │ │ ├── _versions │ │ │ ├── 1.manifest │ │ │ ├── 2.manifest │ │ │ ├── 3.manifest │ │ │ └── 4.manifest │ │ ├── data │ │ │ ├── 2794bf5b-de3d-4202-ab16-e76bc27c8e6a.lance │ │ │ └── 2f74c8e8-3f35-4209-889c-a13cf0780eb3.lance │ │ └── _transactions │ │ │ ├── 0-fd0434ac-e5cd-4ddd-9dd5-e5048d4edb59.txn │ │ │ ├── 1-14bb4b1d-cc00-420b-9b14-3626f0bd8c0b.txn │ │ │ ├── 2-8e74264c-f72d-44f5-a6f4-b3b61ae6a43b.txn │ │ │ └── 3-7516fb71-9db3-4666-bdef-ea04c1eb9697.txn │ │ ├── default-entity-description.lance │ │ ├── _versions │ │ │ ├── 1.manifest │ │ │ ├── 2.manifest │ │ │ ├── 3.manifest │ │ │ └── 4.manifest │ │ ├── data │ │ │ ├── a34575c4-5260-457f-bebe-3f40bc0e2ee3.lance │ │ │ └── eabd7580-86f5-4022-8aa7-fe0aff816d98.lance │ │ └── _transactions │ │ │ ├── 0-92c031e5-7558-451e-9d0f-f5514db9616d.txn │ │ │ ├── 1-7b3cb8d8-3512-4584-a003-91838fed8911.txn │ │ │ ├── 2-7de627d2-4c57-49e9-bf73-c17a9582ead4.txn │ │ │ └── 3-9ad29d69-9a69-43a8-8b26-252ea267958d.txn │ │ └── default-community-full_content.lance │ │ ├── _versions │ │ ├── 1.manifest │ │ ├── 2.manifest │ │ ├── 3.manifest │ │ └── 4.manifest │ │ ├── data │ │ ├── 1e7b2d94-ed06-4aa0-b22e-86a71d416bc6.lance │ │ └── 1ed9f301-ce30-46a8-8c0b-9c2a60a3cf43.lance │ │ └── _transactions │ │ ├── 0-2fed1d8b-daac-41b0-a93a-e115cda75be3.txn │ │ ├── 1-61dbb7c2-aec3-4796-b223-941fc7cc93cc.txn │ │ ├── 2-60012692-a153-48f9-8f4e-c479b44cbf3f.txn │ │ └── 
3-0d2dc9a1-094f-4220-83c7-6ad6f26fac2b.txn ├── cli.md ├── config │ └── overview.md ├── scripts │ └── create_cookie_banner.js ├── query │ └── notebooks │ │ └── overview.md ├── stylesheets │ └── extra.css └── prompt_tuning │ └── overview.md ├── unified-search-app ├── images │ ├── image-1.png │ ├── image-2.png │ ├── image-3.png │ └── image-4.png ├── app │ ├── __init__.py │ ├── rag │ │ ├── __init__.py │ │ └── typing.py │ ├── ui │ │ ├── __init__.py │ │ ├── questions_list.py │ │ └── report_list.py │ ├── state │ │ ├── __init__.py │ │ └── query_variable.py │ ├── knowledge_loader │ │ ├── __init__.py │ │ └── data_sources │ │ │ ├── __init__.py │ │ │ └── default.py │ └── data_config.py ├── Dockerfile ├── .vsts-ci.yml └── pyproject.toml ├── graphrag ├── py.typed ├── __init__.py ├── cli │ └── __init__.py ├── factory │ └── __init__.py ├── config │ ├── __init__.py │ ├── models │ │ ├── __init__.py │ │ ├── umap_config.py │ │ ├── cluster_graph_config.py │ │ ├── snapshots_config.py │ │ ├── basic_search_config.py │ │ ├── reporting_config.py │ │ └── cache_config.py │ ├── read_dotenv.py │ ├── create_graphrag_config.py │ └── get_embedding_settings.py ├── tokenizer │ ├── __init__.py │ ├── tiktoken_tokenizer.py │ └── litellm_tokenizer.py ├── data_model │ ├── __init__.py │ ├── types.py │ ├── named.py │ └── identified.py ├── index │ ├── run │ │ └── __init__.py │ ├── utils │ │ ├── __init__.py │ │ ├── uuid.py │ │ ├── is_null.py │ │ ├── hashing.py │ │ ├── string.py │ │ └── dicts.py │ ├── __init__.py │ ├── typing │ │ ├── __init__.py │ │ ├── state.py │ │ ├── error_handler.py │ │ ├── stats.py │ │ ├── pipeline_run_result.py │ │ ├── pipeline.py │ │ ├── workflow.py │ │ └── context.py │ ├── input │ │ ├── __init__.py │ │ └── text.py │ ├── operations │ │ ├── __init__.py │ │ ├── chunk_text │ │ │ ├── __init__.py │ │ │ ├── typing.py │ │ │ └── bootstrap.py │ │ ├── embed_text │ │ │ ├── __init__.py │ │ │ └── strategies │ │ │ │ ├── __init__.py │ │ │ │ ├── typing.py │ │ │ │ └── mock.py │ │ ├── 
summarize_communities │ │ │ ├── __init__.py │ │ │ ├── graph_context │ │ │ │ └── __init__.py │ │ │ ├── text_unit_context │ │ │ │ └── __init__.py │ │ │ ├── utils.py │ │ │ └── explode_communities.py │ │ ├── build_noun_graph │ │ │ ├── np_extractors │ │ │ │ ├── __init__.py │ │ │ │ ├── stop_words.py │ │ │ │ ├── np_validator.py │ │ │ │ └── resource_loader.py │ │ │ └── __init__.py │ │ ├── embed_graph │ │ │ ├── __init__.py │ │ │ ├── typing.py │ │ │ └── embed_node2vec.py │ │ ├── layout_graph │ │ │ ├── __init__.py │ │ │ └── typing.py │ │ ├── extract_graph │ │ │ ├── __init__.py │ │ │ └── typing.py │ │ ├── summarize_descriptions │ │ │ ├── __init__.py │ │ │ └── typing.py │ │ ├── extract_covariates │ │ │ ├── __init__.py │ │ │ └── typing.py │ │ ├── compute_degree.py │ │ ├── snapshot_graphml.py │ │ ├── create_graph.py │ │ ├── finalize_community_reports.py │ │ ├── graph_to_dataframes.py │ │ └── compute_edge_combined_degree.py │ ├── update │ │ └── __init__.py │ ├── text_splitting │ │ ├── __init__.py │ │ └── check_token_limit.py │ └── workflows │ │ ├── update_clean_state.py │ │ └── update_final_documents.py ├── query │ ├── __init__.py │ ├── input │ │ ├── __init__.py │ │ ├── loaders │ │ │ └── __init__.py │ │ └── retrieval │ │ │ └── __init__.py │ ├── llm │ │ └── __init__.py │ ├── question_gen │ │ └── __init__.py │ ├── structured_search │ │ ├── __init__.py │ │ ├── drift_search │ │ │ └── __init__.py │ │ ├── basic_search │ │ │ └── __init__.py │ │ ├── global_search │ │ │ └── __init__.py │ │ └── local_search │ │ │ └── __init__.py │ └── context_builder │ │ ├── __init__.py │ │ └── rate_prompt.py ├── storage │ └── __init__.py ├── logger │ └── __init__.py ├── prompt_tune │ ├── __init__.py │ ├── generator │ │ ├── __init__.py │ │ ├── language.py │ │ ├── domain.py │ │ ├── persona.py │ │ ├── community_reporter_role.py │ │ ├── community_report_rating.py │ │ └── entity_summarization_prompt.py │ ├── loader │ │ └── __init__.py │ ├── prompt │ │ ├── __init__.py │ │ ├── domain.py │ │ ├── language.py │ │ 
├── persona.py │ │ └── community_reporter_role.py │ ├── template │ │ ├── __init__.py │ │ └── entity_summarization.py │ ├── types.py │ └── defaults.py ├── prompts │ ├── __init__.py │ ├── query │ │ ├── __init__.py │ │ ├── global_search_knowledge_system_prompt.py │ │ └── question_gen_system_prompt.py │ └── index │ │ ├── __init__.py │ │ └── summarize_descriptions.py ├── utils │ └── __init__.py ├── cache │ └── __init__.py ├── language_model │ ├── events │ │ ├── __init__.py │ │ └── base.py │ ├── providers │ │ ├── __init__.py │ │ ├── fnllm │ │ │ ├── __init__.py │ │ │ ├── events.py │ │ │ └── cache.py │ │ └── litellm │ │ │ ├── services │ │ │ ├── __init__.py │ │ │ ├── retry │ │ │ │ ├── __init__.py │ │ │ │ ├── retry_factory.py │ │ │ │ └── retry.py │ │ │ └── rate_limiter │ │ │ │ ├── __init__.py │ │ │ │ ├── rate_limiter_factory.py │ │ │ │ └── rate_limiter.py │ │ │ ├── request_wrappers │ │ │ └── __init__.py │ │ │ └── __init__.py │ ├── protocol │ │ └── __init__.py │ ├── cache │ │ ├── __init__.py │ │ └── base.py │ ├── response │ │ └── __init__.py │ └── __init__.py ├── callbacks │ ├── __init__.py │ ├── llm_callbacks.py │ ├── noop_workflow_callbacks.py │ ├── query_callbacks.py │ ├── noop_query_callbacks.py │ └── workflow_callbacks.py ├── vector_stores │ └── __init__.py ├── __main__.py └── api │ └── __init__.py ├── .semversioner ├── 0.1.0.json ├── 1.1.2.json ├── 0.3.6.json ├── 1.1.1.json ├── 0.3.4.json ├── 2.5.0.json ├── 2.2.1.json ├── 2.7.0.json ├── 0.2.2.json ├── 1.0.1.json ├── 1.2.0.json ├── 1.0.0.json ├── 2.1.0.json ├── 2.4.0.json ├── 0.3.0.json ├── 0.3.1.json ├── 0.4.1.json ├── 2.3.0.json ├── 0.5.0.json ├── 0.3.2.json ├── 0.9.0.json ├── 2.2.0.json ├── 0.3.5.json └── 1.1.0.json ├── examples_notebooks ├── inputs │ └── operation dulce │ │ └── lancedb │ │ └── entity_description_embeddings.lance │ │ ├── _latest.manifest │ │ ├── _versions │ │ ├── 1.manifest │ │ └── 2.manifest │ │ ├── data │ │ └── fe64774f-5412-4c9c-8dea-f6ed55c81119.lance │ │ └── _transactions │ │ ├── 
0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn │ │ └── 1-bf5aa024-a229-461f-8d78-699841a302fe.txn └── community_contrib │ └── README.md ├── .gitattributes ├── CODEOWNERS ├── .vscode ├── extensions.json └── settings.json ├── CODE_OF_CONDUCT.md ├── cspell.config.yaml ├── .gitignore ├── SUPPORT.md ├── .vsts-ci.yml └── LICENSE /tests/unit/query/data/hidden/output/.hidden: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/config/prompt-a.txt: -------------------------------------------------------------------------------- 1 | Hello, World! A -------------------------------------------------------------------------------- /tests/unit/config/prompt-b.txt: -------------------------------------------------------------------------------- 1 | Hello, World! B -------------------------------------------------------------------------------- /tests/unit/config/prompt-c.txt: -------------------------------------------------------------------------------- 1 | Hello, World! C -------------------------------------------------------------------------------- /tests/unit/config/prompt-d.txt: -------------------------------------------------------------------------------- 1 | Hello, World! 
D -------------------------------------------------------------------------------- /tests/unit/query/data/empty/something-else/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/hidden/output/.another/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/defaults/output/20240812-120000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/defaults/output/20240812-121000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/hidden/output/20240812-120000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/hidden/output/20240812-121000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: true 2 | -------------------------------------------------------------------------------- /tests/unit/config/fixtures/timestamp_dirs/20240812-120000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/non-numeric/output/20240812-120000/empty.txt: 
-------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/non-numeric/output/20240812-121000/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/query/data/non-numeric/output/something-else/empty.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-txts/input2.txt: -------------------------------------------------------------------------------- 1 | I'm outta here -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/one-txt/input.txt: -------------------------------------------------------------------------------- 1 | Hi how are you today? -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-txts/input1.txt: -------------------------------------------------------------------------------- 1 | Hi how are you today? 
-------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-csvs/input3.csv: -------------------------------------------------------------------------------- 1 | title,text 2 | Hi,I'm here -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-csvs/input2.csv: -------------------------------------------------------------------------------- 1 | title,text 2 | Adios,See you later -------------------------------------------------------------------------------- /scripts/spellcheck.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | npx --yes cspell -c cspell.config.yaml --no-progress lint . -------------------------------------------------------------------------------- /scripts/start-azurite.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | npx --yes azurite -L -l ./temp_azurite -d ./temp_azurite/debug.log -------------------------------------------------------------------------------- /tests/smoke/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /docs/img/GraphRag-Figure1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/GraphRag-Figure1.jpg -------------------------------------------------------------------------------- /docs/img/pipeline-running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/pipeline-running.png -------------------------------------------------------------------------------- /tests/notebook/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/query/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /docs/img/auto-tune-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/auto-tune-diagram.png -------------------------------------------------------------------------------- /tests/integration/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/config/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-jsons/input2.json: -------------------------------------------------------------------------------- 1 | { 2 | "title": "Hi", 3 | "text": "I'm here" 4 | } -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/one-csv/input.csv: -------------------------------------------------------------------------------- 1 | title,text 2 | Hello,Hi how are you today? 3 | Goodbye,I'm outta here -------------------------------------------------------------------------------- /docs/img/drift-search-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/drift-search-diagram.png -------------------------------------------------------------------------------- /tests/integration/cache/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/integration/storage/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/cache/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/graph/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/query/input/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/documents.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/documents.parquet -------------------------------------------------------------------------------- /tests/verbs/data/entities.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/entities.parquet -------------------------------------------------------------------------------- /docs/data/operation_dulce/dataset.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/data/operation_dulce/dataset.zip -------------------------------------------------------------------------------- /tests/unit/indexing/graph/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-csvs/input1.csv: -------------------------------------------------------------------------------- 1 | title,text 2 | Hello,Hi how are you today? 3 | Goodbye,I'm outta here -------------------------------------------------------------------------------- /tests/unit/indexing/operations/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/litellm_services/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/communities.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/communities.parquet -------------------------------------------------------------------------------- /tests/verbs/data/covariates.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/covariates.parquet -------------------------------------------------------------------------------- /tests/verbs/data/text_units.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/text_units.parquet -------------------------------------------------------------------------------- /unified-search-app/images/image-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/unified-search-app/images/image-1.png -------------------------------------------------------------------------------- /unified-search-app/images/image-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/unified-search-app/images/image-2.png -------------------------------------------------------------------------------- /unified-search-app/images/image-3.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/unified-search-app/images/image-3.png -------------------------------------------------------------------------------- /unified-search-app/images/image-4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/unified-search-app/images/image-4.png -------------------------------------------------------------------------------- /graphrag/py.typed: -------------------------------------------------------------------------------- 1 | # This package supports type hinting, 2 | # see https://www.python.org/dev/peps/pep-0561/#packaging-type-information -------------------------------------------------------------------------------- /tests/integration/language_model/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/graph/extractors/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/text_splitting/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/entities/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/helpers/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/query/context_builder/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/query/input/retrieval/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/relationships.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/relationships.parquet -------------------------------------------------------------------------------- /docs/img/viz_guide/gephi-layout-pane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/viz_guide/gephi-layout-pane.png -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/one-json-one-object/input.json: -------------------------------------------------------------------------------- 1 | { 2 | "title": "Hello", 3 | "text": "Hi how are you today?" 
4 | } -------------------------------------------------------------------------------- /tests/unit/indexing/operations/chunk_text/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/community_reports.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/community_reports.parquet -------------------------------------------------------------------------------- /docs/img/viz_guide/gephi-appearance-pane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/viz_guide/gephi-appearance-pane.png -------------------------------------------------------------------------------- /graphrag/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The GraphRAG package.""" 5 | -------------------------------------------------------------------------------- /graphrag/cli/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """CLI for GraphRAG.""" 5 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/entities/extraction/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/text_units_metadata.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/text_units_metadata.parquet -------------------------------------------------------------------------------- /graphrag/factory/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Factory module.""" 5 | -------------------------------------------------------------------------------- /unified-search-app/app/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """App module.""" 5 | -------------------------------------------------------------------------------- /graphrag/config/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The config package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/tokenizer/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """GraphRAG tokenizer.""" 5 | -------------------------------------------------------------------------------- /tests/unit/indexing/graph/extractors/community_reports/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/entities/extraction/strategies/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /unified-search-app/app/rag/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Rag module.""" 5 | -------------------------------------------------------------------------------- /unified-search-app/app/ui/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """App UI module.""" 5 | -------------------------------------------------------------------------------- /docs/img/viz_guide/gephi-initial-graph-example.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/viz_guide/gephi-initial-graph-example.png -------------------------------------------------------------------------------- /docs/img/viz_guide/gephi-layout-forceatlas2-pane.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/viz_guide/gephi-layout-forceatlas2-pane.png -------------------------------------------------------------------------------- /graphrag/data_model/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Knowledge model package.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/run/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Run module for GraphRAG.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Utils methods definition.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The query engine package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/storage/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The storage package root.""" 5 | -------------------------------------------------------------------------------- /unified-search-app/app/state/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """App state module.""" 5 | -------------------------------------------------------------------------------- /docs/img/viz_guide/gephi-network-overview-settings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/img/viz_guide/gephi-network-overview-settings.png -------------------------------------------------------------------------------- /graphrag/index/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The indexing engine package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/typing/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Root typings for GraphRAG.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/input/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """GraphRAG Orchestration Inputs.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/llm/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Orchestration LLM utilities.""" 5 | -------------------------------------------------------------------------------- /tests/integration/logging/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Tests for logger module.""" 5 | -------------------------------------------------------------------------------- /graphrag/logger/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Logger utilities and implementations.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The prompt-tuning package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Prompt generation module.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompts/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """All prompts for the GraphRAG system.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompts/query/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """All prompts for the query engine.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/question_gen/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Question Generation Module.""" 5 | -------------------------------------------------------------------------------- /graphrag/utils/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Util functions for the GraphRAG package.""" 5 | -------------------------------------------------------------------------------- /graphrag/cache/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A package containing cache implementations.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/input/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine input package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Reusable data frame operations.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/events/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Model Event handler modules.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Model Providers module.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompts/index/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """All prompts for the indexing engine.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/structured_search/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Structured Search package.""" 5 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/entities/extraction/strategies/graph_intelligence/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | -------------------------------------------------------------------------------- /tests/verbs/data/text_units_metadata_included_chunk.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/tests/verbs/data/text_units_metadata_included_chunk.parquet -------------------------------------------------------------------------------- /graphrag/callbacks/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A package containing callback implementations.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/update/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Incremental Indexing main module definition.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/fnllm/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """FNLLM provider module.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/input/loaders/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """GraphRAG Orchestration Input Loaders.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/structured_search/drift_search/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """DriftSearch module.""" 5 | -------------------------------------------------------------------------------- /unified-search-app/app/knowledge_loader/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Knowledge loader module.""" 5 | -------------------------------------------------------------------------------- /docs/data/operation_dulce/ABOUT.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing. 
-------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/documents.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/documents.parquet -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/entities.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/entities.parquet -------------------------------------------------------------------------------- /graphrag/config/models/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Interfaces for Default Config parameterization.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/protocol/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Base protocol definitions for LLMs.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Services.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/loader/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning config and data loader module.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/input/retrieval/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """GraphRAG Orchestration Input Retrieval.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/structured_search/basic_search/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The BasicSearch package.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/structured_search/global_search/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """GlobalSearch module.""" 5 | -------------------------------------------------------------------------------- /graphrag/query/structured_search/local_search/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The LocalSearch package.""" 5 | -------------------------------------------------------------------------------- /tests/fixtures/azure/input/ABOUT.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing. 
-------------------------------------------------------------------------------- /tests/fixtures/text/input/ABOUT.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing. -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/communities.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/communities.parquet -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/covariates.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/covariates.parquet -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/text_units.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/text_units.parquet -------------------------------------------------------------------------------- /graphrag/vector_stores/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A package containing vector store implementations.""" 5 | -------------------------------------------------------------------------------- /tests/fixtures/min-csv/input/ABOUT.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing. -------------------------------------------------------------------------------- /unified-search-app/app/knowledge_loader/data_sources/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Data sources module.""" 5 | -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/relationships.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/relationships.parquet -------------------------------------------------------------------------------- /graphrag/index/operations/chunk_text/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine text chunk package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_text/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine text embed package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_communities/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Community summarization modules.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/text_splitting/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine Text Splitting package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/cache/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Cache provider definitions for Language Models.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/retry/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Retry Services.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/response/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing Model response definitions.""" 5 | -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/community_reports.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/community_reports.parquet -------------------------------------------------------------------------------- /graphrag/index/operations/build_noun_graph/np_extractors/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """NLP-based graph extractors.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_graph/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine graph embed package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/layout_graph/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine graph layout package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/rate_limiter/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Rate Limiter.""" 5 | -------------------------------------------------------------------------------- /tests/integration/vector_stores/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Integration tests for vector store implementations.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/build_noun_graph/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine noun graph package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/extract_graph/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine entities extraction package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_descriptions/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Root package for description summarization.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_text/strategies/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine embed strategies package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/operations/extract_covariates/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The Indexing Engine text extract claims package root.""" 5 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/prompt/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Persona, entity type, relationships and domain generation prompts module.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/request_wrappers/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM completion/embedding function wrappers.""" 5 | -------------------------------------------------------------------------------- /graphrag/__main__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """The GraphRAG package.""" 5 | 6 | from graphrag.cli.main import app 7 | 8 | app(prog_name="graphrag") 9 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_communities/graph_context/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Package of context builders for graph-based reports.""" 5 | -------------------------------------------------------------------------------- /graphrag/index/typing/state.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Pipeline state types.""" 5 | 6 | from typing import Any 7 | 8 | PipelineState = dict[Any, Any] 9 | -------------------------------------------------------------------------------- /graphrag/query/context_builder/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Functions to build context for system prompt to generate responses for a user query.""" 5 | -------------------------------------------------------------------------------- /.semversioner/0.1.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Initial Release", 5 | "type": "minor" 6 | } 7 | ], 8 | "created_at": "2024-07-01T21:48:50+00:00", 9 | "version": "0.1.0" 10 | } -------------------------------------------------------------------------------- /.semversioner/1.1.2.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Basic Rag minor fix", 5 | "type": "patch" 6 | } 7 | ], 8 | "created_at": "2025-01-09T22:29:23+00:00", 9 | "version": "1.1.2" 10 | } -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_communities/text_unit_context/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Package of context builders for text unit-based reports.""" 5 | -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/ABOUT.md: -------------------------------------------------------------------------------- 1 | # About 2 | 3 | This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of providing a starting point for notebook experimentation. 4 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/template/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for entity extraction, entity summarization, and community report summarization.""" 5 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """GraphRAG LiteLLM module. Provides LiteLLM-based implementations of chat and embedding models.""" 5 | -------------------------------------------------------------------------------- /docs/cli.md: -------------------------------------------------------------------------------- 1 | # CLI Reference 2 | 3 | This page documents the command-line interface of the graphrag library. 4 | 5 | ::: mkdocs-typer 6 | :module: graphrag.cli.main 7 | :prog_name: graphrag 8 | :command: app 9 | :depth: 0 10 | -------------------------------------------------------------------------------- /graphrag/language_model/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """GraphRAG Language Models module. Allows for provider registrations while providing some out-of-the-box solutions.""" 5 | -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_latest.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_latest.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/1.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/1.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/2.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/2.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/3.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/3.manifest -------------------------------------------------------------------------------- 
/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/4.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_versions/4.manifest -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/1.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/1.manifest -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/2.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/2.manifest -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | 5 | def pytest_addoption(parser): 6 | parser.addoption( 7 | "--run_slow", action="store_true", default=False, help="run slow tests" 8 | ) 9 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | *.txt text eol=lf 2 | *.md text eol=lf 3 | *.yml text eol=lf 4 | *.html text eol=lf 5 | *.py text eol=lf 6 | *.toml text eol=lf 7 | .gitattributes text eol=lf 8 | .gitignore text eol=lf 9 | *.lock 10 | CODEOWNERS text eol=lf 11 | LICENSE text eol=lf -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/1.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/1.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/2.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/2.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/3.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/3.manifest 
-------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/4.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_versions/4.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/1.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/1.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/2.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/2.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/3.manifest: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/3.manifest -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/4.manifest: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_versions/4.manifest -------------------------------------------------------------------------------- /graphrag/data_model/types.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Common types for the GraphRAG knowledge model.""" 5 | 6 | from collections.abc import Callable 7 | 8 | TextEmbedder = Callable[[str], list[float]] 9 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/multiple-jsons/input1.json: -------------------------------------------------------------------------------- 1 | [{ 2 | "title": "Hello", 3 | "text": "Hi how are you today?" 4 | }, { 5 | "title": "Goodbye", 6 | "text": "I'm outta here" 7 | }, { 8 | "title": "Adios", 9 | "text": "See you later" 10 | }] 11 | -------------------------------------------------------------------------------- /tests/unit/indexing/input/data/one-json-multiple-objects/input.json: -------------------------------------------------------------------------------- 1 | [{ 2 | "title": "Hello", 3 | "text": "Hi how are you today?" 4 | }, { 5 | "title": "Goodbye", 6 | "text": "I'm outta here" 7 | }, { 8 | "title": "Adios", 9 | "text": "See you later" 10 | }] 11 | -------------------------------------------------------------------------------- /graphrag/index/typing/error_handler.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Shared error handler types.""" 5 | 6 | from collections.abc import Callable 7 | 8 | ErrorHandlerFn = Callable[[BaseException | None, str | None, dict | None], None] 9 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | # These owners will be the default owners for everything in 2 | # the repo. Unless a later match takes precedence, 3 | # @global-owner1 and @global-owner2 will be requested for 4 | # review when someone opens a pull request. 5 | * @microsoft/societal-resilience @microsoft/graphrag-core-team 6 | -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/data/2794bf5b-de3d-4202-ab16-e76bc27c8e6a.lance: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/data/2794bf5b-de3d-4202-ab16-e76bc27c8e6a.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/data/2f74c8e8-3f35-4209-889c-a13cf0780eb3.lance: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/data/2f74c8e8-3f35-4209-889c-a13cf0780eb3.lance -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/data/fe64774f-5412-4c9c-8dea-f6ed55c81119.lance: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/data/fe64774f-5412-4c9c-8dea-f6ed55c81119.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/data/a34575c4-5260-457f-bebe-3f40bc0e2ee3.lance: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/data/a34575c4-5260-457f-bebe-3f40bc0e2ee3.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/data/eabd7580-86f5-4022-8aa7-fe0aff816d98.lance: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/data/eabd7580-86f5-4022-8aa7-fe0aff816d98.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/data/1e7b2d94-ed06-4aa0-b22e-86a71d416bc6.lance: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/data/1e7b2d94-ed06-4aa0-b22e-86a71d416bc6.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/data/1ed9f301-ce30-46a8-8c0b-9c2a60a3cf43.lance: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/data/1ed9f301-ce30-46a8-8c0b-9c2a60a3cf43.lance -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/0-fd0434ac-e5cd-4ddd-9dd5-e5048d4edb59.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/0-fd0434ac-e5cd-4ddd-9dd5-e5048d4edb59.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/1-14bb4b1d-cc00-420b-9b14-3626f0bd8c0b.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/1-14bb4b1d-cc00-420b-9b14-3626f0bd8c0b.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/2-8e74264c-f72d-44f5-a6f4-b3b61ae6a43b.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/2-8e74264c-f72d-44f5-a6f4-b3b61ae6a43b.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/3-7516fb71-9db3-4666-bdef-ea04c1eb9697.txn: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-text_unit-text.lance/_transactions/3-7516fb71-9db3-4666-bdef-ea04c1eb9697.txn -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn -------------------------------------------------------------------------------- /examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/1-bf5aa024-a229-461f-8d78-699841a302fe.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/1-bf5aa024-a229-461f-8d78-699841a302fe.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/0-92c031e5-7558-451e-9d0f-f5514db9616d.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/0-92c031e5-7558-451e-9d0f-f5514db9616d.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation 
dulce/lancedb/default-entity-description.lance/_transactions/1-7b3cb8d8-3512-4584-a003-91838fed8911.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/1-7b3cb8d8-3512-4584-a003-91838fed8911.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/2-7de627d2-4c57-49e9-bf73-c17a9582ead4.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/2-7de627d2-4c57-49e9-bf73-c17a9582ead4.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/3-9ad29d69-9a69-43a8-8b26-252ea267958d.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-entity-description.lance/_transactions/3-9ad29d69-9a69-43a8-8b26-252ea267958d.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/0-2fed1d8b-daac-41b0-a93a-e115cda75be3.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/0-2fed1d8b-daac-41b0-a93a-e115cda75be3.txn 
-------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/1-61dbb7c2-aec3-4796-b223-941fc7cc93cc.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/1-61dbb7c2-aec3-4796-b223-941fc7cc93cc.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/2-60012692-a153-48f9-8f4e-c479b44cbf3f.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/2-60012692-a153-48f9-8f4e-c479b44cbf3f.txn -------------------------------------------------------------------------------- /docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/3-0d2dc9a1-094f-4220-83c7-6ad6f26fac2b.txn: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/microsoft/graphrag/HEAD/docs/examples_notebooks/inputs/operation dulce/lancedb/default-community-full_content.lance/_transactions/3-0d2dc9a1-094f-4220-83c7-6ad6f26fac2b.txn -------------------------------------------------------------------------------- /.semversioner/0.3.6.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Collapse create_final_relationships.", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Dependency update and cleanup", 9 | "type": "patch" 10 | } 11 | ], 12 | "created_at": 
"2024-09-20T00:09:13+00:00", 13 | "version": "0.3.6" 14 | } -------------------------------------------------------------------------------- /.vscode/extensions.json: -------------------------------------------------------------------------------- 1 | { 2 | "recommendations": [ 3 | "arcanis.vscode-zipfs", 4 | "ms-python.python", 5 | "charliermarsh.ruff", 6 | "ms-python.vscode-pylance", 7 | "bierner.markdown-mermaid", 8 | "streetsidesoftware.code-spell-checker", 9 | "ronnidc.nunjucks", 10 | "lucien-martijn.parquet-visualizer", 11 | ] 12 | } 13 | -------------------------------------------------------------------------------- /tests/unit/config/fixtures/minimal_config/settings.yaml: -------------------------------------------------------------------------------- 1 | models: 2 | default_chat_model: 3 | api_key: ${CUSTOM_API_KEY} 4 | type: chat 5 | model_provider: openai 6 | model: gpt-4-turbo-preview 7 | default_embedding_model: 8 | api_key: ${CUSTOM_API_KEY} 9 | type: embedding 10 | model_provider: openai 11 | model: text-embedding-3-small -------------------------------------------------------------------------------- /tests/fixtures/azure/config.json: -------------------------------------------------------------------------------- 1 | { 2 | "input_path": "./tests/fixtures/azure", 3 | "input_file_type": "text", 4 | "workflow_config": { 5 | "skip_assert": true, 6 | "azure": { 7 | "input_container": "azurefixture", 8 | "input_base_dir": "input" 9 | } 10 | }, 11 | "query_config": [], 12 | "slow": false 13 | } -------------------------------------------------------------------------------- /.semversioner/1.1.1.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Fix a bug on creating community hierarchy for dynamic search", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Increase LOCAL_SEARCH_COMMUNITY_PROP to 15%", 9 | "type": "patch" 10 | } 11 | ], 12 | "created_at": 
"2025-01-08T21:53:16+00:00", 13 | "version": "1.1.1" 14 | } -------------------------------------------------------------------------------- /.semversioner/0.3.4.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Deep copy txt units on local search to avoid race conditions", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Fix summarization including empty descriptions", 9 | "type": "patch" 10 | } 11 | ], 12 | "created_at": "2024-09-11T22:31:58+00:00", 13 | "version": "0.3.4" 14 | } -------------------------------------------------------------------------------- /graphrag/callbacks/llm_callbacks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LLM Callbacks.""" 5 | 6 | from typing import Protocol 7 | 8 | 9 | class BaseLLMCallback(Protocol): 10 | """Base class for LLM callbacks.""" 11 | 12 | def on_llm_new_token(self, token: str): 13 | """Handle when a new token is generated.""" 14 | ... 
15 | -------------------------------------------------------------------------------- /tests/unit/config/fixtures/minimal_config_missing_env_var/settings.yaml: -------------------------------------------------------------------------------- 1 | models: 2 | default_chat_model: 3 | api_key: ${SOME_NON_EXISTENT_ENV_VAR} 4 | type: chat 5 | model_provider: openai 6 | model: gpt-4-turbo-preview 7 | default_embedding_model: 8 | api_key: ${SOME_NON_EXISTENT_ENV_VAR} 9 | type: embedding 10 | model_provider: openai 11 | model: text-embedding-3-small -------------------------------------------------------------------------------- /.semversioner/2.5.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add additional context variable to build index signature for custom parameter bag", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "swap package management from Poetry -> UV", 9 | "type": "minor" 10 | } 11 | ], 12 | "created_at": "2025-08-14T00:59:46+00:00", 13 | "version": "2.5.0" 14 | } -------------------------------------------------------------------------------- /graphrag/index/operations/embed_graph/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing different lists and dictionaries.""" 5 | 6 | # Use this for now instead of a wrapper 7 | from typing import Any 8 | 9 | NodeList = list[str] 10 | EmbeddingList = list[Any] 11 | NodeEmbeddings = dict[str, list[float]] 12 | """Label -> Embedding""" 13 | -------------------------------------------------------------------------------- /scripts/semver-check.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | changes=$(git diff --name-only origin/main) 3 | has_change_doc=$(echo $changes | grep .semversioner/next-release) 4 | has_impacting_changes=$(echo $changes | grep graphrag) 5 | 6 | if [ "$has_impacting_changes" ] && [ -z "$has_change_doc" ]; then 7 | echo "Check failed. Run 'uv run semversioner add-change' to update the next release version" 8 | exit 1 9 | fi 10 | echo "OK" 11 | -------------------------------------------------------------------------------- /graphrag/index/utils/uuid.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """UUID utilities.""" 5 | 6 | import uuid 7 | from random import Random, getrandbits 8 | 9 | 10 | def gen_uuid(rd: Random | None = None): 11 | """Generate a random UUID v4.""" 12 | return uuid.UUID( 13 | int=rd.getrandbits(128) if rd is not None else getrandbits(128), version=4 14 | ).hex 15 | -------------------------------------------------------------------------------- /graphrag/data_model/named.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A package containing the 'Named' protocol.""" 5 | 6 | from dataclasses import dataclass 7 | 8 | from graphrag.data_model.identified import Identified 9 | 10 | 11 | @dataclass 12 | class Named(Identified): 13 | """A protocol for an item with a name/title.""" 14 | 15 | title: str 16 | """The name/title of the item.""" 17 | -------------------------------------------------------------------------------- /.semversioner/2.2.1.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Fix Community Report prompt tuning response", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Fix graph creation missing edge weights.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Update as workflows", 13 | "type": "patch" 14 | } 15 | ], 16 | "created_at": "2025-04-30T23:50:31+00:00", 17 | "version": "2.2.1" 18 | } -------------------------------------------------------------------------------- /.semversioner/2.7.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Set LiteLLM as default in init_content.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Fix Azure auth scope issue with LiteLLM.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Housekeeping toward 2.7.", 13 | "type": "patch" 14 | } 15 | ], 16 | "created_at": "2025-10-08T22:39:42+00:00", 17 | "version": "2.7.0" 18 | } -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 
4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns 10 | -------------------------------------------------------------------------------- /graphrag/prompts/query/global_search_knowledge_system_prompt.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Global Search system prompts.""" 5 | 6 | GENERAL_KNOWLEDGE_INSTRUCTION = """ 7 | The response may also include relevant real-world knowledge outside the dataset, but it must be explicitly annotated with a verification tag [LLM: verify]. For example: 8 | "This is an example sentence supported by real-world knowledge [LLM: verify]." 9 | """ 10 | -------------------------------------------------------------------------------- /graphrag/index/operations/build_noun_graph/np_extractors/stop_words.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Custom list of stop words to be excluded by noun phrase extractors.""" 5 | 6 | EN_STOP_WORDS = [ 7 | "stuff", 8 | "thing", 9 | "things", 10 | "bunch", 11 | "bit", 12 | "bits", 13 | "people", 14 | "person", 15 | "okay", 16 | "hey", 17 | "hi", 18 | "hello", 19 | "laughter", 20 | "oh", 21 | ] 22 | -------------------------------------------------------------------------------- /.github/workflows/semver.yml: -------------------------------------------------------------------------------- 1 | name: Semver Check 2 | on: 3 | pull_request: 4 | types: 5 | - opened 6 | - reopened 7 | - synchronize 8 | - ready_for_review 9 | branches: [main] 10 | 11 | jobs: 12 | semver: 13 | # skip draft PRs 14 | if: github.event.pull_request.draft == false 15 | runs-on: ubuntu-latest 16 | steps: 17 | - uses: actions/checkout@v4 18 | with: 19 | fetch-depth: 0 20 | - name: Check Semver 21 | run: ./scripts/semver-check.sh -------------------------------------------------------------------------------- /graphrag/prompt_tune/types.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Types for prompt tuning.""" 5 | 6 | from enum import Enum 7 | 8 | 9 | class DocSelectionType(str, Enum): 10 | """The type of document selection to use.""" 11 | 12 | ALL = "all" 13 | RANDOM = "random" 14 | TOP = "top" 15 | AUTO = "auto" 16 | 17 | def __str__(self): 18 | """Return the string representation of the enum value.""" 19 | return self.value 20 | -------------------------------------------------------------------------------- /graphrag/index/utils/is_null.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Defines the is_null utility.""" 5 | 6 | import math 7 | from typing import Any 8 | 9 | 10 | def is_null(value: Any) -> bool: 11 | """Check if value is null or is nan.""" 12 | 13 | def is_none() -> bool: 14 | return value is None 15 | 16 | def is_nan() -> bool: 17 | return isinstance(value, float) and math.isnan(value) 18 | 19 | return is_none() or is_nan() 20 | -------------------------------------------------------------------------------- /.github/workflows/spellcheck.yml: -------------------------------------------------------------------------------- 1 | name: Spellcheck 2 | on: 3 | push: 4 | branches: [main] 5 | pull_request: 6 | types: 7 | - opened 8 | - reopened 9 | - synchronize 10 | - ready_for_review 11 | paths: 12 | - "**/*" 13 | jobs: 14 | spellcheck: 15 | # skip draft PRs 16 | if: github.event.pull_request.draft == false 17 | runs-on: ubuntu-latest 18 | steps: 19 | - uses: actions/checkout@v4 20 | 21 | - name: Spellcheck 22 | run: ./scripts/spellcheck.sh 23 | -------------------------------------------------------------------------------- /graphrag/index/utils/hashing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Hashing utilities.""" 5 | 6 | from collections.abc import Iterable 7 | from hashlib import sha512 8 | from typing import Any 9 | 10 | 11 | def gen_sha512_hash(item: dict[str, Any], hashcode: Iterable[str]): 12 | """Generate a SHA512 hash.""" 13 | hashed = "".join([str(item[column]) for column in hashcode]) 14 | return f"{sha512(hashed.encode('utf-8'), usedforsecurity=False).hexdigest()}" 15 | -------------------------------------------------------------------------------- /graphrag/index/operations/compute_degree.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing compute_degree definition.""" 5 | 6 | import networkx as nx 7 | import pandas as pd 8 | 9 | 10 | def compute_degree(graph: nx.Graph) -> pd.DataFrame: 11 | """Create a new DataFrame with the degree of each node in the graph.""" 12 | return pd.DataFrame([ 13 | {"title": node, "degree": int(degree)} 14 | for node, degree in graph.degree # type: ignore 15 | ]) 16 | -------------------------------------------------------------------------------- /graphrag/index/text_splitting/check_token_limit.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Token limit method definition.""" 5 | 6 | from graphrag.index.text_splitting.text_splitting import TokenTextSplitter 7 | 8 | 9 | def check_token_limit(text, max_token): 10 | """Return 1 if the text fits in a single chunk of max_token tokens, otherwise 0.""" 11 | text_splitter = TokenTextSplitter(chunk_size=max_token, chunk_overlap=0) 12 | docs = text_splitter.split_text(text) 13 | if len(docs) > 1: 14 | return 0 15 | return 1 16 | -------------------------------------------------------------------------------- /examples_notebooks/community_contrib/README.md: -------------------------------------------------------------------------------- 1 | ## Disclaimer 2 | 3 | This folder contains community-contributed notebooks that are not officially supported by the GraphRAG team. The notebooks are provided as-is and are not guaranteed to work with the latest version of GraphRAG. If you have any questions or issues, please reach out to the author of the notebook directly. 
4 | 5 | For more information on how to contribute to the GraphRAG project, please refer to the [contribution guidelines](https://github.com/microsoft/graphrag/blob/main/CONTRIBUTING.md) 6 | -------------------------------------------------------------------------------- /graphrag/data_model/identified.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A package containing the 'Identified' protocol.""" 5 | 6 | from dataclasses import dataclass 7 | 8 | 9 | @dataclass 10 | class Identified: 11 | """A protocol for an item with an ID.""" 12 | 13 | id: str 14 | """The ID of the item.""" 15 | 16 | short_id: str | None 17 | """Human readable ID used to refer to this community in prompts or texts displayed to users, such as in a report text (optional).""" 18 | -------------------------------------------------------------------------------- /graphrag/config/models/umap_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | 10 | 11 | class UmapConfig(BaseModel): 12 | """Configuration section for UMAP.""" 13 | 14 | enabled: bool = Field( 15 | description="A flag indicating whether to enable UMAP.", 16 | default=graphrag_config_defaults.umap.enabled, 17 | ) 18 | -------------------------------------------------------------------------------- /graphrag/language_model/events/base.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Base model events protocol.""" 5 | 6 | from typing import Any, Protocol 7 | 8 | 9 | class ModelEventHandler(Protocol): 10 | """Protocol for Model event handling.""" 11 | 12 | async def on_error( 13 | self, 14 | error: BaseException | None, 15 | traceback: str | None = None, 16 | arguments: dict[str, Any] | None = None, 17 | ) -> None: 18 | """Handle a model error.""" 19 | ... 20 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/prompt/domain.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for domain generation.""" 5 | 6 | GENERATE_DOMAIN_PROMPT = """ 7 | You are an intelligent assistant that helps a human to analyze the information in a text document. 8 | Given a sample text, help the user by assigning a descriptive domain that summarizes what the text is about. 9 | Example domains are: "Social studies", "Algorithmic analysis", "Medical science", among others. 10 | 11 | Text: {input_text} 12 | Domain:""" 13 | -------------------------------------------------------------------------------- /tests/unit/indexing/verbs/helpers/mock_llm.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | from pydantic import BaseModel 4 | 5 | from graphrag.language_model.manager import ModelManager 6 | from graphrag.language_model.protocol.base import ChatModel 7 | 8 | 9 | def create_mock_llm(responses: list[str | BaseModel], name: str = "mock") -> ChatModel: 10 | """Creates a mock LLM that returns the given responses.""" 11 | return ModelManager().get_or_create_chat_model( 12 | name, "mock_chat", responses=responses 13 | ) 14 | -------------------------------------------------------------------------------- /.semversioner/0.2.2.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add a check if there is no community record added in local search context", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Add separate workflow for Python Tests", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Docs updates", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Run smoke tests on 4o", 17 | "type": "patch" 18 | } 19 | ], 20 | "created_at": "2024-08-08T22:40:57+00:00", 21 | "version": "0.2.2" 22 | } -------------------------------------------------------------------------------- /graphrag/prompt_tune/prompt/language.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for language detection.""" 5 | 6 | DETECT_LANGUAGE_PROMPT = """ 7 | You are an intelligent assistant that helps a human to analyze the information in a text document. 8 | Given a sample text, help the user by determining what's the primary language of the provided texts. 9 | Examples are: "English", "Spanish", "Japanese", "Portuguese" among others. Reply ONLY with the language name. 
10 | 11 | Text: {input_text} 12 | Language:""" 13 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | 5 | """Tests for the GraphRAG LLM module.""" 6 | 7 | # Register MOCK providers 8 | from graphrag.config.enums import ModelType 9 | from graphrag.language_model.factory import ModelFactory 10 | from tests.mock_provider import MockChatLLM, MockEmbeddingLLM 11 | 12 | ModelFactory.register_chat(ModelType.MockChat, lambda **kwargs: MockChatLLM(**kwargs)) 13 | ModelFactory.register_embedding( 14 | ModelType.MockEmbedding, lambda **kwargs: MockEmbeddingLLM(**kwargs) 15 | ) 16 | -------------------------------------------------------------------------------- /tests/unit/utils/test_encoding.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | from graphrag.tokenizer.get_tokenizer import get_tokenizer 5 | 6 | 7 | def test_encode_basic(): 8 | tokenizer = get_tokenizer() 9 | result = tokenizer.encode("abc def") 10 | 11 | assert result == [13997, 711], "Encoding failed to return expected tokens" 12 | 13 | 14 | def test_num_tokens_empty_input(): 15 | tokenizer = get_tokenizer() 16 | result = len(tokenizer.encode("")) 17 | 18 | assert result == 0, "Token count for empty input should be 0" 19 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_communities/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing community report generation utilities.""" 5 | 6 | import pandas as pd 7 | 8 | import graphrag.data_model.schemas as schemas 9 | 10 | 11 | def get_levels( 12 | df: pd.DataFrame, level_column: str = schemas.COMMUNITY_LEVEL 13 | ) -> list[int]: 14 | """Get the levels of the communities.""" 15 | levels = df[level_column].dropna().unique() 16 | levels = [int(lvl) for lvl in levels if lvl != -1] 17 | return sorted(levels, reverse=True) 18 | -------------------------------------------------------------------------------- /.semversioner/1.0.1.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Fix encoding model config parsing", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Fix exception on error callbacks", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Manage llm instances inside a cached singleton. Check for empty dfs after entity/relationship extraction", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Respect encoding_model option", 17 | "type": "patch" 18 | } 19 | ], 20 | "created_at": "2024-12-18T23:12:52+00:00", 21 | "version": "1.0.1" 22 | } -------------------------------------------------------------------------------- /.semversioner/1.2.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add Drift Reduce response and streaming endpoint", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "add cosmosdb vector store", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "Fix example notebooks", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Set default rate limits.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "unit tests for text_splitting", 21 | "type": "patch" 22 | } 23 | ], 24 | "created_at": "2025-01-15T20:32:00+00:00", 25 | "version": "1.2.0" 26 | } 
-------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "editor.formatOnSave": false, 3 | "explorer.fileNesting.enabled": true, 4 | "debug.internalConsoleOptions": "neverOpen", 5 | "python.defaultInterpreterPath": "${workspaceRoot}/.venv/bin/python", 6 | "python.languageServer": "Pylance", 7 | "cSpell.customDictionaries": { 8 | "project-words": { 9 | "name": "project-words", 10 | "path": "${workspaceRoot}/dictionary.txt", 11 | "description": "Words used in this project", 12 | "addWords": true 13 | }, 14 | "custom": true, // Enable the `custom` dictionary 15 | "internal-terms": true // Disable the `internal-terms` dictionary 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /graphrag/index/operations/snapshot_graphml.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing snapshot_graphml method definition.""" 5 | 6 | import networkx as nx 7 | 8 | from graphrag.storage.pipeline_storage import PipelineStorage 9 | 10 | 11 | async def snapshot_graphml( 12 | input: str | nx.Graph, 13 | name: str, 14 | storage: PipelineStorage, 15 | ) -> None: 16 | """Take an entire snapshot of a graph to standard graphml format.""" 17 | graphml = input if isinstance(input, str) else "\n".join(nx.generate_graphml(input)) 18 | await storage.set(name + ".graphml", graphml) 19 | -------------------------------------------------------------------------------- /unified-search-app/Dockerfile: -------------------------------------------------------------------------------- 1 | 2 | # Copyright (c) Microsoft Corporation. All rights reserved. 
3 | # Dockerfile 4 | # https://eng.ms/docs/more/containers-secure-supply-chain/approved-images 5 | FROM mcr.microsoft.com/oryx/python:3.11 6 | 7 | RUN curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg 8 | RUN apt-get update -y 9 | 10 | # Install dependencies 11 | WORKDIR ./ 12 | COPY . . 13 | RUN curl -LsSf https://astral.sh/uv/install.sh | sh 14 | ENV PATH="${PATH}:/root/.local/bin" 15 | RUN uv sync --no-install-project 16 | 17 | # Run application 18 | EXPOSE 8501 19 | ENTRYPOINT ["uv","run","poe","start_prod"] -------------------------------------------------------------------------------- /docs/config/overview.md: -------------------------------------------------------------------------------- 1 | # Configuring GraphRAG Indexing 2 | 3 | The GraphRAG system is highly configurable. This page provides an overview of the configuration options available for the GraphRAG indexing engine. 4 | 5 | ## Default Configuration Mode 6 | 7 | The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. 
The main ways to set up GraphRAG in Default Configuration mode are via: 8 | 9 | - [Init command](init.md) (recommended first step) 10 | - [Edit settings.yaml for deeper control](yaml.md) 11 | - [Purely using environment variables](env_vars.md) (not recommended) 12 | -------------------------------------------------------------------------------- /docs/scripts/create_cookie_banner.js: -------------------------------------------------------------------------------- 1 | function onConsentChanged(categoryPreferences) { 2 | console.log("onConsentChanged", categoryPreferences); 3 | } 4 | 5 | 6 | cb = document.createElement("div"); 7 | cb.id = "cookie-banner"; 8 | document.body.insertBefore(cb, document.body.children[0]); 9 | 10 | window.WcpConsent && WcpConsent.init("en-US", "cookie-banner", function (err, consent) { 11 | if (!err) { 12 | console.log("consent: ", consent); 13 | window.manageConsent = () => consent.manageConsent(); 14 | siteConsent = consent; 15 | } else { 16 | console.log("Error initializing WcpConsent: "+ err); 17 | } 18 | }, onConsentChanged, WcpConsent.themes.light); -------------------------------------------------------------------------------- /.semversioner/1.0.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add Parent id to communities data model", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Add migration notebook.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Create separate community workflow, collapse subflows.", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Dependency Updates", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "cleanup and refactor factory classes.", 21 | "type": "patch" 22 | } 23 | ], 24 | "created_at": "2024-12-11T21:41:49+00:00", 25 | "version": "1.0.0" 26 | } -------------------------------------------------------------------------------- /docs/query/notebooks/overview.md: 
-------------------------------------------------------------------------------- 1 | # API Notebooks 2 | 3 | - [API Overview Notebook](../../examples_notebooks/api_overview.ipynb) 4 | - [Bring-Your-Own Vector Store](../../examples_notebooks/custom_vector_store.ipynb) 5 | 6 | # Query Engine Notebooks 7 | 8 | For examples about running Query please refer to the following notebooks: 9 | 10 | - [Global Search Notebook](../../examples_notebooks/global_search.ipynb) 11 | - [Local Search Notebook](../../examples_notebooks/local_search.ipynb) 12 | - [DRIFT Search Notebook](../../examples_notebooks/drift_search.ipynb) 13 | 14 | The test dataset for these notebooks can be found in [dataset.zip](../../data/operation_dulce/dataset.zip){:download}. 15 | -------------------------------------------------------------------------------- /unified-search-app/app/knowledge_loader/data_sources/default.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Data sources default module.""" 5 | 6 | import os 7 | 8 | container_name = "data" 9 | blob_container_name = os.getenv("BLOB_CONTAINER_NAME", container_name) 10 | blob_account_name = os.getenv("BLOB_ACCOUNT_NAME") 11 | 12 | local_data_root = os.getenv("DATA_ROOT") 13 | 14 | LISTING_FILE = "listing.json" 15 | 16 | if local_data_root is None and blob_account_name is None: 17 | error_message = ( 18 | "Either DATA_ROOT or BLOB_ACCOUNT_NAME environment variable must be set." 19 | ) 20 | raise ValueError(error_message) 21 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/prompt/persona.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for persona generation.""" 5 | 6 | GENERATE_PERSONA_PROMPT = """ 7 | You are an intelligent assistant that helps a human to analyze the information in a text document. 8 | Given a specific type of task and sample text, help the user by generating a 3 to 4 sentence description of an expert who could help solve the problem. 9 | Use a format similar to the following: 10 | You are an expert {{role}}. You are skilled at {{relevant skills}}. You are adept at helping people with {{specific task}}. 11 | 12 | task: {sample_task} 13 | persona description:""" 14 | -------------------------------------------------------------------------------- /.semversioner/2.1.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add support for JSON input files.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Updated the prompt tuning client to support csv-metadata injection and updated output file types to match the new naming convention.", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "Add check for custom model types while config loading", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Adds general-purpose pipeline run state object.", 17 | "type": "patch" 18 | } 19 | ], 20 | "created_at": "2025-03-11T23:53:00+00:00", 21 | "version": "2.1.0" 22 | } -------------------------------------------------------------------------------- /graphrag/index/utils/string.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """String utilities.""" 5 | 6 | import html 7 | import re 8 | from typing import Any 9 | 10 | 11 | def clean_str(input: Any) -> str: 12 | """Clean an input string by removing HTML escapes, control characters, and other unwanted characters.""" 13 | # If we get non-string input, just give it back 14 | if not isinstance(input, str): 15 | return input 16 | 17 | result = html.unescape(input.strip()) 18 | # https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python 19 | return re.sub(r"[\x00-\x1f\x7f-\x9f]", "", result) 20 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/defaults.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Default values for the prompt-tuning module. 5 | 6 | Note: These values get accessed from the CLI to set default behavior. 7 | To maintain fast responsiveness from the CLI, do not add long-running code in this file and be mindful of imports. 8 | """ 9 | 10 | DEFAULT_TASK = """ 11 | Identify the relations and structure of the community of interest, specifically within the {domain} domain. 12 | """ 13 | 14 | K = 15 15 | LIMIT = 15 16 | MAX_TOKEN_COUNT = 2000 17 | MIN_CHUNK_SIZE = 200 18 | N_SUBSET_MAX = 300 19 | MIN_CHUNK_OVERLAP = 0 20 | PROMPT_TUNING_MODEL_ID = "default_chat_model" 21 | -------------------------------------------------------------------------------- /unified-search-app/app/rag/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Typing module.""" 5 | 6 | from dataclasses import dataclass 7 | from enum import Enum 8 | 9 | import pandas as pd 10 | 11 | 12 | class SearchType(Enum): 13 | """SearchType class definition.""" 14 | 15 | Basic = "basic" 16 | Local = "local" 17 | Global = "global" 18 | Drift = "drift" 19 | 20 | 21 | @dataclass 22 | class SearchResult: 23 | """SearchResult class definition.""" 24 | 25 | # create a dataclass to store the search result of each algorithm 26 | search_type: SearchType 27 | response: str 28 | context: dict[str, pd.DataFrame] 29 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/retry/retry_factory.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Retry Factory.""" 5 | 6 | from graphrag.config.defaults import DEFAULT_RETRY_SERVICES 7 | from graphrag.factory.factory import Factory 8 | from graphrag.language_model.providers.litellm.services.retry.retry import Retry 9 | 10 | 11 | class RetryFactory(Factory[Retry]): 12 | """Singleton factory for creating retry services.""" 13 | 14 | 15 | retry_factory = RetryFactory() 16 | 17 | for service_name, service_cls in DEFAULT_RETRY_SERVICES.items(): 18 | retry_factory.register(strategy=service_name, service_initializer=service_cls) 19 | -------------------------------------------------------------------------------- /.semversioner/2.4.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Allow injection of custom pipelines.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Refactored StorageFactory to use a registration-based approach", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "Fix default values for tpm and rpm limiters on embeddings", 13 | "type": 
"patch" 14 | }, 15 | { 16 | "description": "Update typer.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "cleaned up logging to follow python standards.", 21 | "type": "patch" 22 | } 23 | ], 24 | "created_at": "2025-07-15T00:04:15+00:00", 25 | "version": "2.4.0" 26 | } -------------------------------------------------------------------------------- /tests/unit/utils/test_embeddings.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | import pytest 5 | 6 | from graphrag.config.embeddings import create_index_name 7 | 8 | 9 | def test_create_index_name(): 10 | collection = create_index_name("default", "entity.title") 11 | assert collection == "default-entity-title" 12 | 13 | 14 | def test_create_index_name_invalid_embedding_throws(): 15 | with pytest.raises(KeyError): 16 | create_index_name("default", "invalid.name") 17 | 18 | 19 | def test_create_index_name_invalid_embedding_does_not_throw(): 20 | collection = create_index_name("default", "invalid.name", validate=False) 21 | assert collection == "default-invalid-name" 22 | -------------------------------------------------------------------------------- /graphrag/index/operations/layout_graph/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | # Use this for now instead of a wrapper 5 | """A module containing 'NodePosition' model.""" 6 | 7 | from dataclasses import dataclass 8 | 9 | 10 | @dataclass 11 | class NodePosition: 12 | """Node position class definition.""" 13 | 14 | label: str 15 | cluster: str 16 | size: float 17 | 18 | x: float 19 | y: float 20 | z: float | None = None 21 | 22 | def to_pandas(self) -> tuple[str, float, float, str, float]: 23 | """To pandas method definition.""" 24 | return self.label, self.x, self.y, self.cluster, self.size 25 | 26 | 27 | GraphLayout = list[NodePosition] 28 | -------------------------------------------------------------------------------- /graphrag/index/operations/create_graph.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing create_graph definition.""" 5 | 6 | import networkx as nx 7 | import pandas as pd 8 | 9 | 10 | def create_graph( 11 | edges: pd.DataFrame, 12 | edge_attr: list[str | int] | None = None, 13 | nodes: pd.DataFrame | None = None, 14 | node_id: str = "title", 15 | ) -> nx.Graph: 16 | """Create a networkx graph from nodes and edges dataframes.""" 17 | graph = nx.from_pandas_edgelist(edges, edge_attr=edge_attr) 18 | 19 | if nodes is not None: 20 | nodes.set_index(node_id, inplace=True) 21 | graph.add_nodes_from((n, dict(d)) for n, d in nodes.iterrows()) 22 | 23 | return graph 24 | -------------------------------------------------------------------------------- /cspell.config.yaml: -------------------------------------------------------------------------------- 1 | $schema: https://raw.githubusercontent.com/streetsidesoftware/cspell/main/cspell.schema.json 2 | version: "0.2" 3 | allowCompoundWords: true 4 | dictionaryDefinitions: 5 | - name: dictionary 6 | path: "./dictionary.txt" 7 | addWords: true 8 | dictionaries: 9 | - dictionary 10 | ignorePaths: 
11 | - cspell.config.yaml 12 | - node_modules 13 | - _site 14 | - /project-words.txt 15 | - default_pipeline.yml 16 | - .turbo 17 | - output/ 18 | - dist/ 19 | - temp_azurite/ 20 | - __pycache__ 21 | - pyproject.toml 22 | - entity_extraction.txt 23 | - package.json 24 | - tests/fixtures/ 25 | - examples_notebooks/inputs/ 26 | - docs/examples_notebooks/inputs/ 27 | - "*.csv" 28 | - "*.parquet" 29 | - "*.faiss" 30 | - "*.ipynb" 31 | - "*.log" 32 | -------------------------------------------------------------------------------- /.semversioner/0.3.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Implement auto templating API.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Implement query engine API.", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "Fix file dumps using json for non ASCII chars", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Stabilize smoke tests for query context building", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "fix query embedding", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "fix sort_context & max_tokens params in verb", 25 | "type": "patch" 26 | } 27 | ], 28 | "created_at": "2024-08-12T23:51:49+00:00", 29 | "version": "0.3.0" 30 | } -------------------------------------------------------------------------------- /graphrag/index/typing/stats.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Pipeline stats types.""" 5 | 6 | from dataclasses import dataclass, field 7 | 8 | 9 | @dataclass 10 | class PipelineRunStats: 11 | """Pipeline running stats.""" 12 | 13 | total_runtime: float = field(default=0) 14 | """Float representing the total runtime.""" 15 | 16 | num_documents: int = field(default=0) 17 | """Number of documents.""" 18 | update_documents: int = field(default=0) 19 | """Number of update documents.""" 20 | 21 | input_load_time: float = field(default=0) 22 | """Float representing the input load time.""" 23 | 24 | workflows: dict[str, dict[str, float]] = field(default_factory=dict) 25 | """A dictionary of workflows.""" 26 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_text/strategies/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing 'TextEmbeddingResult' model.""" 5 | 6 | from collections.abc import Awaitable, Callable 7 | from dataclasses import dataclass 8 | 9 | from graphrag.cache.pipeline_cache import PipelineCache 10 | from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks 11 | 12 | 13 | @dataclass 14 | class TextEmbeddingResult: 15 | """Text embedding result class definition.""" 16 | 17 | embeddings: list[list[float] | None] | None 18 | 19 | 20 | TextEmbeddingStrategy = Callable[ 21 | [ 22 | list[str], 23 | WorkflowCallbacks, 24 | PipelineCache, 25 | dict, 26 | ], 27 | Awaitable[TextEmbeddingResult], 28 | ] 29 | -------------------------------------------------------------------------------- /graphrag/index/utils/dicts.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A utility module containing methods for inspecting and verifying dictionary types.""" 5 | 6 | 7 | def dict_has_keys_with_types( 8 | data: dict, expected_fields: list[tuple[str, type]], inplace: bool = False 9 | ) -> bool: 10 | """Return True if the given dictionary has the given keys with the given types.""" 11 | for field, field_type in expected_fields: 12 | if field not in data: 13 | return False 14 | 15 | value = data[field] 16 | try: 17 | cast_value = field_type(value) 18 | if inplace: 19 | data[field] = cast_value 20 | except (TypeError, ValueError): 21 | return False 22 | return True 23 | -------------------------------------------------------------------------------- /graphrag/config/read_dotenv.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing the read_dotenv utility.""" 5 | 6 | import logging 7 | import os 8 | from pathlib import Path 9 | 10 | from dotenv import dotenv_values 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | 15 | def read_dotenv(root: str) -> None: 16 | """Read a .env file in the given root path.""" 17 | env_path = Path(root) / ".env" 18 | if env_path.exists(): 19 | logger.info("Loading pipeline .env file") 20 | env_config = dotenv_values(f"{env_path}") 21 | for key, value in env_config.items(): 22 | if key not in os.environ: 23 | os.environ[key] = value or "" 24 | else: 25 | logger.info("No .env file found at %s", root) 26 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_communities/explode_communities.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Explode a list of communities into nodes for filtering.""" 5 | 6 | import pandas as pd 7 | 8 | from graphrag.data_model.schemas import ( 9 | COMMUNITY_ID, 10 | ) 11 | 12 | 13 | def explode_communities( 14 | communities: pd.DataFrame, entities: pd.DataFrame 15 | ) -> pd.DataFrame: 16 | """Explode a list of communities into nodes for filtering.""" 17 | community_join = communities.explode("entity_ids").loc[ 18 | :, ["community", "level", "entity_ids"] 19 | ] 20 | nodes = entities.merge( 21 | community_join, left_on="id", right_on="entity_ids", how="left" 22 | ) 23 | return nodes.loc[nodes.loc[:, COMMUNITY_ID] != -1] 24 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/fnllm/events.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """FNLLM llm events provider.""" 5 | 6 | from typing import Any 7 | 8 | from fnllm.events import LLMEvents 9 | 10 | from graphrag.index.typing.error_handler import ErrorHandlerFn 11 | 12 | 13 | class FNLLMEvents(LLMEvents): 14 | """FNLLM events handler that calls the error handler.""" 15 | 16 | def __init__(self, on_error: ErrorHandlerFn): 17 | self._on_error = on_error 18 | 19 | async def on_error( 20 | self, 21 | error: BaseException | None, 22 | traceback: str | None = None, 23 | arguments: dict[str, Any] | None = None, 24 | ) -> None: 25 | """Handle an fnllm error.""" 26 | self._on_error(error, traceback, arguments) 27 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/rate_limiter/rate_limiter_factory.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Rate Limiter Factory.""" 5 | 6 | from graphrag.config.defaults import DEFAULT_RATE_LIMITER_SERVICES 7 | from graphrag.factory.factory import Factory 8 | from graphrag.language_model.providers.litellm.services.rate_limiter.rate_limiter import ( 9 | RateLimiter, 10 | ) 11 | 12 | 13 | class RateLimiterFactory(Factory[RateLimiter]): 14 | """Singleton factory for creating rate limiter services.""" 15 | 16 | 17 | rate_limiter_factory = RateLimiterFactory() 18 | 19 | for service_name, service_cls in DEFAULT_RATE_LIMITER_SERVICES.items(): 20 | rate_limiter_factory.register( 21 | strategy=service_name, service_initializer=service_cls 22 | ) 23 | -------------------------------------------------------------------------------- /unified-search-app/app/ui/questions_list.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Question list module.""" 5 | 6 | import streamlit as st 7 | from state.session_variables import SessionVariables 8 | 9 | 10 | def create_questions_list_ui(sv: SessionVariables): 11 | """Return question list UI component.""" 12 | selection = st.dataframe( 13 | sv.generated_questions.value, 14 | use_container_width=True, 15 | hide_index=True, 16 | selection_mode="single-row", 17 | column_config={"value": "question"}, 18 | on_select="rerun", 19 | ) 20 | rows = selection.selection.rows 21 | if len(rows) > 0: 22 | question_index = selection.selection.rows[0] 23 | sv.selected_question.value = sv.generated_questions.value[question_index] 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Python Artifacts 2 | python/*/lib/ 3 | dist/ 4 | build/ 5 | *.egg-info/ 6 | 7 | # Test Output 8 | .coverage 9 | coverage/ 10 | licenses.txt 11 | 
examples_notebooks/*/data 12 | tests/fixtures/cache 13 | tests/fixtures/*/cache 14 | tests/fixtures/*/output 15 | output/lancedb 16 | 17 | 18 | # Random 19 | .DS_Store 20 | *.log* 21 | .venv 22 | venv/ 23 | .conda 24 | .tmp 25 | 26 | .env 27 | build.zip 28 | 29 | .turbo 30 | 31 | __pycache__ 32 | 33 | .pipeline 34 | 35 | # Azurite 36 | temp_azurite/ 37 | __azurite*.json 38 | __blobstorage*.json 39 | __blobstorage__/ 40 | 41 | # Getting started example 42 | ragtest/ 43 | .ragtest/ 44 | .pipelines 45 | .pipeline 46 | 47 | 48 | # mkdocs 49 | site/ 50 | 51 | # Docs migration 52 | docsite/ 53 | .yarn/ 54 | .pnp* 55 | 56 | # PyCharm 57 | .idea/ 58 | 59 | # Jupyter notebook 60 | .ipynb_checkpoints/ 61 | -------------------------------------------------------------------------------- /.github/dependabot.yml: -------------------------------------------------------------------------------- 1 | # To get started with Dependabot version updates, you'll need to specify which 2 | # package ecosystems to update and where the package manifests are located. 3 | # Please see the documentation for all configuration options: 4 | # https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file 5 | version: 2 6 | updates: 7 | - package-ecosystem: "pip" # See documentation for possible values 8 | directory: "/" # Location of package manifests 9 | schedule: 10 | interval: "weekly" 11 | - package-ecosystem: "github-actions" 12 | # Workflow files stored in the default location of `.github/workflows`. (You don't need to specify `/.github/workflows` for `directory`. You can use `directory: "/"`.) 13 | directory: "/" 14 | schedule: 15 | interval: "weekly" 16 | -------------------------------------------------------------------------------- /graphrag/index/typing/pipeline_run_result.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing the PipelineRunResult class.""" 5 | 6 | from dataclasses import dataclass 7 | from typing import Any 8 | 9 | from graphrag.index.typing.state import PipelineState 10 | 11 | 12 | @dataclass 13 | class PipelineRunResult: 14 | """Pipeline run result class definition.""" 15 | 16 | workflow: str 17 | """The name of the workflow that was executed.""" 18 | result: Any | None 19 | """The result of the workflow function. This can be anything - we use it only for logging downstream, and expect each workflow function to write official outputs to the provided storage.""" 20 | state: PipelineState 21 | """Ongoing pipeline context state object.""" 22 | errors: list[BaseException] | None 23 | -------------------------------------------------------------------------------- /unified-search-app/app/ui/report_list.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Report list module.""" 5 | 6 | import streamlit as st 7 | from state.session_variables import SessionVariables 8 | 9 | 10 | def create_report_list_ui(sv: SessionVariables): 11 | """Return report list UI component.""" 12 | selection = st.dataframe( 13 | sv.community_reports.value, 14 | height=1000, 15 | hide_index=True, 16 | column_order=["id", "title"], 17 | selection_mode="single-row", 18 | on_select="rerun", 19 | ) 20 | rows = selection.selection.rows 21 | if len(rows) > 0: 22 | report_index = selection.selection.rows[0] 23 | sv.selected_report.value = sv.community_reports.value.iloc[report_index] 24 | else: 25 | sv.selected_report.value = None 26 | -------------------------------------------------------------------------------- /docs/stylesheets/extra.css: -------------------------------------------------------------------------------- 1 | [data-md-color-scheme="default"] { 2 | --md-primary-fg-color: #3c4cab; 3 | --md-code-hl-color: #3772d9; 4 | --md-code-hl-comment-color: #6b6b6b; 5 | --md-code-hl-operator-color: #6b6b6b; 6 | --md-footer-fg-color--light: #ffffff; 7 | --md-footer-fg-color--lighter: #ffffff; 8 | } 9 | 10 | [data-md-color-scheme="slate"] { 11 | --md-primary-fg-color: #364499; 12 | --md-code-hl-color: #246be5; 13 | --md-code-hl-constant-color: #9a89ed; 14 | --md-code-hl-number-color: #f16e5f; 15 | --md-footer-fg-color--light: #ffffff; 16 | --md-footer-fg-color--lighter: #ffffff; 17 | } 18 | 19 | .md-tabs__item--active { 20 | background-color: var(--md-primary-bg-color); 21 | } 22 | 23 | .md-tabs__item--active .md-tabs__link { 24 | color: var(--md-code-hl-color); 25 | } 26 | 27 | .md-typeset a { 28 | text-decoration: underline; 29 | } -------------------------------------------------------------------------------- /graphrag/index/operations/chunk_text/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing 'TextChunk' model.""" 5 | 6 | from collections.abc import Callable, Iterable 7 | from dataclasses import dataclass 8 | 9 | from graphrag.config.models.chunking_config import ChunkingConfig 10 | from graphrag.logger.progress import ProgressTicker 11 | 12 | 13 | @dataclass 14 | class TextChunk: 15 | """Text chunk class definition.""" 16 | 17 | text_chunk: str 18 | source_doc_indices: list[int] 19 | n_tokens: int | None = None 20 | 21 | 22 | ChunkInput = str | list[str] | list[tuple[str, str]] 23 | """Input to a chunking strategy. Can be a string, a list of strings, or a list of tuples of (id, text).""" 24 | 25 | ChunkStrategy = Callable[ 26 | [list[str], ChunkingConfig, ProgressTicker], Iterable[TextChunk] 27 | ] 28 | -------------------------------------------------------------------------------- /graphrag/index/typing/pipeline.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing the Pipeline class.""" 5 | 6 | from collections.abc import Generator 7 | 8 | from graphrag.index.typing.workflow import Workflow 9 | 10 | 11 | class Pipeline: 12 | """Encapsulates running workflows.""" 13 | 14 | def __init__(self, workflows: list[Workflow]): 15 | self.workflows = workflows 16 | 17 | def run(self) -> Generator[Workflow]: 18 | """Return a Generator over the pipeline workflows.""" 19 | yield from self.workflows 20 | 21 | def names(self) -> list[str]: 22 | """Return the names of the workflows in the pipeline.""" 23 | return [name for name, _ in self.workflows] 24 | 25 | def remove(self, name: str) -> None: 26 | """Remove a workflow from the pipeline by name.""" 27 | self.workflows = [w for w in self.workflows if w[0] != name] 28 | -------------------------------------------------------------------------------- /graphrag/config/models/cluster_graph_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | 10 | 11 | class ClusterGraphConfig(BaseModel): 12 | """Configuration section for clustering graphs.""" 13 | 14 | max_cluster_size: int = Field( 15 | description="The maximum cluster size to use.", 16 | default=graphrag_config_defaults.cluster_graph.max_cluster_size, 17 | ) 18 | use_lcc: bool = Field( 19 | description="Whether to use the largest connected component.", 20 | default=graphrag_config_defaults.cluster_graph.use_lcc, 21 | ) 22 | seed: int = Field( 23 | description="The seed to use for the clustering.", 24 | default=graphrag_config_defaults.cluster_graph.seed, 25 | ) 26 | -------------------------------------------------------------------------------- /tests/fixtures/azure/settings.yml: -------------------------------------------------------------------------------- 1 | extract_claims: 2 | enabled: true 3 | 4 | vector_store: 5 | default_vector_store: 6 | type: "azure_ai_search" 7 | url: ${AZURE_AI_SEARCH_URL_ENDPOINT} 8 | api_key: ${AZURE_AI_SEARCH_API_KEY} 9 | container_name: "azure_ci" 10 | 11 | input: 12 | storage: 13 | type: blob 14 | connection_string: ${LOCAL_BLOB_STORAGE_CONNECTION_STRING} 15 | container_name: azurefixture 16 | base_dir: input 17 | file_type: text 18 | 19 | 20 | cache: 21 | type: blob 22 | connection_string: ${BLOB_STORAGE_CONNECTION_STRING} 23 | container_name: cicache 24 | base_dir: cache_azure_ai 25 | 26 | output: 27 | type: blob 28 | connection_string: ${LOCAL_BLOB_STORAGE_CONNECTION_STRING} 29 | container_name: azurefixture 30 | base_dir: output 31 | 32 | reporting: 33 | type: blob 34 | connection_string: ${LOCAL_BLOB_STORAGE_CONNECTION_STRING} 35 | container_name: azurefixture 36 | base_dir: reports 37 | -------------------------------------------------------------------------------- 
/.semversioner/0.3.1.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add preflight check to check LLM connectivity.", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Add streaming support for local/global search to query cli", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Add support for both float and int on schema validation for community report generation", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Avoid running index on gh-pages publishing", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Implement Index API", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Improves filtering for data dir inferring", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "Update to nltk 3.9.1", 29 | "type": "patch" 30 | } 31 | ], 32 | "created_at": "2024-08-21T22:46:19+00:00", 33 | "version": "0.3.1" 34 | } -------------------------------------------------------------------------------- /.semversioner/0.4.1.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add update cli entrypoint for incremental indexing", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Allow some CI/CD jobs to skip PRs dedicated to doc updates only.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Fix a file paths issue in the viz guide.", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Fix optional covariates update in incremental indexing", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Raise error on empty deltas for inc indexing", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "add visualization guide to doc site", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "fix streaming output error", 29 | "type": "patch" 30 | } 31 | ], 32 | "created_at": "2024-11-08T23:13:05+00:00", 33 | "version": "0.4.1" 34 | } 
-------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/language.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Language detection for GraphRAG prompts.""" 5 | 6 | from graphrag.language_model.protocol.base import ChatModel 7 | from graphrag.prompt_tune.prompt.language import DETECT_LANGUAGE_PROMPT 8 | 9 | 10 | async def detect_language(model: ChatModel, docs: str | list[str]) -> str: 11 | """Detect input language to use for GraphRAG prompts. 12 | 13 | Parameters 14 | ---------- 15 | - model (ChatModel): The chat model to use for generation 16 | - docs (str | list[str]): The docs to detect language from 17 | 18 | Returns 19 | ------- 20 | - str: The detected language. 21 | """ 22 | docs_str = " ".join(docs) if isinstance(docs, list) else docs 23 | language_prompt = DETECT_LANGUAGE_PROMPT.format(input_text=docs_str) 24 | 25 | response = await model.achat(language_prompt) 26 | 27 | return str(response.output.content) 28 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/domain.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Domain generation for GraphRAG prompts.""" 5 | 6 | from graphrag.language_model.protocol.base import ChatModel 7 | from graphrag.prompt_tune.prompt.domain import GENERATE_DOMAIN_PROMPT 8 | 9 | 10 | async def generate_domain(model: ChatModel, docs: str | list[str]) -> str: 11 | """Generate a domain description to use for GraphRAG prompts. 12 | 13 | Parameters 14 | ---------- 15 | - model (ChatModel): The chat model to use for generation 16 | - docs (str | list[str]): The docs to generate a domain from 17 | 18 | Returns 19 | ------- 20 | - str: The generated domain prompt response.
21 | """ 22 | docs_str = " ".join(docs) if isinstance(docs, list) else docs 23 | domain_prompt = GENERATE_DOMAIN_PROMPT.format(input_text=docs_str) 24 | 25 | response = await model.achat(domain_prompt) 26 | 27 | return str(response.output.content) 28 | -------------------------------------------------------------------------------- /graphrag/prompts/index/summarize_descriptions.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A file containing prompts definition.""" 5 | 6 | SUMMARIZE_PROMPT = """ 7 | You are a helpful assistant responsible for generating a comprehensive summary of the data provided below. 8 | Given one or more entities, and a list of descriptions, all related to the same entity or group of entities. 9 | Please concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions. 10 | If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary. 11 | Make sure it is written in third person, and include the entity names so we have the full context. 12 | Limit the final description length to {max_length} words. 13 | 14 | ####### 15 | -Data- 16 | Entities: {entity_name} 17 | Description List: {description_list} 18 | ####### 19 | Output: 20 | """ 21 | -------------------------------------------------------------------------------- /graphrag/query/context_builder/rate_prompt.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Global search with dynamic community selection prompt.""" 5 | 6 | RATE_QUERY = """ 7 | ---Role--- 8 | You are a helpful assistant responsible for deciding whether the provided information is useful in answering a given question, even if it is only partially relevant. 9 | ---Goal--- 10 | On a scale from 0 to 5, please rate how relevant or helpful the provided information is in answering the question. 11 | ---Information--- 12 | {description} 13 | ---Question--- 14 | {question} 15 | ---Target response length and format--- 16 | Please respond in the following JSON format with two entries: 17 | - "reason": the reasoning of your rating, please include information that you have considered. 18 | - "rating": the relevancy rating from 0 to 5, where 0 is the least relevant and 5 is the most relevant. 19 | {{ 20 | "reason": str, 21 | "rating": int 22 | }} 23 | """ 24 | -------------------------------------------------------------------------------- /graphrag/index/operations/build_noun_graph/np_extractors/np_validator.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation.
2 | # Licensed under the MIT License 3 | 4 | """Util functions to tag noun phrases for filtering.""" 5 | 6 | 7 | def is_compound(tokens: list[str]) -> bool: 8 | """List of tokens forms a compound noun phrase.""" 9 | return any( 10 | "-" in token and len(token.strip()) > 1 and len(token.strip().split("-")) > 1 11 | for token in tokens 12 | ) 13 | 14 | 15 | def has_valid_token_length(tokens: list[str], max_length: int) -> bool: 16 | """Check if all tokens have valid length.""" 17 | return all(len(token) <= max_length for token in tokens) 18 | 19 | 20 | def is_valid_entity(entity: tuple[str, str], tokens: list[str]) -> bool: 21 | """Check if the entity is valid.""" 22 | return (entity[1] not in ["CARDINAL", "ORDINAL"] and len(tokens) > 0) or ( 23 | entity[1] in ["CARDINAL", "ORDINAL"] 24 | and (len(tokens) > 1 or is_compound(tokens)) 25 | ) 26 | -------------------------------------------------------------------------------- /SUPPORT.md: -------------------------------------------------------------------------------- 1 | # Support 2 | 3 | ## How to file issues and get help 4 | 5 | This project uses GitHub Issues to track bugs and feature requests. Please search the existing 6 | issues before filing new issues to avoid duplicates. For new issues, file your bug or 7 | feature request as a new Issue. 8 | 9 | For help and questions about using this project, please create a GitHub issue with your question. 10 | 11 | ## Microsoft Support Policy 12 | 13 | # Support for this **PROJECT or PRODUCT** is limited to the resources listed above. 14 | 15 | # Support 16 | 17 | ## How to file issues and get help 18 | 19 | This project uses GitHub Issues to track bugs and feature requests. Please search the existing 20 | issues before filing new issues to avoid duplicates. For new issues, file your bug or 21 | feature request as a new Issue. 22 | 23 | For help and questions about using this project, please file an issue on the repo. 
24 | 25 | ## Microsoft Support Policy 26 | 27 | Support for this project is limited to the resources listed above. 28 | -------------------------------------------------------------------------------- /graphrag/language_model/cache/base.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Base cache protocol definition.""" 5 | 6 | from typing import Any, Protocol 7 | 8 | 9 | class ModelCache(Protocol): 10 | """Base cache protocol.""" 11 | 12 | async def has(self, key: str) -> bool: 13 | """Check if the cache has a value.""" 14 | ... 15 | 16 | async def get(self, key: str) -> Any | None: 17 | """Retrieve a value from the cache.""" 18 | ... 19 | 20 | async def set( 21 | self, key: str, value: Any, metadata: dict[str, Any] | None = None 22 | ) -> None: 23 | """Write a value into the cache.""" 24 | ... 25 | 26 | async def remove(self, key: str) -> None: 27 | """Remove a value from the cache.""" 28 | ... 29 | 30 | async def clear(self) -> None: 31 | """Clear the cache.""" 32 | ... 33 | 34 | def child(self, key: str) -> Any: 35 | """Create a child cache.""" 36 | ... 37 | -------------------------------------------------------------------------------- /graphrag/config/models/snapshots_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | 10 | 11 | class SnapshotsConfig(BaseModel): 12 | """Configuration section for snapshots.""" 13 | 14 | embeddings: bool = Field( 15 | description="A flag indicating whether to take snapshots of embeddings.", 16 | default=graphrag_config_defaults.snapshots.embeddings, 17 | ) 18 | graphml: bool = Field( 19 | description="A flag indicating whether to take snapshots of GraphML.", 20 | default=graphrag_config_defaults.snapshots.graphml, 21 | ) 22 | raw_graph: bool = Field( 23 | description="A flag indicating whether to take snapshots of the raw extracted graph (entities and relationships) before merging.", 24 | default=graphrag_config_defaults.snapshots.raw_graph, 25 | ) 26 | -------------------------------------------------------------------------------- /tests/verbs/test_prune_graph.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | from graphrag.config.create_graphrag_config import create_graphrag_config 5 | from graphrag.config.models.prune_graph_config import PruneGraphConfig 6 | from graphrag.index.workflows.prune_graph import ( 7 | run_workflow, 8 | ) 9 | from graphrag.utils.storage import load_table_from_storage 10 | 11 | from .util import ( 12 | DEFAULT_MODEL_CONFIG, 13 | create_test_context, 14 | ) 15 | 16 | 17 | async def test_prune_graph(): 18 | context = await create_test_context( 19 | storage=["entities", "relationships"], 20 | ) 21 | 22 | config = create_graphrag_config({"models": DEFAULT_MODEL_CONFIG}) 23 | config.prune_graph = PruneGraphConfig( 24 | min_node_freq=4, min_node_degree=0, min_edge_weight_pct=0 25 | ) 26 | 27 | await run_workflow(config, context) 28 | 29 | nodes_actual = await load_table_from_storage("entities", context.output_storage) 30 | 31 | assert len(nodes_actual) == 20 32 | -------------------------------------------------------------------------------- /graphrag/index/operations/chunk_text/bootstrap.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Bootstrap definition.""" 5 | 6 | import warnings 7 | 8 | # Ignore warnings from numba 9 | warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*") 10 | warnings.filterwarnings("ignore", message=".*Use no seed for parallelism.*") 11 | 12 | initialized_nltk = False 13 | 14 | 15 | def bootstrap(): 16 | """Bootstrap definition.""" 17 | global initialized_nltk 18 | if not initialized_nltk: 19 | import nltk 20 | from nltk.corpus import wordnet as wn 21 | 22 | nltk.download("punkt") 23 | nltk.download("punkt_tab") 24 | nltk.download("averaged_perceptron_tagger") 25 | nltk.download("averaged_perceptron_tagger_eng") 26 | nltk.download("maxent_ne_chunker") 27 | nltk.download("maxent_ne_chunker_tab") 28 | nltk.download("words") 29 | nltk.download("wordnet") 30 | wn.ensure_loaded() 31 | initialized_nltk = True 32 | -------------------------------------------------------------------------------- /.semversioner/2.3.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Remove Dynamic Max Retries support. Refactor typer typing in cli interface", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Update fnllm to latest. 
Update default graphrag configuration", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "A few fixes and enhancements for better reuse and flow.", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Add full llm response to LLM Provider output", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Fix Drift Reduce Response for non streaming calls", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Fix global search prompt to include missing formatting key", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "Upgrade pyarrow dependency to >=17.0.0 to fix CVE-2024-52338", 29 | "type": "patch" 30 | } 31 | ], 32 | "created_at": "2025-05-23T21:02:47+00:00", 33 | "version": "2.3.0" 34 | } -------------------------------------------------------------------------------- /.semversioner/0.5.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Data model changes.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Add Parquet as part of the default emitters when not present", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Centralized prompts and export all for easier injection.", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Cleanup of artifact outputs/schemas.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Config and docs updates.", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Implement dynamic community selection to global search", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "fix autocompletion of existing files/directory paths.", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "move import statements out of init files", 33 | "type": "patch" 34 | } 35 | ], 36 | "created_at": "2024-11-16T00:43:06+00:00", 37 | "version": "0.5.0" 38 | } -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/persona.py:
-------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Persona generating module for fine-tuning GraphRAG prompts.""" 5 | 6 | from graphrag.language_model.protocol.base import ChatModel 7 | from graphrag.prompt_tune.defaults import DEFAULT_TASK 8 | from graphrag.prompt_tune.prompt.persona import GENERATE_PERSONA_PROMPT 9 | 10 | 11 | async def generate_persona( 12 | model: ChatModel, domain: str, task: str = DEFAULT_TASK 13 | ) -> str: 14 | """Generate an LLM persona to use for GraphRAG prompts. 15 | 16 | Parameters 17 | ---------- 18 | - model (ChatModel): The chat model to use for generation 19 | - domain (str): The domain to generate a persona for 20 | - task (str): The task to generate a persona for. Default is DEFAULT_TASK 21 | """ 22 | formatted_task = task.format(domain=domain) 23 | persona_prompt = GENERATE_PERSONA_PROMPT.format(sample_task=formatted_task) 24 | 25 | response = await model.achat(persona_prompt) 26 | 27 | return str(response.output.content) 28 | -------------------------------------------------------------------------------- /graphrag/prompts/query/question_gen_system_prompt.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Question Generation system prompts.""" 5 | 6 | QUESTION_SYSTEM_PROMPT = """ 7 | ---Role--- 8 | 9 | You are a helpful assistant generating a bulleted list of {question_count} questions about data in the tables provided. 10 | 11 | 12 | ---Data tables--- 13 | 14 | {context_data} 15 | 16 | 17 | ---Goal--- 18 | 19 | Given a series of example questions provided by the user, generate a bulleted list of {question_count} candidates for the next question. Use - marks as bullet points.
20 | 21 | These candidate questions should represent the most important or urgent information content or themes in the data tables. 22 | 23 | The candidate questions should be answerable using the data tables provided, but should not mention any specific data fields or data tables in the question text. 24 | 25 | If the user's questions reference several named entities, then each candidate question should reference all named entities. 26 | 27 | ---Example questions--- 28 | """ 29 | -------------------------------------------------------------------------------- /docs/prompt_tuning/overview.md: -------------------------------------------------------------------------------- 1 | # Prompt Tuning ⚙️ 2 | 3 | This page provides an overview of the prompt tuning options available for the GraphRAG indexing engine. 4 | 5 | ## Default Prompts 6 | 7 | The default prompts are the simplest way to get started with the GraphRAG system. They are designed to work out of the box with minimal configuration. More details about each of the default prompts for indexing and query can be found on the [manual tuning](./manual_prompt_tuning.md) page. 8 | 9 | ## Auto Tuning 10 | 11 | Auto Tuning leverages your input data and LLM interactions to create domain-adapted prompts for the generation of the knowledge graph. Running it is highly encouraged, as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the [Auto Tuning](auto_prompt_tuning.md) documentation. 12 | 13 | ## Manual Tuning 14 | 15 | Manual tuning is an advanced use case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the [manual tuning](manual_prompt_tuning.md) documentation.
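
To make the idea of a hand-tuned prompt more concrete, here is a minimal sketch, assuming a prompt is a plain-text template whose curly-brace placeholders are filled with Python's `str.format`. The placeholder names (`max_length`, `entity_name`, `description_list`) mirror the default description-summarization prompt; the template wording and the `render_prompt` helper are hypothetical and are not part of GraphRAG's API.

```python
# Illustrative only: this template is a hypothetical stand-in for a tuned
# prompt file, not the text GraphRAG ships. The placeholder names mirror the
# ones used by the default description-summarization prompt.
TUNED_SUMMARIZE_TEMPLATE = """
You are a domain expert summarizing information about chemical compounds.
Combine the descriptions below into one coherent, third-person summary.
Limit the final description length to {max_length} words.

Entities: {entity_name}
Description List: {description_list}
Output:
"""


def render_prompt(template: str, **values: object) -> str:
    """Fill a template's curly-brace placeholders (hypothetical helper)."""
    return template.format(**values)


prompt = render_prompt(
    TUNED_SUMMARIZE_TEMPLATE,
    max_length=50,
    entity_name="aspirin",
    description_list=["A pain reliever.", "Also called acetylsalicylic acid."],
)
print(prompt)
```

When a template needs literal braces, for example to show a JSON output format, they have to be doubled (`{{` and `}}`) so that `str.format` does not treat them as placeholders.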
16 | -------------------------------------------------------------------------------- /graphrag/index/workflows/update_clean_state.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing run_workflow method definition.""" 5 | 6 | import logging 7 | 8 | from graphrag.config.models.graph_rag_config import GraphRagConfig 9 | from graphrag.index.typing.context import PipelineRunContext 10 | from graphrag.index.typing.workflow import WorkflowFunctionOutput 11 | 12 | logger = logging.getLogger(__name__) 13 | 14 | 15 | async def run_workflow( # noqa: RUF029 16 | _config: GraphRagConfig, 17 | context: PipelineRunContext, 18 | ) -> WorkflowFunctionOutput: 19 | """Clean the state after the update.""" 20 | logger.info("Workflow started: update_clean_state") 21 | keys_to_delete = [ 22 | key_name 23 | for key_name in context.state 24 | if key_name.startswith("incremental_update_") 25 | ] 26 | 27 | for key_name in keys_to_delete: 28 | del context.state[key_name] 29 | 30 | logger.info("Workflow completed: update_clean_state") 31 | return WorkflowFunctionOutput(result=None) 32 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/prompt/community_reporter_role.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for community reporter role generation.""" 5 | 6 | GENERATE_COMMUNITY_REPORTER_ROLE_PROMPT = """ 7 | {persona} 8 | Given a sample text, help the user by creating a role definition that will be tasked with community analysis. 9 | Take a look at this example, determine its key parts, and using the domain provided and your expertise, create a new role definition for the provided inputs that follows the same pattern as the example. 
10 | Remember, your output should look just like the provided example in structure and content. 11 | 12 | Example: 13 | A technologist reporter that is analyzing Kevin Scott's "Behind the Tech Podcast", given a list of entities 14 | that belong to the community as well as their relationships and optional associated claims. 15 | The report will be used to inform decision-makers about significant developments associated with the community and their potential impact. 16 | 17 | 18 | Domain: {domain} 19 | Text: {input_text} 20 | Role:""" 21 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/rate_limiter/rate_limiter.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Rate Limiter.""" 5 | 6 | from abc import ABC, abstractmethod 7 | from collections.abc import Iterator 8 | from contextlib import contextmanager 9 | from typing import Any 10 | 11 | 12 | class RateLimiter(ABC): 13 | """Abstract base class for rate limiters.""" 14 | 15 | @abstractmethod 16 | def __init__( 17 | self, 18 | /, 19 | **kwargs: Any, 20 | ) -> None: ... 21 | 22 | @abstractmethod 23 | @contextmanager 24 | def acquire(self, *, token_count: int) -> Iterator[None]: 25 | """ 26 | Acquire Rate Limiter. 27 | 28 | Args 29 | ---- 30 | token_count: The estimated number of tokens for the current request. 31 | 32 | Yields 33 | ------ 34 | None: This context manager does not return any value. 35 | """ 36 | msg = "RateLimiter subclasses must implement the acquire method." 
37 | raise NotImplementedError(msg) 38 | -------------------------------------------------------------------------------- /.vsts-ci.yml: -------------------------------------------------------------------------------- 1 | name: GraphRAG CI 2 | pool: 3 | vmImage: ubuntu-latest 4 | 5 | trigger: 6 | batch: true 7 | branches: 8 | include: 9 | - main 10 | 11 | variables: 12 | isMain: $[eq(variables['Build.SourceBranch'], 'refs/heads/main')] 13 | pythonVersion: "3.10" 14 | poetryVersion: "1.6.1" 15 | nodeVersion: "18.x" 16 | artifactsFullFeedName: "Resilience/resilience_python" 17 | 18 | stages: 19 | - stage: Compliance 20 | dependsOn: [] 21 | jobs: 22 | - job: compliance 23 | displayName: Compliance 24 | pool: 25 | vmImage: windows-latest 26 | steps: 27 | - task: CredScan@3 28 | inputs: 29 | outputFormat: sarif 30 | debugMode: false 31 | 32 | - task: ComponentGovernanceComponentDetection@0 33 | inputs: 34 | scanType: "Register" 35 | verbosity: "Verbose" 36 | alertWarningLevel: "High" 37 | 38 | - task: PublishSecurityAnalysisLogs@3 39 | inputs: 40 | ArtifactName: "CodeAnalysisLogs" 41 | ArtifactType: "Container" -------------------------------------------------------------------------------- /graphrag/index/typing/workflow.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Pipeline workflow types.""" 5 | 6 | from collections.abc import Awaitable, Callable 7 | from dataclasses import dataclass 8 | from typing import Any 9 | 10 | from graphrag.config.models.graph_rag_config import GraphRagConfig 11 | from graphrag.index.typing.context import PipelineRunContext 12 | 13 | 14 | @dataclass 15 | class WorkflowFunctionOutput: 16 | """Data container for Workflow function results.""" 17 | 18 | result: Any | None 19 | """The result of the workflow function. 
This can be anything - we use it only for logging downstream, and expect each workflow function to write official outputs to the provided storage.""" 20 | stop: bool = False 21 | """Flag to indicate if the workflow should stop after this function. This should only be used when continuation could cause an unstable failure.""" 22 | 23 | 24 | WorkflowFunction = Callable[ 25 | [GraphRagConfig, PipelineRunContext], 26 | Awaitable[WorkflowFunctionOutput], 27 | ] 28 | Workflow = tuple[str, WorkflowFunction] 29 | -------------------------------------------------------------------------------- /.github/workflows/issues-autoresolve.yml: -------------------------------------------------------------------------------- 1 | name: Close inactive issues 2 | on: 3 | schedule: 4 | - cron: "30 1 * * *" 5 | 6 | permissions: 7 | actions: write 8 | issues: write 9 | pull-requests: write 10 | 11 | jobs: 12 | close-issues: 13 | runs-on: ubuntu-latest 14 | permissions: 15 | issues: write 16 | pull-requests: write 17 | steps: 18 | - uses: actions/stale@v9 19 | with: 20 | days-before-issue-stale: 7 21 | days-before-issue-close: 5 22 | stale-issue-label: "stale" 23 | close-issue-label: "autoresolved" 24 | stale-issue-message: "This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days." 25 | close-issue-message: "This issue has been closed after being marked as stale for five days. Please reopen if needed." 
26 | any-of-labels: "awaiting_response" 27 | days-before-pr-stale: -1 28 | days-before-pr-close: -1 29 | repo-token: ${{ secrets.GITHUB_TOKEN }} 30 | -------------------------------------------------------------------------------- /unified-search-app/.vsts-ci.yml: -------------------------------------------------------------------------------- 1 | name: unified-search-app 2 | pool: 3 | vmImage: ubuntu-latest 4 | 5 | trigger: 6 | batch: true 7 | branches: 8 | include: 9 | - main 10 | paths: 11 | include: 12 | - unified-search-app 13 | 14 | 15 | stages: 16 | - stage: Build_deploy 17 | dependsOn: [] 18 | jobs: 19 | - job: build 20 | displayName: Build and deploy 21 | pool: 22 | vmImage: ubuntu-latest 23 | steps: 24 | - task: UsePythonVersion@0 25 | inputs: 26 | versionSpec: "3.11" 27 | displayName: "Use Python 3.11" 28 | 29 | - task: Docker@2 30 | inputs: 31 | containerRegistry: '$(containerRegistry)' 32 | repository: 'main' 33 | command: 'buildAndPush' 34 | Dockerfile: 'unified-search-app/Dockerfile' 35 | tags: 'latest' 36 | - task: AzureAppServiceManage@0 37 | inputs: 38 | azureSubscription: '$(subscription)' 39 | Action: 'Restart Azure App Service' 40 | WebAppName: '$(webApp)' 41 | -------------------------------------------------------------------------------- /.semversioner/0.3.2.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add context data to query API responses.", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Add missing config parameter documentation for prompt tuning", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Add neo4j community notebook", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Ensure entity types to be str when running prompt tuning", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Fix weight casting during graph extraction", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Patch \"past\" dependency issues", 25 | 
"type": "patch" 26 | }, 27 | { 28 | "description": "Update developer guide.", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "Update query type hints.", 33 | "type": "patch" 34 | }, 35 | { 36 | "description": "change-lancedb-placement", 37 | "type": "patch" 38 | } 39 | ], 40 | "created_at": "2024-08-26T23:43:01+00:00", 41 | "version": "0.3.2" 42 | } -------------------------------------------------------------------------------- /graphrag/language_model/providers/litellm/services/retry/retry.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Retry Abstract Base Class.""" 5 | 6 | from abc import ABC, abstractmethod 7 | from collections.abc import Awaitable, Callable 8 | from typing import Any 9 | 10 | 11 | class Retry(ABC): 12 | """LiteLLM Retry Abstract Base Class.""" 13 | 14 | @abstractmethod 15 | def __init__(self, /, **kwargs: Any): 16 | msg = "Retry subclasses must implement the __init__ method." 17 | raise NotImplementedError(msg) 18 | 19 | @abstractmethod 20 | def retry(self, func: Callable[..., Any], **kwargs: Any) -> Any: 21 | """Retry a synchronous function.""" 22 | msg = "Subclasses must implement this method" 23 | raise NotImplementedError(msg) 24 | 25 | @abstractmethod 26 | async def aretry( 27 | self, 28 | func: Callable[..., Awaitable[Any]], 29 | **kwargs: Any, 30 | ) -> Any: 31 | """Retry an asynchronous function.""" 32 | msg = "Subclasses must implement this method" 33 | raise NotImplementedError(msg) 34 | -------------------------------------------------------------------------------- /graphrag/index/operations/finalize_community_reports.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """All the steps to transform final community reports.""" 5 | 6 | from uuid import uuid4 7 | 8 | import pandas as pd 9 | 10 | from graphrag.data_model.schemas import COMMUNITY_REPORTS_FINAL_COLUMNS 11 | 12 | 13 | def finalize_community_reports( 14 | reports: pd.DataFrame, 15 | communities: pd.DataFrame, 16 | ) -> pd.DataFrame: 17 | """All the steps to transform final community reports.""" 18 | # Merge with communities to add shared fields 19 | community_reports = reports.merge( 20 | communities.loc[:, ["community", "parent", "children", "size", "period"]], 21 | on="community", 22 | how="left", 23 | copy=False, 24 | ) 25 | 26 | community_reports["community"] = community_reports["community"].astype(int) 27 | community_reports["human_readable_id"] = community_reports["community"] 28 | community_reports["id"] = [uuid4().hex for _ in range(len(community_reports))] 29 | 30 | return community_reports.loc[ 31 | :, 32 | COMMUNITY_REPORTS_FINAL_COLUMNS, 33 | ] 34 | -------------------------------------------------------------------------------- /tests/unit/indexing/test_init_content.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | import re 5 | from typing import Any, cast 6 | 7 | import yaml 8 | 9 | from graphrag.config.create_graphrag_config import create_graphrag_config 10 | from graphrag.config.init_content import INIT_YAML 11 | from graphrag.config.models.graph_rag_config import GraphRagConfig 12 | 13 | 14 | def test_init_yaml(): 15 | data = yaml.load(INIT_YAML, Loader=yaml.FullLoader) 16 | config = create_graphrag_config(data) 17 | GraphRagConfig.model_validate(config, strict=True) 18 | 19 | 20 | def test_init_yaml_uncommented(): 21 | lines = INIT_YAML.splitlines() 22 | lines = [line for line in lines if "##" not in line] 23 | 24 | def uncomment_line(line: str) -> str: 25 | leading_whitespace = cast("Any", re.search(r"^(\s*)", line)).group(1) 26 | return re.sub(r"^\s*# ", leading_whitespace, line, count=1) 27 | 28 | content = "\n".join([uncomment_line(line) for line in lines]) 29 | data = yaml.load(content, Loader=yaml.FullLoader) 30 | config = create_graphrag_config(data) 31 | GraphRagConfig.model_validate(config, strict=True) 32 | -------------------------------------------------------------------------------- /graphrag/index/operations/build_noun_graph/np_extractors/resource_loader.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Util functions needed for nltk-based noun-phrase extractors (i.e. 
TextBlob).""" 5 | 6 | import nltk 7 | 8 | 9 | def download_if_not_exists(resource_name) -> bool: 10 | """Download an nltk resource if it isn't already available; return True if it was already present.""" 11 | # look under all possible categories 12 | root_categories = [ 13 | "corpora", 14 | "tokenizers", 15 | "taggers", 16 | "chunkers", 17 | "classifiers", 18 | "stemmers", 19 | "stopwords", 20 | "languages", 21 | "frequent", 22 | "gate", 23 | "models", 24 | "mt", 25 | "sentiment", 26 | "similarity", 27 | ] 28 | for category in root_categories: 29 | try: 30 | # if found, stop looking and avoid downloading 31 | nltk.find(f"{category}/{resource_name}") 32 | return True # noqa: TRY300 33 | except LookupError: 34 | continue 35 | 36 | # if not found, download 37 | nltk.download(resource_name) 38 | return False 39 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/template/entity_summarization.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Fine-tuning prompts for entity summarization.""" 5 | 6 | ENTITY_SUMMARIZATION_PROMPT = """ 7 | {persona} 8 | Using your expertise, you're asked to generate a comprehensive summary of the data provided below. 9 | Given one or two entities, and a list of descriptions, all related to the same entity or group of entities. 10 | Please concatenate all of these into a single, concise description in {language}. Make sure to include information collected from all the descriptions. 11 | If the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary. 12 | Make sure it is written in third person, and include the entity names so we have the full context. 13 | 14 | Enrich it as much as you can with relevant information from the nearby text, this is very important. 
15 | 16 | If no answer is possible, or the description is empty, only convey information that is provided within the text. 17 | ####### 18 | -Data- 19 | Entities: {{entity_name}} 20 | Description List: {{description_list}} 21 | ####### 22 | Output:""" 23 | -------------------------------------------------------------------------------- /unified-search-app/pyproject.toml: -------------------------------------------------------------------------------- 1 | [project] 2 | name = "unified-copilot" 3 | version = "1.0.0" 4 | description = "" 5 | authors = [ 6 | {name = "GraphRAG team"}, 7 | ] 8 | readme = "README.md" 9 | requires-python = ">=3.10,<3.12" 10 | 11 | dependencies = [ 12 | "streamlit==1.43.0", 13 | "azure-search-documents>=11.4.0", 14 | "azure-storage-blob>=12.20.0", 15 | "azure-identity>=1.16.0", 16 | "graphrag==2.0.0", 17 | "altair>=5.3.0", 18 | "streamlit-agraph>=0.0.45", 19 | "st-tabs>=0.1.1", 20 | "spacy>=3.8.4,<4.0.0", 21 | ] 22 | 23 | [project.optional-dependencies] 24 | dev = [ 25 | "poethepoet>=0.26.1", 26 | "ipykernel>=6.29.4", 27 | "pyright>=1.1.349", 28 | "ruff>=0.4.7", 29 | ] 30 | 31 | [build-system] 32 | requires = ["setuptools>=64", "wheel"] 33 | build-backend = "setuptools.build_meta" 34 | 35 | [tool.setuptools.packages.find] 36 | include = ["app*"] 37 | exclude = ["images*"] 38 | 39 | [tool.poe.tasks] 40 | start = "streamlit run app/home_page.py" 41 | start_prod = "streamlit run app/home_page.py --server.port=8501 --server.address=0.0.0.0" 42 | 43 | [tool.pyright] 44 | include = ["app"] 45 | exclude = ["**/node_modules", "**/__pycache__"] 46 | -------------------------------------------------------------------------------- /.github/workflows/gh-pages.yml: -------------------------------------------------------------------------------- 1 | name: gh-pages 2 | on: 3 | push: 4 | branches: [main] 5 | permissions: 6 | contents: write 7 | 8 | env: 9 | PYTHON_VERSION: "3.11" 10 | 11 | jobs: 12 | build: 13 | runs-on: ubuntu-latest 14 | env: 15 | 
GH_PAGES: 1 16 | DEBUG: 1 17 | GRAPHRAG_API_KEY: ${{ secrets.GRAPHRAG_API_KEY }} 18 | 19 | steps: 20 | - uses: actions/checkout@v4 21 | with: 22 | persist-credentials: false 23 | 24 | - name: Set up Python ${{ env.PYTHON_VERSION }} 25 | uses: actions/setup-python@v5 26 | with: 27 | python-version: ${{ env.PYTHON_VERSION }} 28 | 29 | - name: Install uv 30 | uses: astral-sh/setup-uv@v6 31 | 32 | - name: Install dependencies 33 | shell: bash 34 | run: uv sync 35 | 36 | - name: mkdocs build 37 | shell: bash 38 | run: uv run poe build_docs 39 | 40 | - name: List Docsite Contents 41 | run: find site 42 | 43 | - name: Deploy to GitHub Pages 44 | uses: JamesIves/github-pages-deploy-action@v4.6.4 45 | with: 46 | branch: gh-pages 47 | folder: site 48 | clean: true 49 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Microsoft Corporation. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE 22 | -------------------------------------------------------------------------------- /graphrag/callbacks/noop_workflow_callbacks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A no-op implementation of WorkflowCallbacks.""" 5 | 6 | from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks 7 | from graphrag.index.typing.pipeline_run_result import PipelineRunResult 8 | from graphrag.logger.progress import Progress 9 | 10 | 11 | class NoopWorkflowCallbacks(WorkflowCallbacks): 12 | """A no-op implementation of WorkflowCallbacks; every callback is intentionally empty.""" 13 | 14 | def pipeline_start(self, names: list[str]) -> None: 15 | """Execute this callback to signal when the entire pipeline starts.""" 16 | 17 | def pipeline_end(self, results: list[PipelineRunResult]) -> None: 18 | """Execute this callback to signal when the entire pipeline ends.""" 19 | 20 | def workflow_start(self, name: str, instance: object) -> None: 21 | """Execute this callback when a workflow starts.""" 22 | 23 | def workflow_end(self, name: str, instance: object) -> None: 24 | """Execute this callback when a workflow ends.""" 25 | 26 | def progress(self, progress: Progress) -> None: 27 | """Handle when progress occurs.""" 28 | -------------------------------------------------------------------------------- /graphrag/callbacks/query_callbacks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Query Callbacks.""" 5 | 6 | from typing import Any 7 | 8 | from graphrag.callbacks.llm_callbacks import BaseLLMCallback 9 | from graphrag.query.structured_search.base import SearchResult 10 | 11 | 12 | class QueryCallbacks(BaseLLMCallback): 13 | """Callbacks used during query execution.""" 14 | 15 | def on_context(self, context: Any) -> None: 16 | """Handle when context data is constructed.""" 17 | 18 | def on_map_response_start(self, map_response_contexts: list[str]) -> None: 19 | """Handle the start of map operation.""" 20 | 21 | def on_map_response_end(self, map_response_outputs: list[SearchResult]) -> None: 22 | """Handle the end of map operation.""" 23 | 24 | def on_reduce_response_start( 25 | self, reduce_response_context: str | dict[str, Any] 26 | ) -> None: 27 | """Handle the start of reduce operation.""" 28 | 29 | def on_reduce_response_end(self, reduce_response_output: str) -> None: 30 | """Handle the end of reduce operation.""" 31 | 32 | def on_llm_new_token(self, token) -> None: 33 | """Handle when a new token is generated.""" 34 | -------------------------------------------------------------------------------- /tests/unit/litellm_services/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Test Utilities.""" 5 | 6 | 7 | def bin_time_intervals( 8 | time_values: list[float], time_interval: int 9 | ) -> list[list[float]]: 10 | """Bin values.""" 11 | bins: list[list[float]] = [] 12 | 13 | bin_number = 0 14 | for time_value in time_values: 15 | upper_bound = (bin_number * time_interval) + time_interval 16 | while time_value >= upper_bound: 17 | bin_number += 1 18 | upper_bound = (bin_number * time_interval) + time_interval 19 | while len(bins) <= bin_number: 20 | bins.append([]) 21 | bins[bin_number].append(time_value) 22 | 23 | return bins 24 | 25 | 26 | def assert_max_num_values_per_period( 27 | periods: list[list[float]], max_values_per_period: int 28 | ): 29 | """Assert the number of values per period.""" 30 | for period in periods: 31 | assert len(period) <= max_values_per_period 32 | 33 | 34 | def assert_stagger(time_values: list[float], stagger: float): 35 | """Assert stagger.""" 36 | for i in range(1, len(time_values)): 37 | assert time_values[i] - time_values[i - 1] >= stagger 38 | -------------------------------------------------------------------------------- /tests/verbs/test_extract_graph_nlp.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | from graphrag.config.create_graphrag_config import create_graphrag_config 5 | from graphrag.index.workflows.extract_graph_nlp import ( 6 | run_workflow, 7 | ) 8 | from graphrag.utils.storage import load_table_from_storage 9 | 10 | from .util import ( 11 | DEFAULT_MODEL_CONFIG, 12 | create_test_context, 13 | ) 14 | 15 | 16 | async def test_extract_graph_nlp(): 17 | context = await create_test_context( 18 | storage=["text_units"], 19 | ) 20 | 21 | config = create_graphrag_config({"models": DEFAULT_MODEL_CONFIG}) 22 | 23 | await run_workflow(config, context) 24 | 25 | nodes_actual = await load_table_from_storage("entities", context.output_storage) 26 | edges_actual = await load_table_from_storage( 27 | "relationships", context.output_storage 28 | ) 29 | 30 | # this will be the raw count of entities and edges with no pruning 31 | # with NLP it is deterministic, so we can assert exact row counts 32 | assert len(nodes_actual) == 1148 33 | assert len(nodes_actual.columns) == 5 34 | assert len(edges_actual) == 29445 35 | assert len(edges_actual.columns) == 5 36 | -------------------------------------------------------------------------------- /graphrag/callbacks/noop_query_callbacks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """No-op Query Callbacks.""" 5 | 6 | from typing import Any 7 | 8 | from graphrag.callbacks.query_callbacks import QueryCallbacks 9 | from graphrag.query.structured_search.base import SearchResult 10 | 11 | 12 | class NoopQueryCallbacks(QueryCallbacks): 13 | """A no-op implementation of QueryCallbacks.""" 14 | 15 | def on_context(self, context: Any) -> None: 16 | """Handle when context data is constructed.""" 17 | 18 | def on_map_response_start(self, map_response_contexts: list[str]) -> None: 19 | """Handle the start of map operation.""" 20 | 21 | def on_map_response_end(self, map_response_outputs: list[SearchResult]) -> None: 22 | """Handle the end of map operation.""" 23 | 24 | def on_reduce_response_start( 25 | self, reduce_response_context: str | dict[str, Any] 26 | ) -> None: 27 | """Handle the start of reduce operation.""" 28 | 29 | def on_reduce_response_end(self, reduce_response_output: str) -> None: 30 | """Handle the end of reduce operation.""" 31 | 32 | def on_llm_new_token(self, token): 33 | """Handle when a new token is generated.""" 34 | -------------------------------------------------------------------------------- /graphrag/index/input/text.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing load method definition.""" 5 | 6 | import logging 7 | from pathlib import Path 8 | 9 | import pandas as pd 10 | 11 | from graphrag.config.models.input_config import InputConfig 12 | from graphrag.index.input.util import load_files 13 | from graphrag.index.utils.hashing import gen_sha512_hash 14 | from graphrag.storage.pipeline_storage import PipelineStorage 15 | 16 | logger = logging.getLogger(__name__) 17 | 18 | 19 | async def load_text( 20 | config: InputConfig, 21 | storage: PipelineStorage, 22 | ) -> pd.DataFrame: 23 | """Load text inputs from a directory.""" 24 | 25 | async def load_file(path: str, group: dict | None = None) -> pd.DataFrame: 26 | if group is None: 27 | group = {} 28 | text = await storage.get(path, encoding=config.encoding) 29 | new_item = {**group, "text": text} 30 | new_item["id"] = gen_sha512_hash(new_item, new_item.keys()) 31 | new_item["title"] = str(Path(path).name) 32 | new_item["creation_date"] = await storage.get_creation_date(path) 33 | return pd.DataFrame([new_item]) 34 | 35 | return await load_files(load_file, config, storage) 36 | -------------------------------------------------------------------------------- /tests/fixtures/min-csv/settings.yml: -------------------------------------------------------------------------------- 1 | models: 2 | default_chat_model: 3 | azure_auth_type: api_key 4 | type: chat 5 | model_provider: azure 6 | api_key: ${GRAPHRAG_API_KEY} 7 | api_base: ${GRAPHRAG_API_BASE} 8 | api_version: "2025-04-01-preview" 9 | deployment_name: gpt-4.1 10 | model: gpt-4.1 11 | retry_strategy: exponential_backoff 12 | tokens_per_minute: null 13 | requests_per_minute: null 14 | model_supports_json: true 15 | concurrent_requests: 25 16 | async_mode: threaded 17 | default_embedding_model: 18 | azure_auth_type: api_key 19 | type: embedding 20 | model_provider: azure 21 | api_key: ${GRAPHRAG_API_KEY} 22 | api_base: ${GRAPHRAG_API_BASE} 23 | api_version: 
"2025-04-01-preview" 24 | deployment_name: text-embedding-ada-002 25 | model: text-embedding-ada-002 26 | retry_strategy: exponential_backoff 27 | tokens_per_minute: null 28 | requests_per_minute: null 29 | concurrent_requests: 25 30 | async_mode: threaded 31 | 32 | vector_store: 33 | default_vector_store: 34 | type: "lancedb" 35 | db_uri: "./tests/fixtures/min-csv/lancedb" 36 | container_name: "lancedb_ci" 37 | overwrite: True 38 | 39 | input: 40 | file_type: csv 41 | 42 | snapshots: 43 | embeddings: true -------------------------------------------------------------------------------- /.semversioner/0.9.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Refactor graph creation.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Dependency updates", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Fix Global Search with dynamic Community selection bug", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Fix question gen.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Optimize Final Community Reports calculation and stabilize cache", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "miscellaneous code cleanup and minor changes for better alignment of style across the codebase.", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "replace llm package with fnllm", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "replaced md5 hash with sha256", 33 | "type": "patch" 34 | }, 35 | { 36 | "description": "replaced md5 hash with sha512", 37 | "type": "patch" 38 | }, 39 | { 40 | "description": "update API and add a demonstration notebook", 41 | "type": "patch" 42 | } 43 | ], 44 | "created_at": "2024-12-06T20:12:30+00:00", 45 | "version": "0.9.0" 46 | } -------------------------------------------------------------------------------- /.semversioner/2.2.0.json: -------------------------------------------------------------------------------- 1 | 
{ 2 | "changes": [ 3 | { 4 | "description": "Support OpenAI reasoning models.", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Add option to snapshot raw extracted graph tables.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Added batching logic to the prompt tuning autoselection embeddings workflow", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Align config classes and docs better.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Align embeddings table loading with configured fields.", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Brings parity with our latest NLP extraction approaches.", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "Fix fnllm to 0.2.3", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "Fixes to basic search.", 33 | "type": "patch" 34 | }, 35 | { 36 | "description": "Update llm args for consistency.", 37 | "type": "patch" 38 | }, 39 | { 40 | "description": "add vector store integration tests", 41 | "type": "patch" 42 | } 43 | ], 44 | "created_at": "2025-04-25T23:30:57+00:00", 45 | "version": "2.2.0" 46 | } -------------------------------------------------------------------------------- /.semversioner/0.3.5.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Add compound verbs with tests infra.", 5 | "type": "patch" 6 | }, 7 | { 8 | "description": "Collapse create_final_communities.", 9 | "type": "patch" 10 | }, 11 | { 12 | "description": "Collapse create_final_text_units.", 13 | "type": "patch" 14 | }, 15 | { 16 | "description": "Covariate verb collapse.", 17 | "type": "patch" 18 | }, 19 | { 20 | "description": "Fix duplicates in community context builder", 21 | "type": "patch" 22 | }, 23 | { 24 | "description": "Fix prompt tune output path", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "Fix seed hardcoded init", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "Fix seeded random 
gen on clustering", 33 | "type": "patch" 34 | }, 35 | { 36 | "description": "Improve logging.", 37 | "type": "patch" 38 | }, 39 | { 40 | "description": "Set default values for cli parameters.", 41 | "type": "patch" 42 | }, 43 | { 44 | "description": "Use static output directories.", 45 | "type": "patch" 46 | } 47 | ], 48 | "created_at": "2024-09-19T15:26:01+00:00", 49 | "version": "0.3.5" 50 | } -------------------------------------------------------------------------------- /tests/verbs/test_create_final_text_units.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | from graphrag.config.create_graphrag_config import create_graphrag_config 5 | from graphrag.data_model.schemas import TEXT_UNITS_FINAL_COLUMNS 6 | from graphrag.index.workflows.create_final_text_units import ( 7 | run_workflow, 8 | ) 9 | from graphrag.utils.storage import load_table_from_storage 10 | 11 | from .util import ( 12 | DEFAULT_MODEL_CONFIG, 13 | compare_outputs, 14 | create_test_context, 15 | load_test_table, 16 | ) 17 | 18 | 19 | async def test_create_final_text_units(): 20 | expected = load_test_table("text_units") 21 | 22 | context = await create_test_context( 23 | storage=[ 24 | "text_units", 25 | "entities", 26 | "relationships", 27 | "covariates", 28 | ], 29 | ) 30 | 31 | config = create_graphrag_config({"models": DEFAULT_MODEL_CONFIG}) 32 | config.extract_claims.enabled = True 33 | 34 | await run_workflow(config, context) 35 | 36 | actual = await load_table_from_storage("text_units", context.output_storage) 37 | 38 | for column in TEXT_UNITS_FINAL_COLUMNS: 39 | assert column in actual.columns 40 | 41 | compare_outputs(actual, expected) 42 | -------------------------------------------------------------------------------- /graphrag/config/models/basic_search_config.py: -------------------------------------------------------------------------------- 1 | # 
Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | 10 | 11 | class BasicSearchConfig(BaseModel): 12 | """The default configuration section for Basic Search.""" 13 | 14 | prompt: str | None = Field( 15 | description="The basic search prompt to use.", 16 | default=graphrag_config_defaults.basic_search.prompt, 17 | ) 18 | chat_model_id: str = Field( 19 | description="The model ID to use for basic search.", 20 | default=graphrag_config_defaults.basic_search.chat_model_id, 21 | ) 22 | embedding_model_id: str = Field( 23 | description="The model ID to use for text embeddings.", 24 | default=graphrag_config_defaults.basic_search.embedding_model_id, 25 | ) 26 | k: int = Field( 27 | description="The number of text units to include in search context.", 28 | default=graphrag_config_defaults.basic_search.k, 29 | ) 30 | max_context_tokens: int = Field( 31 | description="The maximum number of tokens in the search context.", 32 | default=graphrag_config_defaults.basic_search.max_context_tokens, 33 | ) 34 | -------------------------------------------------------------------------------- /graphrag/index/operations/summarize_descriptions/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing 'SummarizedDescriptionResult' model.""" 5 | 6 | from collections.abc import Awaitable, Callable 7 | from dataclasses import dataclass 8 | from enum import Enum 9 | from typing import Any, NamedTuple 10 | 11 | from graphrag.cache.pipeline_cache import PipelineCache 12 | 13 | StrategyConfig = dict[str, Any] 14 | 15 | 16 | @dataclass 17 | class SummarizedDescriptionResult: 18 | """Entity summarization result class definition.""" 19 | 20 | id: str | tuple[str, str] 21 | description: str 22 | 23 | 24 | SummarizationStrategy = Callable[ 25 | [ 26 | str | tuple[str, str], 27 | list[str], 28 | PipelineCache, 29 | StrategyConfig, 30 | ], 31 | Awaitable[SummarizedDescriptionResult], 32 | ] 33 | 34 | 35 | class DescriptionSummarizeRow(NamedTuple): 36 | """DescriptionSummarizeRow class definition.""" 37 | 38 | graph: Any 39 | 40 | 41 | class SummarizeStrategyType(str, Enum): 42 | """SummarizeStrategyType class definition.""" 43 | 44 | graph_intelligence = "graph_intelligence" 45 | 46 | def __repr__(self): 47 | """Get a string representation.""" 48 | return f'"{self.value}"' 49 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/community_reporter_role.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Generate a community reporter role for community summarization.""" 5 | 6 | from graphrag.language_model.protocol.base import ChatModel 7 | from graphrag.prompt_tune.prompt.community_reporter_role import ( 8 | GENERATE_COMMUNITY_REPORTER_ROLE_PROMPT, 9 | ) 10 | 11 | 12 | async def generate_community_reporter_role( 13 | model: ChatModel, domain: str, persona: str, docs: str | list[str] 14 | ) -> str: 15 | """Generate an LLM persona to use for GraphRAG prompts. 
16 | 17 | Parameters 18 | ---------- 19 | - model (ChatModel): The LLM to use for generation 20 | - domain (str): The domain to generate a persona for 21 | - persona (str): The persona to generate a role for 22 | - docs (str | list[str]): Documents used to contextualize the role 23 | 24 | Returns 25 | ------- 26 | - str: The generated community reporter role response. 27 | """ 28 | docs_str = " ".join(docs) if isinstance(docs, list) else docs 29 | domain_prompt = GENERATE_COMMUNITY_REPORTER_ROLE_PROMPT.format( 30 | domain=domain, persona=persona, input_text=docs_str 31 | ) 32 | 33 | response = await model.achat(domain_prompt) 34 | 35 | return str(response.output.content) 36 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_graph/embed_node2vec.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Utilities to generate graph embeddings.""" 5 | 6 | from dataclasses import dataclass 7 | 8 | import networkx as nx 9 | import numpy as np 10 | 11 | 12 | @dataclass 13 | class NodeEmbeddings: 14 | """Node embeddings class definition.""" 15 | 16 | nodes: list[str] 17 | embeddings: np.ndarray 18 | 19 | 20 | def embed_node2vec( 21 | graph: nx.Graph | nx.DiGraph, 22 | dimensions: int = 1536, 23 | num_walks: int = 10, 24 | walk_length: int = 40, 25 | window_size: int = 2, 26 | iterations: int = 3, 27 | random_seed: int = 86, 28 | ) -> NodeEmbeddings: 29 | """Generate node embeddings using Node2Vec.""" 30 | # NOTE: This import is done here to reduce the initial import time of the graphrag package 31 | import graspologic as gc 32 | 33 | # generate embedding 34 | lcc_tensors = gc.embed.node2vec_embed( # type: ignore 35 | graph=graph, 36 | dimensions=dimensions, 37 | window_size=window_size, 38 | iterations=iterations, 39 | num_walks=num_walks, 40 | walk_length=walk_length, 41 | random_seed=random_seed, 
42 | ) 43 | return NodeEmbeddings(embeddings=lcc_tensors[0], nodes=lcc_tensors[1]) 44 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/community_report_rating.py: -------------------------------------------------------------------------------- 1 | """Generate a rating description for community report rating.""" 2 | 3 | # Copyright (c) 2024 Microsoft Corporation. 4 | # Licensed under the MIT License 5 | 6 | from graphrag.language_model.protocol.base import ChatModel 7 | from graphrag.prompt_tune.prompt.community_report_rating import ( 8 | GENERATE_REPORT_RATING_PROMPT, 9 | ) 10 | 11 | 12 | async def generate_community_report_rating( 13 | model: ChatModel, domain: str, persona: str, docs: str | list[str] 14 | ) -> str: 15 | """Generate a rating description for community report rating. 16 | 17 | Parameters 18 | ---------- 19 | - model (ChatModel): The LLM to use for generation 20 | - domain (str): The domain to generate a rating for 21 | - persona (str): The persona to generate a rating for 22 | - docs (str | list[str]): Documents used to contextualize the rating 23 | 24 | Returns 25 | ------- 26 | - str: The generated rating description prompt response. 27 | """ 28 | docs_str = " ".join(docs) if isinstance(docs, list) else docs 29 | domain_prompt = GENERATE_REPORT_RATING_PROMPT.format( 30 | domain=domain, persona=persona, input_text=docs_str 31 | ) 32 | 33 | response = await model.achat(domain_prompt) 34 | 35 | return str(response.output.content).strip() 36 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | 14 | 15 | ## Description 16 | 17 | [Provide a brief description of the changes made in this pull request.] 18 | 19 | ## Related Issues 20 | 21 | [Reference any related issues or tasks that this pull request addresses.] 
22 | 23 | ## Proposed Changes 24 | 25 | [List the specific changes made in this pull request.] 26 | 27 | ## Checklist 28 | 29 | - [ ] I have tested these changes locally. 30 | - [ ] I have reviewed the code changes. 31 | - [ ] I have updated the documentation (if necessary). 32 | - [ ] I have added appropriate unit tests (if applicable). 33 | 34 | ## Additional Notes 35 | 36 | [Add any additional notes or context that may be helpful for the reviewer(s).] 37 | -------------------------------------------------------------------------------- /graphrag/index/workflows/update_final_documents.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing run_workflow method definition.""" 5 | 6 | import logging 7 | 8 | from graphrag.config.models.graph_rag_config import GraphRagConfig 9 | from graphrag.index.run.utils import get_update_storages 10 | from graphrag.index.typing.context import PipelineRunContext 11 | from graphrag.index.typing.workflow import WorkflowFunctionOutput 12 | from graphrag.index.update.incremental_index import concat_dataframes 13 | 14 | logger = logging.getLogger(__name__) 15 | 16 | 17 | async def run_workflow( 18 | config: GraphRagConfig, 19 | context: PipelineRunContext, 20 | ) -> WorkflowFunctionOutput: 21 | """Update the documents from an incremental index run.""" 22 | logger.info("Workflow started: update_final_documents") 23 | output_storage, previous_storage, delta_storage = get_update_storages( 24 | config, context.state["update_timestamp"] 25 | ) 26 | 27 | final_documents = await concat_dataframes( 28 | "documents", previous_storage, delta_storage, output_storage 29 | ) 30 | 31 | context.state["incremental_update_final_documents"] = final_documents 32 | 33 | logger.info("Workflow completed: update_final_documents") 34 | return WorkflowFunctionOutput(result=None) 35 | 
-------------------------------------------------------------------------------- /graphrag/index/typing/context.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | # isort: skip_file 5 | """A module containing the 'PipelineRunContext' models.""" 6 | 7 | from dataclasses import dataclass 8 | 9 | from graphrag.cache.pipeline_cache import PipelineCache 10 | from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks 11 | from graphrag.index.typing.state import PipelineState 12 | from graphrag.index.typing.stats import PipelineRunStats 13 | from graphrag.storage.pipeline_storage import PipelineStorage 14 | 15 | 16 | @dataclass 17 | class PipelineRunContext: 18 | """Provides the context for the current pipeline run.""" 19 | 20 | stats: PipelineRunStats 21 | input_storage: PipelineStorage 22 | "Storage for input documents." 23 | output_storage: PipelineStorage 24 | "Long-term storage for pipeline verbs to use. Items written here will be written to the storage provider." 25 | previous_storage: PipelineStorage 26 | "Storage for previous pipeline run when running in update mode." 27 | cache: PipelineCache 28 | "Cache instance for reading previous LLM responses." 29 | callbacks: WorkflowCallbacks 30 | "Callbacks to be called during the pipeline run." 31 | state: PipelineState 32 | "Arbitrary property bag for runtime state, persistent pre-computes, or experimental features." 33 | -------------------------------------------------------------------------------- /graphrag/prompt_tune/generator/entity_summarization_prompt.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Entity summarization prompt generation module.""" 5 | 6 | from pathlib import Path 7 | 8 | from graphrag.prompt_tune.template.entity_summarization import ( 9 | ENTITY_SUMMARIZATION_PROMPT, 10 | ) 11 | 12 | ENTITY_SUMMARIZATION_FILENAME = "summarize_descriptions.txt" 13 | 14 | 15 | def create_entity_summarization_prompt( 16 | persona: str, 17 | language: str, 18 | output_path: Path | None = None, 19 | ) -> str: 20 | """ 21 | Create a prompt for entity summarization. 22 | 23 | Parameters 24 | ---------- 25 | - persona (str): The persona to use for the entity summarization prompt 26 | - language (str): The language to use for the entity summarization prompt 27 | - output_path (Path | None): The path to write the prompt to. Default is None. 28 | """ 29 | prompt = ENTITY_SUMMARIZATION_PROMPT.format(persona=persona, language=language) 30 | 31 | if output_path: 32 | output_path.mkdir(parents=True, exist_ok=True) 33 | 34 | output_path = output_path / ENTITY_SUMMARIZATION_FILENAME 35 | # Write file to output path 36 | with output_path.open("wb") as file: 37 | file.write(prompt.encode(encoding="utf-8", errors="strict")) 38 | 39 | return prompt 40 | -------------------------------------------------------------------------------- /tests/notebook/test_notebooks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | import subprocess 4 | from pathlib import Path 5 | 6 | import nbformat 7 | import pytest 8 | 9 | NOTEBOOKS_PATH = Path("examples_notebooks") 10 | EXCLUDED_PATH = NOTEBOOKS_PATH / "community_contrib" 11 | 12 | notebooks_list = [ 13 | notebook 14 | for notebook in NOTEBOOKS_PATH.rglob("*.ipynb") 15 | if EXCLUDED_PATH not in notebook.parents 16 | ] 17 | 18 | 19 | def _notebook_run(filepath: Path): 20 | """Execute a notebook via nbconvert and collect output. 
21 | :returns execution errors 22 | """ 23 | args = [ 24 | "jupyter", 25 | "nbconvert", 26 | "--to", 27 | "notebook", 28 | "--execute", 29 | "-y", 30 | "--no-prompt", 31 | "--stdout", 32 | str(filepath.absolute().resolve()), 33 | ] 34 | notebook = subprocess.check_output(args) 35 | nb = nbformat.reads(notebook, nbformat.current_nbformat) 36 | 37 | return [ 38 | output 39 | for cell in nb.cells 40 | if "outputs" in cell 41 | for output in cell["outputs"] 42 | if output.output_type == "error" 43 | ] 44 | 45 | 46 | @pytest.mark.parametrize("notebook_path", notebooks_list) 47 | def test_notebook(notebook_path: Path): 48 | assert _notebook_run(notebook_path) == [] 49 | -------------------------------------------------------------------------------- /graphrag/config/models/reporting_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | from graphrag.config.enums import ReportingType 10 | 11 | 12 | class ReportingConfig(BaseModel): 13 | """The default configuration section for Reporting.""" 14 | 15 | type: ReportingType | str = Field( 16 | description="The reporting type to use.", 17 | default=graphrag_config_defaults.reporting.type, 18 | ) 19 | base_dir: str = Field( 20 | description="The base directory for reporting.", 21 | default=graphrag_config_defaults.reporting.base_dir, 22 | ) 23 | connection_string: str | None = Field( 24 | description="The reporting connection string to use.", 25 | default=graphrag_config_defaults.reporting.connection_string, 26 | ) 27 | container_name: str | None = Field( 28 | description="The reporting container name to use.", 29 | default=graphrag_config_defaults.reporting.container_name, 30 | ) 31 | storage_account_blob_url: str | 
None = Field( 32 | description="The storage account blob url to use.", 33 | default=graphrag_config_defaults.reporting.storage_account_blob_url, 34 | ) 35 | -------------------------------------------------------------------------------- /graphrag/callbacks/workflow_callbacks.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Collection of callbacks that can be used to monitor the workflow execution.""" 5 | 6 | from typing import Protocol 7 | 8 | from graphrag.index.typing.pipeline_run_result import PipelineRunResult 9 | from graphrag.logger.progress import Progress 10 | 11 | 12 | class WorkflowCallbacks(Protocol): 13 | """ 14 | A collection of callbacks that can be used to monitor the workflow execution. 15 | 16 | This base class is a "noop" implementation so that clients may implement just the callbacks they need. 17 | """ 18 | 19 | def pipeline_start(self, names: list[str]) -> None: 20 | """Execute this callback to signal when the entire pipeline starts.""" 21 | ... 22 | 23 | def pipeline_end(self, results: list[PipelineRunResult]) -> None: 24 | """Execute this callback to signal when the entire pipeline ends.""" 25 | ... 26 | 27 | def workflow_start(self, name: str, instance: object) -> None: 28 | """Execute this callback when a workflow starts.""" 29 | ... 30 | 31 | def workflow_end(self, name: str, instance: object) -> None: 32 | """Execute this callback when a workflow ends.""" 33 | ... 34 | 35 | def progress(self, progress: Progress) -> None: 36 | """Handle when progress occurs.""" 37 | ... 38 | -------------------------------------------------------------------------------- /graphrag/index/operations/embed_text/strategies/mock.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing run and _embed_text methods definitions.""" 5 | 6 | import random 7 | from collections.abc import Iterable 8 | from typing import Any 9 | 10 | from graphrag.cache.pipeline_cache import PipelineCache 11 | from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks 12 | from graphrag.index.operations.embed_text.strategies.typing import TextEmbeddingResult 13 | from graphrag.logger.progress import ProgressTicker, progress_ticker 14 | 15 | 16 | async def run( # noqa RUF029 async is required for interface 17 | input: list[str], 18 | callbacks: WorkflowCallbacks, 19 | cache: PipelineCache, 20 | _args: dict[str, Any], 21 | ) -> TextEmbeddingResult: 22 | """Run the mock text embedding strategy.""" 23 | input = input if isinstance(input, Iterable) else [input] 24 | ticker = progress_ticker( 25 | callbacks.progress, len(input), description="generate embeddings progress: " 26 | ) 27 | return TextEmbeddingResult( 28 | embeddings=[_embed_text(cache, text, ticker) for text in input] 29 | ) 30 | 31 | 32 | def _embed_text(_cache: PipelineCache, _text: str, tick: ProgressTicker) -> list[float]: 33 | """Embed a single piece of text as a random three-dimensional vector.""" 34 | tick(1) 35 | return [random.random(), random.random(), random.random()] # noqa S311 36 | -------------------------------------------------------------------------------- /graphrag/tokenizer/tiktoken_tokenizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Tiktoken Tokenizer.""" 5 | 6 | import tiktoken 7 | 8 | from graphrag.tokenizer.tokenizer import Tokenizer 9 | 10 | 11 | class TiktokenTokenizer(Tokenizer): 12 | """Tiktoken Tokenizer.""" 13 | 14 | def __init__(self, encoding_name: str) -> None: 15 | """Initialize the Tiktoken Tokenizer. 16 | 17 | Args 18 | ---- 19 | encoding_name (str): The name of the Tiktoken encoding to use for tokenization. 
20 | """ 21 | self.encoding = tiktoken.get_encoding(encoding_name) 22 | 23 | def encode(self, text: str) -> list[int]: 24 | """Encode the given text into a list of tokens. 25 | 26 | Args 27 | ---- 28 | text (str): The input text to encode. 29 | 30 | Returns 31 | ------- 32 | list[int]: A list of tokens representing the encoded text. 33 | """ 34 | return self.encoding.encode(text) 35 | 36 | def decode(self, tokens: list[int]) -> str: 37 | """Decode a list of tokens back into a string. 38 | 39 | Args 40 | ---- 41 | tokens (list[int]): A list of tokens to decode. 42 | 43 | Returns 44 | ------- 45 | str: The decoded string from the list of tokens. 46 | """ 47 | return self.encoding.decode(tokens) 48 | -------------------------------------------------------------------------------- /.github/workflows/python-publish.yml: -------------------------------------------------------------------------------- 1 | name: Python Publish (pypi) 2 | on: 3 | release: 4 | types: [created] 5 | push: 6 | branches: [main] 7 | 8 | env: 9 | PYTHON_VERSION: "3.10" 10 | 11 | jobs: 12 | publish: 13 | name: Upload release to PyPI 14 | if: github.ref == 'refs/heads/main' 15 | runs-on: ubuntu-latest 16 | environment: 17 | name: pypi 18 | url: https://pypi.org/p/graphrag 19 | permissions: 20 | id-token: write # IMPORTANT: this permission is mandatory for trusted publishing 21 | 22 | steps: 23 | - uses: actions/checkout@v4 24 | with: 25 | fetch-depth: 0 26 | fetch-tags: true 27 | 28 | - name: Set up Python 29 | uses: actions/setup-python@v5 30 | with: 31 | python-version: ${{ env.PYTHON_VERSION }} 32 | 33 | - name: Install uv 34 | uses: astral-sh/setup-uv@v6 35 | 36 | - name: Install dependencies 37 | shell: bash 38 | run: uv sync 39 | 40 | - name: Export Publication Version 41 | run: echo "version=$(uv version --short)" >> $GITHUB_OUTPUT 42 | 43 | - name: Build Distributable 44 | shell: bash 45 | run: uv build 46 | 47 | - name: Publish package distributions to PyPI 48 | uses: 
pypa/gh-action-pypi-publish@release/v1 49 | with: 50 | packages-dir: dist 51 | skip-existing: true 52 | verbose: true 53 | -------------------------------------------------------------------------------- /tests/fixtures/text/settings.yml: -------------------------------------------------------------------------------- 1 | models: 2 | default_chat_model: 3 | azure_auth_type: api_key 4 | type: chat 5 | model_provider: azure 6 | api_key: ${GRAPHRAG_API_KEY} 7 | api_base: ${GRAPHRAG_API_BASE} 8 | api_version: "2025-04-01-preview" 9 | deployment_name: gpt-4.1 10 | model: gpt-4.1 11 | retry_strategy: exponential_backoff 12 | tokens_per_minute: null 13 | requests_per_minute: null 14 | model_supports_json: true 15 | concurrent_requests: 25 16 | async_mode: threaded 17 | default_embedding_model: 18 | azure_auth_type: api_key 19 | type: embedding 20 | model_provider: azure 21 | api_key: ${GRAPHRAG_API_KEY} 22 | api_base: ${GRAPHRAG_API_BASE} 23 | api_version: "2025-04-01-preview" 24 | deployment_name: text-embedding-ada-002 25 | model: text-embedding-ada-002 26 | retry_strategy: exponential_backoff 27 | tokens_per_minute: null 28 | requests_per_minute: null 29 | concurrent_requests: 25 30 | async_mode: threaded 31 | 32 | vector_store: 33 | default_vector_store: 34 | type: "azure_ai_search" 35 | url: ${AZURE_AI_SEARCH_URL_ENDPOINT} 36 | api_key: ${AZURE_AI_SEARCH_API_KEY} 37 | container_name: "simple_text_ci" 38 | 39 | extract_claims: 40 | enabled: true 41 | 42 | community_reports: 43 | prompt: "prompts/community_report.txt" 44 | max_length: 2000 45 | max_input_length: 8000 46 | 47 | snapshots: 48 | embeddings: true -------------------------------------------------------------------------------- /graphrag/api/__init__.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """API for GraphRAG. 
5 | 6 | WARNING: This API is under development and may undergo changes in future releases. 7 | Backwards compatibility is not guaranteed at this time. 8 | """ 9 | 10 | from graphrag.api.index import build_index 11 | from graphrag.api.prompt_tune import generate_indexing_prompts 12 | from graphrag.api.query import ( 13 | basic_search, 14 | basic_search_streaming, 15 | drift_search, 16 | drift_search_streaming, 17 | global_search, 18 | global_search_streaming, 19 | local_search, 20 | local_search_streaming, 21 | multi_index_basic_search, 22 | multi_index_drift_search, 23 | multi_index_global_search, 24 | multi_index_local_search, 25 | ) 26 | from graphrag.prompt_tune.types import DocSelectionType 27 | 28 | __all__ = [ # noqa: RUF022 29 | # index API 30 | "build_index", 31 | # query API 32 | "global_search", 33 | "global_search_streaming", 34 | "local_search", 35 | "local_search_streaming", 36 | "drift_search", 37 | "drift_search_streaming", 38 | "basic_search", 39 | "basic_search_streaming", 40 | "multi_index_basic_search", 41 | "multi_index_drift_search", 42 | "multi_index_global_search", 43 | "multi_index_local_search", 44 | # prompt tuning API 45 | "DocSelectionType", 46 | "generate_indexing_prompts", 47 | ] 48 | -------------------------------------------------------------------------------- /graphrag/config/create_graphrag_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration, loaded from environment variables.""" 5 | 6 | from pathlib import Path 7 | from typing import Any 8 | 9 | from graphrag.config.models.graph_rag_config import GraphRagConfig 10 | 11 | 12 | def create_graphrag_config( 13 | values: dict[str, Any] | None = None, 14 | root_dir: str | None = None, 15 | ) -> GraphRagConfig: 16 | """Load Configuration Parameters from a dictionary. 
17 | 18 | Parameters 19 | ---------- 20 | values : dict[str, Any] | None 21 | Dictionary of configuration values to pass into pydantic model. 22 | root_dir : str | None 23 | Root directory for the project. 24 | 25 | Returns 26 | ------- 27 | GraphRagConfig 28 | The configuration object. 29 | 30 | Raises 31 | ------ 32 | ValidationError 33 | If the configuration values do not satisfy pydantic validation. 34 | """ 35 | values = values or {} 36 | if root_dir: 37 | root_path = Path(root_dir).resolve() 38 | values["root_dir"] = str(root_path) 39 | return GraphRagConfig(**values) 40 | -------------------------------------------------------------------------------- /tests/verbs/test_create_communities.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | from graphrag.config.create_graphrag_config import create_graphrag_config 5 | from graphrag.data_model.schemas import COMMUNITIES_FINAL_COLUMNS 6 | from graphrag.index.workflows.create_communities import ( 7 | run_workflow, 8 | ) 9 | from graphrag.utils.storage import load_table_from_storage 10 | 11 | from .util import ( 12 | DEFAULT_MODEL_CONFIG, 13 | compare_outputs, 14 | create_test_context, 15 | load_test_table, 16 | ) 17 | 18 | 19 | async def test_create_communities(): 20 | expected = load_test_table("communities") 21 | 22 | context = await create_test_context( 23 | storage=[ 24 | "entities", 25 | "relationships", 26 | ], 27 | ) 28 | 29 | config = create_graphrag_config({"models": DEFAULT_MODEL_CONFIG}) 30 | 31 | await run_workflow( 32 | config, 33 | context, 34 | ) 35 | 36 | actual = await load_table_from_storage("communities", context.output_storage) 37 | 38 | columns = list(expected.columns.values) 39 | # don't compare period since it is created with the current date each time 40 | columns.remove("period") 41 | compare_outputs( 42 | actual, 43 | expected, 44 | columns=columns, 45 | ) 46 | 47 | for column in COMMUNITIES_FINAL_COLUMNS: 48 | assert column in actual.columns 49 | -------------------------------------------------------------------------------- /graphrag/tokenizer/litellm_tokenizer.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """LiteLLM Tokenizer.""" 5 | 6 | from litellm import decode, encode # type: ignore 7 | 8 | from graphrag.tokenizer.tokenizer import Tokenizer 9 | 10 | 11 | class LitellmTokenizer(Tokenizer): 12 | """LiteLLM Tokenizer.""" 13 | 14 | def __init__(self, model_name: str) -> None: 15 | """Initialize the LiteLLM Tokenizer. 16 | 17 | Args 18 | ---- 19 | model_name (str): The name of the LiteLLM model to use for tokenization. 
20 | """ 21 | self.model_name = model_name 22 | 23 | def encode(self, text: str) -> list[int]: 24 | """Encode the given text into a list of tokens. 25 | 26 | Args 27 | ---- 28 | text (str): The input text to encode. 29 | 30 | Returns 31 | ------- 32 | list[int]: A list of tokens representing the encoded text. 33 | """ 34 | return encode(model=self.model_name, text=text) 35 | 36 | def decode(self, tokens: list[int]) -> str: 37 | """Decode a list of tokens back into a string. 38 | 39 | Args 40 | ---- 41 | tokens (list[int]): A list of tokens to decode. 42 | 43 | Returns 44 | ------- 45 | str: The decoded string from the list of tokens. 46 | """ 47 | return decode(model=self.model_name, tokens=tokens) 48 | -------------------------------------------------------------------------------- /graphrag/index/operations/extract_covariates/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing 'Covariate' and 'CovariateExtractionResult' models.""" 5 | 6 | from collections.abc import Awaitable, Callable, Iterable 7 | from dataclasses import dataclass 8 | from typing import Any 9 | 10 | from graphrag.cache.pipeline_cache import PipelineCache 11 | from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks 12 | 13 | 14 | @dataclass 15 | class Covariate: 16 | """Covariate class definition.""" 17 | 18 | covariate_type: str | None = None 19 | subject_id: str | None = None 20 | object_id: str | None = None 21 | type: str | None = None 22 | status: str | None = None 23 | start_date: str | None = None 24 | end_date: str | None = None 25 | description: str | None = None 26 | source_text: list[str] | None = None 27 | doc_id: str | None = None 28 | record_id: int | None = None 29 | id: str | None = None 30 | 31 | 32 | @dataclass 33 | class CovariateExtractionResult: 34 | """Covariate extraction result class definition.""" 35 | 36 
| covariate_data: list[Covariate] 37 | 38 | 39 | CovariateExtractStrategy = Callable[ 40 | [ 41 | Iterable[str], 42 | list[str], 43 | dict[str, str], 44 | WorkflowCallbacks, 45 | PipelineCache, 46 | dict[str, Any], 47 | ], 48 | Awaitable[CovariateExtractionResult], 49 | ] 50 | -------------------------------------------------------------------------------- /graphrag/language_model/providers/fnllm/cache.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2025 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """FNLLM Cache provider.""" 5 | 6 | from typing import Any 7 | 8 | from fnllm.caching import Cache as FNLLMCache 9 | 10 | from graphrag.cache.pipeline_cache import PipelineCache 11 | 12 | 13 | class FNLLMCacheProvider(FNLLMCache): 14 | """A cache for the pipeline.""" 15 | 16 | def __init__(self, cache: PipelineCache): 17 | self._cache = cache 18 | 19 | async def has(self, key: str) -> bool: 20 | """Check if the cache has a value.""" 21 | return await self._cache.has(key) 22 | 23 | async def get(self, key: str) -> Any | None: 24 | """Retrieve a value from the cache.""" 25 | return await self._cache.get(key) 26 | 27 | async def set( 28 | self, key: str, value: Any, metadata: dict[str, Any] | None = None 29 | ) -> None: 30 | """Write a value into the cache.""" 31 | await self._cache.set(key, value, metadata) 32 | 33 | async def remove(self, key: str) -> None: 34 | """Remove a value from the cache.""" 35 | await self._cache.delete(key) 36 | 37 | async def clear(self) -> None: 38 | """Clear the cache.""" 39 | await self._cache.clear() 40 | 41 | def child(self, key: str) -> "FNLLMCacheProvider": 42 | """Create a child cache.""" 43 | child_cache = self._cache.child(key) 44 | return FNLLMCacheProvider(child_cache) 45 | -------------------------------------------------------------------------------- /graphrag/index/operations/graph_to_dataframes.py: 
-------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing graph_to_dataframes definition.""" 5 | 6 | import networkx as nx 7 | import pandas as pd 8 | 9 | 10 | def graph_to_dataframes( 11 | graph: nx.Graph, 12 | node_columns: list[str] | None = None, 13 | edge_columns: list[str] | None = None, 14 | node_id: str = "title", 15 | ) -> tuple[pd.DataFrame, pd.DataFrame]: 16 | """Deconstructs an nx.Graph into nodes and edges dataframes.""" 17 | # nx graph nodes are a tuple, and creating a df from them results in the id being the index 18 | nodes = pd.DataFrame.from_dict(dict(graph.nodes(data=True)), orient="index") 19 | nodes[node_id] = nodes.index 20 | nodes.reset_index(inplace=True, drop=True) 21 | 22 | edges = nx.to_pandas_edgelist(graph) 23 | 24 | # we don't deal in directed graphs, but we do need to ensure consistent ordering for df joins 25 | # nx loses the initial ordering 26 | edges["min_source"] = edges[["source", "target"]].min(axis=1) 27 | edges["max_target"] = edges[["source", "target"]].max(axis=1) 28 | edges = edges.drop(columns=["source", "target"]).rename( 29 | columns={"min_source": "source", "max_target": "target"} # type: ignore 30 | ) 31 | 32 | if node_columns: 33 | nodes = nodes.loc[:, node_columns] 34 | 35 | if edge_columns: 36 | edges = edges.loc[:, edge_columns] 37 | 38 | return (nodes, edges) 39 | -------------------------------------------------------------------------------- /unified-search-app/app/data_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Data config module.""" 5 | 6 | # This file is used to store configurations for the graph-indexed data and the LLM/embeddings models used in the app. 
7 | 8 | # name of the table in the graph-indexed data where the communities are stored 9 | communities_table = "output/communities" 10 | 11 | # name of the table in the graph-indexed data where the community reports are stored 12 | community_report_table = "output/community_reports" 13 | 14 | # name of the table in the graph-indexed data where the entity embeddings are stored 15 | entity_table = "output/entities" 16 | 17 | # name of the table in the graph-indexed data where the entity relationships are stored 18 | relationship_table = "output/relationships" 19 | 20 | # name of the table in the graph-indexed data where the entity covariates are stored 21 | covariate_table = "output/covariates" 22 | 23 | # name of the table in the graph-indexed data where the text units are stored 24 | text_unit_table = "output/text_units" 25 | 26 | # default configurations for LLM's answer generation, used in all search types 27 | # this should be adjusted based on the token limits of the LLM model being used 28 | # The following setting is for gpt-4-1106-preview (i.e. gpt-4-turbo) 29 | # For gpt-4 (token-limit = 8k), a good setting could be: 30 | default_suggested_questions = 5 31 | 32 | # default timeout for streamlit cache 33 | default_ttl = 60 * 60 * 24 * 7 34 | -------------------------------------------------------------------------------- /graphrag/index/operations/extract_graph/typing.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing 'Document' and 'EntityExtractionResult' models.""" 5 | 6 | from collections.abc import Awaitable, Callable 7 | from dataclasses import dataclass 8 | from enum import Enum 9 | from typing import Any 10 | 11 | import networkx as nx 12 | 13 | from graphrag.cache.pipeline_cache import PipelineCache 14 | 15 | ExtractedEntity = dict[str, Any] 16 | ExtractedRelationship = dict[str, Any] 17 | StrategyConfig = dict[str, Any] 18 | EntityTypes = list[str] 19 | 20 | 21 | @dataclass 22 | class Document: 23 | """Document class definition.""" 24 | 25 | text: str 26 | id: str 27 | 28 | 29 | @dataclass 30 | class EntityExtractionResult: 31 | """Entity extraction result class definition.""" 32 | 33 | entities: list[ExtractedEntity] 34 | relationships: list[ExtractedRelationship] 35 | graph: nx.Graph | None 36 | 37 | 38 | EntityExtractStrategy = Callable[ 39 | [ 40 | list[Document], 41 | EntityTypes, 42 | PipelineCache, 43 | StrategyConfig, 44 | ], 45 | Awaitable[EntityExtractionResult], 46 | ] 47 | 48 | 49 | class ExtractEntityStrategyType(str, Enum): 50 | """ExtractEntityStrategyType class definition.""" 51 | 52 | graph_intelligence = "graph_intelligence" 53 | nltk = "nltk" 54 | 55 | def __repr__(self): 56 | """Get a string representation.""" 57 | return f'"{self.value}"' 58 | -------------------------------------------------------------------------------- /graphrag/config/models/cache_config.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """Parameterization settings for the default configuration.""" 5 | 6 | from pydantic import BaseModel, Field 7 | 8 | from graphrag.config.defaults import graphrag_config_defaults 9 | from graphrag.config.enums import CacheType 10 | 11 | 12 | class CacheConfig(BaseModel): 13 | """The default configuration section for Cache.""" 14 | 15 | type: CacheType | str = Field( 16 | description="The cache type to use.", 17 | default=graphrag_config_defaults.cache.type, 18 | ) 19 | base_dir: str = Field( 20 | description="The base directory for the cache.", 21 | default=graphrag_config_defaults.cache.base_dir, 22 | ) 23 | connection_string: str | None = Field( 24 | description="The cache connection string to use.", 25 | default=graphrag_config_defaults.cache.connection_string, 26 | ) 27 | container_name: str | None = Field( 28 | description="The cache container name to use.", 29 | default=graphrag_config_defaults.cache.container_name, 30 | ) 31 | storage_account_blob_url: str | None = Field( 32 | description="The storage account blob url to use.", 33 | default=graphrag_config_defaults.cache.storage_account_blob_url, 34 | ) 35 | cosmosdb_account_url: str | None = Field( 36 | description="The cosmosdb account url to use.", 37 | default=graphrag_config_defaults.cache.cosmosdb_account_url, 38 | ) 39 | -------------------------------------------------------------------------------- /graphrag/index/operations/compute_edge_combined_degree.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 
2 | # Licensed under the MIT License 3 | 4 | """A module containing compute_edge_combined_degree methods definition.""" 5 | 6 | from typing import cast 7 | 8 | import pandas as pd 9 | 10 | 11 | def compute_edge_combined_degree( 12 | edge_df: pd.DataFrame, 13 | node_degree_df: pd.DataFrame, 14 | node_name_column: str, 15 | node_degree_column: str, 16 | edge_source_column: str, 17 | edge_target_column: str, 18 | ) -> pd.Series: 19 | """Compute the combined degree for each edge in a graph.""" 20 | 21 | def join_to_degree(df: pd.DataFrame, column: str) -> pd.DataFrame: 22 | degree_column = _degree_colname(column) 23 | result = df.merge( 24 | node_degree_df.rename( 25 | columns={node_name_column: column, node_degree_column: degree_column} 26 | ), 27 | on=column, 28 | how="left", 29 | ) 30 | result[degree_column] = result[degree_column].fillna(0) 31 | return result 32 | 33 | output_df = join_to_degree(edge_df, edge_source_column) 34 | output_df = join_to_degree(output_df, edge_target_column) 35 | output_df["combined_degree"] = ( 36 | output_df[_degree_colname(edge_source_column)] 37 | + output_df[_degree_colname(edge_target_column)] 38 | ) 39 | return cast("pd.Series", output_df["combined_degree"]) 40 | 41 | 42 | def _degree_colname(column: str) -> str: 43 | return f"{column}_degree" 44 | -------------------------------------------------------------------------------- /.semversioner/1.1.0.json: -------------------------------------------------------------------------------- 1 | { 2 | "changes": [ 3 | { 4 | "description": "Make gleanings independent of encoding", 5 | "type": "minor" 6 | }, 7 | { 8 | "description": "Remove DataShaper (first steps).", 9 | "type": "minor" 10 | }, 11 | { 12 | "description": "Remove old pipeline runner.", 13 | "type": "minor" 14 | }, 15 | { 16 | "description": "new search implemented as a new option for the api", 17 | "type": "minor" 18 | }, 19 | { 20 | "description": "Fix gleanings loop check", 21 | "type": "patch" 22 | }, 23 | { 24 | 
"description": "Implement cosmosdb storage option for cache and output", 25 | "type": "patch" 26 | }, 27 | { 28 | "description": "Move extractor code to co-locate with operations.", 29 | "type": "patch" 30 | }, 31 | { 32 | "description": "Remove config input models.", 33 | "type": "patch" 34 | }, 35 | { 36 | "description": "Ruff update", 37 | "type": "patch" 38 | }, 39 | { 40 | "description": "Simplify and streamline internal config.", 41 | "type": "patch" 42 | }, 43 | { 44 | "description": "Simplify callbacks model.", 45 | "type": "patch" 46 | }, 47 | { 48 | "description": "Streamline flows.", 49 | "type": "patch" 50 | }, 51 | { 52 | "description": "fix instantiation of storage classes.", 53 | "type": "patch" 54 | } 55 | ], 56 | "created_at": "2025-01-07T20:25:57+00:00", 57 | "version": "1.1.0" 58 | } -------------------------------------------------------------------------------- /graphrag/config/get_embedding_settings.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """A module containing get_embedding_settings.""" 5 | 6 | from graphrag.config.models.graph_rag_config import GraphRagConfig 7 | 8 | 9 | def get_embedding_settings( 10 | settings: GraphRagConfig, 11 | vector_store_params: dict | None = None, 12 | ) -> dict: 13 | """Transform GraphRAG config into settings for workflows.""" 14 | embeddings_llm_settings = settings.get_language_model_config( 15 | settings.embed_text.model_id 16 | ) 17 | vector_store_settings = settings.get_vector_store_config( 18 | settings.embed_text.vector_store_id 19 | ).model_dump() 20 | 21 | # 22 | # If we get to this point, settings.vector_store is defined, and there's a specific setting for this embedding. 23 | # settings.vector_store.base contains connection information, or may be undefined 24 | # settings.vector_store. 
contains the specific settings for this embedding 25 | # 26 | strategy = settings.embed_text.resolved_strategy( 27 | embeddings_llm_settings 28 | ) # get the default strategy 29 | strategy.update({ 30 | "vector_store": { 31 | **(vector_store_params or {}), 32 | **(vector_store_settings), 33 | } 34 | }) # update the default strategy with the vector store settings 35 | # This ensures the vector store config is part of the strategy and not the global config 36 | return { 37 | "strategy": strategy, 38 | } 39 | -------------------------------------------------------------------------------- /unified-search-app/app/state/query_variable.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2024 Microsoft Corporation. 2 | # Licensed under the MIT License 3 | 4 | """Query variable module.""" 5 | 6 | from typing import Any 7 | 8 | import streamlit as st 9 | 10 | 11 | class QueryVariable: 12 | """ 13 | Manage reading and writing variables from the URL query string. 14 | 15 | We handle translation between string values and bools, accounting for always-lowercase URLs to avoid case issues. 16 | Note that all variables are managed via session state to account for widgets that auto-read. 17 | We just push them up to the query to keep it updated. 
18 | """ 19 | 20 | def __init__(self, key: str, default: Any | None): 21 | """Init method definition.""" 22 | self._key = key 23 | val = st.query_params[key].lower() if key in st.query_params else default 24 | if val == "true": 25 | val = True 26 | elif val == "false": 27 | val = False 28 | if key not in st.session_state: 29 | st.session_state[key] = val 30 | 31 | @property 32 | def key(self) -> str: 33 | """Key property definition.""" 34 | return self._key 35 | 36 | @property 37 | def value(self) -> Any: 38 | """Value property definition.""" 39 | return st.session_state[self._key] 40 | 41 | @value.setter 42 | def value(self, value: Any) -> None: 43 | """Value setter definition.""" 44 | st.session_state[self._key] = value 45 | st.query_params[self._key] = f"{value}".lower() 46 | --------------------------------------------------------------------------------
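The string-to-bool translation that `QueryVariable` performs in `unified-search-app/app/state/query_variable.py` can be sketched without Streamlit. This is a minimal illustration, not part of the repository: `parse_query_value` and `format_query_value` are hypothetical helper names, and a plain `str | None` argument stands in for the `st.query_params` lookup.

```python
from __future__ import annotations

from typing import Any


def parse_query_value(raw: str | None, default: Any = None) -> Any:
    """Hypothetical helper mirroring the parsing in QueryVariable.__init__."""
    # A value present in the URL is lowercased; otherwise the default is used as-is.
    val = raw.lower() if raw is not None else default
    # The literal strings "true" and "false" become real booleans.
    if val == "true":
        return True
    if val == "false":
        return False
    return val


def format_query_value(value: Any) -> str:
    """Hypothetical helper mirroring the value setter: stringify, then lowercase."""
    return f"{value}".lower()


if __name__ == "__main__":
    print(parse_query_value("TRUE"))
    print(parse_query_value(None, default="local"))
    print(format_query_value(False))
```

Because every value read from the URL is lowercased, a non-boolean string such as `Global` comes back as `global`. Code that compares these values should therefore expect lowercase.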