├── .gitignore ├── .env.example ├── custom_database ├── template.env ├── README.md ├── dataset_generation.ipynb ├── utils.py └── prompts.py ├── servers ├── README.md └── mcp-neo4j-cypher │ └── README.md ├── README.md ├── generate_eval_dataset ├── README.md ├── utils.py ├── dataset_generation.ipynb └── prompts.py └── LICENSE /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .ipynb_checkpoints 3 | .env 4 | .DS_Store -------------------------------------------------------------------------------- /.env.example: -------------------------------------------------------------------------------- 1 | ANTHROPIC_API_KEY= 2 | GOOGLE_API_KEY= 3 | OPENAI_API_KEY= -------------------------------------------------------------------------------- /custom_database/template.env: -------------------------------------------------------------------------------- 1 | # Comma-separated list of {model_provider}:{model} as per https://python.langchain.com/api_reference/langchain/chat_models.html 2 | LLM_CREATE_QUESTIONS={model_provider}:{model}, * 3 | # One model {model_provider}:{model} as per https://python.langchain.com/api_reference/langchain/chat_models.html 4 | LLM_CREATE_ANSWERS={model_provider}:{model} 5 | 6 | _API_KEY=.. 7 | _API_KEY=.. 8 | 9 | DATABASES=[{"uri": "", "username": "", "password": "", "database": ""}, {...}] -------------------------------------------------------------------------------- /servers/README.md: -------------------------------------------------------------------------------- 1 | # MCP servers evaluation 2 | 3 | This folder contains multiple server setups for evaluating how well LLM agents query knowledge graphs via tool-based interfaces. 
4 | 5 | Each subdirectory: 6 | 7 | * Implements or reuses an MCP-compatible server 8 | * Evaluates it against a shared dataset using the provided notebook or script 9 | 10 | ### Evaluation 11 | 12 | We use an LLM as a judge to assess the quality of responses across different categories of questions. 13 | 14 | ### Structure 15 | 16 | ``` 17 | / 18 | ├── mcp-neo4j-cypher/ # Uses existing mcp-neo4j-cypher server 19 | ├── my-custom-server/ # Custom MCP-compatible server 20 | ... 21 | ``` 22 | 23 | Evaluations focus on how effectively each server enables agent-based querying of knowledge graphs. -------------------------------------------------------------------------------- /servers/mcp-neo4j-cypher/README.md: -------------------------------------------------------------------------------- 1 | # mcp-neo4j-cypher 2 | 3 | This project evaluates **`mcp-neo4j-cypher`** in the accompanying evaluation notebook. 4 | 5 | The `mcp-neo4j-cypher` server allows an LLM agent to extract the Neo4j database schema and generate Cypher queries to read from and update the graph. 6 | 7 | ### Available Tools 8 | 9 | * **`get-neo4j-schema`** – Extracts the graph schema. 10 | * **`read-neo4j-cypher`** – Executes read-only Cypher queries. 11 | * **`write-neo4j-cypher`** – Executes Cypher write/update operations. 12 | 13 | **Repo:** [neo4j-contrib/mcp-neo4j-cypher](https://github.com/neo4j-contrib/mcp-neo4j/tree/main/servers/mcp-neo4j-cypher) 14 | **Docs:** [Neo4j Developer Guide](https://neo4j.com/developer/genai-ecosystem/model-context-protocol-mcp/#_mcp_neo4j_cypher) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GRAPE 2 | 3 | **Graph Retriever Analysis and Performance Evaluation** 4 | 5 | **GRAPE** is a framework for benchmarking how well LLM agents query knowledge graphs via MCP-compatible servers. 
6 | 7 | ### Structure 8 | 9 | * `generate_eval_dataset/` – Uses LLMs to generate questions and answers from real Neo4j databases 10 | * `servers/` – Evaluates MCP server implementations against the generated dataset using an LLM judge 11 | 12 | GRAPE supports multiple domains, real-world graphs from [demo.neo4jlabs.com](https://demo.neo4jlabs.com), and a consistent evaluation pipeline. 13 | 14 | ### How to Start 15 | 16 | 1. **Use the existing dataset** 17 | The repository includes a pre-generated `generated_dataset.json`. 18 | Re-running `dataset_generation.ipynb` is optional. 19 | 20 | 2. **Run evaluation** 21 | Go to a folder in `servers/` and run the evaluation notebook with the dataset. 22 | 23 | ### Contribute 24 | 25 | * Add evaluations for additional MCP server implementations 26 | -------------------------------------------------------------------------------- /generate_eval_dataset/README.md: -------------------------------------------------------------------------------- 1 | # evaluation-dataset-generation 2 | 3 | This folder contains code and prompts for generating the evaluation dataset used to benchmark MCP server implementations. 4 | 5 | We use LLMs to generate: 6 | 7 | * Natural language questions across multiple categories 8 | * Ground truth answers (in Cypher or plain text, depending on the task) 9 | 10 | ### Data Sources 11 | 12 | The dataset is based on **five Neo4j databases** hosted at [demo.neo4jlabs.com](https://demo.neo4jlabs.com), each representing a different domain for evaluation. 
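
As a rough illustration of how the five demo databases are addressed, the sketch below builds the connection settings that `create_graph_connection` in `utils.py` passes to `Neo4jGraph`. The URL and database names come from `dataset_generation.ipynb`; the helper name `demo_connection_settings` is hypothetical and exists only for this example. It assumes, as the repository code does, that on the demo server the database name doubles as username and password.

```python
from typing import Any, Dict, List

# Values taken from dataset_generation.ipynb in this folder.
DEMO_URL = "neo4j+s://demo.neo4jlabs.com"
DEMO_DATABASES: List[str] = [
    "companies",
    "twitch",
    "network",
    "northwind",
    "ClinicalKnowledgeGraph",
]


def demo_connection_settings(name: str, url: str = DEMO_URL) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments for Neo4jGraph on the demo server.

    Mirrors create_graph_connection, which reuses one credential string
    as username, password, and database name.
    """
    return {
        "url": url,
        "username": name,
        "password": name,
        "database": name,
        "timeout": 90,
    }


settings = [demo_connection_settings(name) for name in DEMO_DATABASES]
for entry in settings:
    print(entry["database"], "->", entry["url"])
```

Each dictionary could then be unpacked into `Neo4jGraph(**entry)` when the `langchain-neo4j` package is installed and the demo server is reachable.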
13 | 14 | ### Code Structure 15 | 16 | * `dataset_generation.ipynb` – Main notebook for generating the dataset 17 | * `prompts.py` – Prompt templates for question and answer generation 18 | * `utils.py` – Helper functions for database access and formatting 19 | * `generated_dataset.json` – Final output containing questions, answers, and metadata 20 | 21 | The generated dataset is used to evaluate how well MCP-compatible servers support agent-based querying over real-world knowledge graphs. 22 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Tomaz Bratanic 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /custom_database/README.md: -------------------------------------------------------------------------------- 1 | # evaluation-dataset-generation 2 | 3 | This folder contains code and prompts for generating your own evaluation dataset for benchmarking MCP server implementations. 4 | 5 | We use LLMs to generate: 6 | 7 | * Natural language questions across multiple categories 8 | * Ground truth answers (in Cypher or plain text, depending on the task) 9 | 10 | ### Code Structure 11 | 12 | * `dataset_generation.ipynb` – Main notebook for generating the dataset 13 | * `prompts.py` – Prompt templates for question and answer generation 14 | * `utils.py` – Helper functions for database access and formatting 15 | * `generated_dataset_.json` – Final output containing questions, answers, and metadata (the notebook adds a timestamp to the file name) 16 | * `template.env` – Example of how to provide the parameters for your LLMs and databases; the notebook reads them from `run.env` 17 | 18 | ### Env File Structure 19 | 20 | * `LLM_CREATE_QUESTIONS`: comma-separated list of LLMs used to generate questions. Each entry follows the pattern {model_provider}:{model} accepted by [Langchain](https://python.langchain.com/api_reference/langchain/chat_models/langchain.chat_models.base.init_chat_model.html) 21 | * `LLM_CREATE_ANSWERS`: single LLM used to create the answers, following the same {model_provider}:{model} pattern accepted by [Langchain](https://python.langchain.com/api_reference/langchain/chat_models/langchain.chat_models.base.init_chat_model.html) 22 | * `xxx_API_KEY`: API key for each LLM provider in use, one variable per provider. The notebook exports every variable whose name contains `_API_KEY` 23 | * `DATABASES`: JSON list of the databases used for Q&A generation. Each entry is a dictionary with the keys `uri`, `username`, `password`, and optionally `database` (default: `neo4j`) 24 | 25 | The generated dataset is used to evaluate how well MCP-compatible servers support agent-based querying over real-world knowledge graphs. 
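
The env file structure described above can be checked before running the notebook with a small parser like the one below. This is a minimal sketch, not the notebook's code: `parse_run_config` is a hypothetical helper, the whitespace stripping and key validation are additions for illustration, and the sample values (model identifiers, demo database) are only examples. The notebook itself obtains the same dictionary of strings from `dotenv_values("run.env")`.

```python
import json
from typing import Any, Dict, List, Tuple

REQUIRED_DB_KEYS = {"uri", "username", "password"}


def parse_run_config(config: Dict[str, str]) -> Tuple[List[str], str, List[Dict[str, Any]]]:
    """Hypothetical helper: split the env values into models and databases."""
    # LLM_CREATE_QUESTIONS is a comma-separated list of {model_provider}:{model}.
    question_models = [
        entry.strip()
        for entry in config["LLM_CREATE_QUESTIONS"].split(",")
        if entry.strip()
    ]
    # LLM_CREATE_ANSWERS holds exactly one {model_provider}:{model}.
    answer_model = config["LLM_CREATE_ANSWERS"].strip()
    # DATABASES must be valid JSON, so keys need double quotes.
    databases = json.loads(config["DATABASES"])
    for db in databases:
        missing = REQUIRED_DB_KEYS - db.keys()
        if missing:
            raise ValueError(f"Database entry is missing keys: {sorted(missing)}")
        # The database name is optional and defaults to "neo4j".
        db.setdefault("database", "neo4j")
    return question_models, answer_model, databases


# Example values only; a real run.env holds your own models and credentials.
example_config = {
    "LLM_CREATE_QUESTIONS": "anthropic:claude-opus-4-20250514, google_genai:gemini-2.5-pro",
    "LLM_CREATE_ANSWERS": "anthropic:claude-opus-4-20250514",
    "DATABASES": '[{"uri": "neo4j+s://demo.neo4jlabs.com", "username": "companies", "password": "companies"}]',
}

question_models, answer_model, databases = parse_run_config(example_config)
print(question_models)
print(answer_model)
print(databases[0]["database"])
```

Validating the file up front surfaces a malformed `DATABASES` value as a clear error instead of a failure partway through generation.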
26 | -------------------------------------------------------------------------------- /custom_database/dataset_generation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "d3c6c6c0-0edc-467c-a6c2-8113dbf4c99e", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "#!pip install --quiet langchain-anthropic langchain-neo4j cyVer langchain-google-genai json-repair \"numpy<2\"" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": null, 16 | "id": "ab36fa35-f5b9-483a-8d3f-93deae83a287", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "from dotenv import load_dotenv, dotenv_values\n", 21 | "import json\n", 22 | "from tqdm import tqdm\n", 23 | "import pandas as pd\n", 24 | "import time\n", 25 | "import os\n", 26 | "\n", 27 | "from langchain.chat_models import init_chat_model\n", 28 | "\n", 29 | "from utils import (\n", 30 | " _value_sanitize,\n", 31 | " extract_json_from_markdown,\n", 32 | " sampling_query,\n", 33 | " validate_cypher,\n", 34 | " process_database,\n", 35 | " process_all_examples_with_limit,\n", 36 | " convert_datetime\n", 37 | ")\n", 38 | "from prompts import (\n", 39 | " system_prompt,\n", 40 | " simple_system_prompt,\n", 41 | ")" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "id": "616b8449-ca9a-4e0d-81a6-e70282af7db6", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "config = dotenv_values(\"run.env\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "id": "53bd7358-d098-4aa7-83b1-4c56c0b43a5f", 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "for k in config:\n", 62 | " if \"_API_KEY\" in k:\n", 63 | " print(\"setup the env variable for \", k)\n", 64 | " os.environ[k] = config.get(k)" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "id": 
"039c960d-62b6-4020-91f8-9fb7df14d299", 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "DATABASES=json.loads(config.get('DATABASES'))\n", 75 | "LLM_CREATE_QUESTIONS=config.get('LLM_CREATE_QUESTIONS').split(\",\")\n", 76 | "LLM_CREATE_ANSWERS=config.get('LLM_CREATE_ANSWERS')" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "8eb8a21e-12fe-47b7-9895-6648b42045a5", 82 | "metadata": {}, 83 | "source": [ 84 | "# Generate dataset" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "b3dfc7f0-86fb-424e-8a41-1a2eb4677dc7", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "# LLM selection\n", 95 | "models = []\n", 96 | "for l in LLM_CREATE_QUESTIONS:\n", 97 | " models.append(init_chat_model(l, temperature=0))" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "id": "474866e2-b4a9-4b98-962d-11754df7fde1", 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "simple_batch_count = 1 # Number of iterations for simple queries\n", 108 | "multi_batch_count = 1 # Number of iterations complex queries\n", 109 | "\n", 110 | "output = []\n", 111 | "\n", 112 | "for model in models:\n", 113 | " print(model.model)\n", 114 | " for database in tqdm(DATABASES, desc=\"Processing databases\"):\n", 115 | " # Simple question\n", 116 | " database_records = process_database(\n", 117 | " database, model, simple_batch_count, simple_system_prompt\n", 118 | " )\n", 119 | " output.extend(database_records)\n", 120 | "\n", 121 | " database_records = process_database(\n", 122 | " database, model, multi_batch_count, system_prompt\n", 123 | " )\n", 124 | " output.extend(database_records)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "id": "6912281c-de2a-46b7-983e-9bd8eea21a73", 130 | "metadata": {}, 131 | "source": [ 132 | "# Generate text answers" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "id": 
"01d18e3c-875a-4b12-8713-b14545fab33f", 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "qa_model = init_chat_model(LLM_CREATE_ANSWERS, temperature=0)" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": null, 148 | "id": "df70df72-0765-4b13-8c5e-9f171b6f70c8", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "validated = [el for el in output if el[\"validated\"]]" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "id": "7dfe6727-7ee2-405d-b02a-cfe9977b9c3f", 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "len(validated)" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "id": "5dfd5b0b-358c-4fda-9c90-72a835f0a7c9", 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "# Generate text-based answers\n", 173 | "await process_all_examples_with_limit(validated, qa_model)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "id": "75c4a539-b6ce-4f52-8964-4ef93a0600dc", 180 | "metadata": {}, 181 | "outputs": [], 182 | "source": [ 183 | "# If the question cannot be answered, remove record\n", 184 | "validated = [el for el in validated if not \"UNKNOWN\" in el['answer']]\n", 185 | "\n", 186 | "df = pd.DataFrame.from_records(validated)\n", 187 | "print(f\"Total QA pairs: {len(df)}\")\n", 188 | "df.head(5)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "id": "7938bb90-115c-4060-ba5e-d1bbbb61e7ac", 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "# Assuming 'output' is defined elsewhere in your code\n", 199 | "timestr = time.strftime(\"%Y%m%d-%H%M%S\")\n", 200 | "print(timestr)\n", 201 | "with open(f\"generated_dataset_{timestr}.json\", \"w\") as f:\n", 202 | " json.dump(validated, f, indent=2, default=convert_datetime)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | 
"id": "438a0ce3-6f5e-4b6e-865d-c8afb34c5045", 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [] 212 | } 213 | ], 214 | "metadata": { 215 | "kernelspec": { 216 | "display_name": "Python 3 (ipykernel)", 217 | "language": "python", 218 | "name": "python3" 219 | }, 220 | "language_info": { 221 | "codemirror_mode": { 222 | "name": "ipython", 223 | "version": 3 224 | }, 225 | "file_extension": ".py", 226 | "mimetype": "text/x-python", 227 | "name": "python", 228 | "nbconvert_exporter": "python", 229 | "pygments_lexer": "ipython3", 230 | "version": "3.13.7" 231 | } 232 | }, 233 | "nbformat": 4, 234 | "nbformat_minor": 5 235 | } 236 | -------------------------------------------------------------------------------- /generate_eval_dataset/utils.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import json 3 | 4 | import json_repair 5 | import re 6 | from datetime import datetime 7 | 8 | import pandas as pd 9 | import neo4j 10 | 11 | from tqdm.asyncio import tqdm_asyncio 12 | from tqdm import tqdm 13 | from langchain_anthropic import ChatAnthropic 14 | from langchain_neo4j import Neo4jGraph 15 | from CyVer import SchemaValidator 16 | 17 | from typing import Any 18 | 19 | from prompts import ( 20 | system_prompt, 21 | user_prompt, 22 | qa_system_prompt, 23 | qa_user_prompt 24 | ) 25 | 26 | def convert_datetime(obj): 27 | if isinstance(obj, (pd.Timestamp, datetime, neo4j.time.DateTime, neo4j.time.Date)): 28 | return obj.isoformat() 29 | raise TypeError(f"Object of type {type(obj)} is not JSON serializable") 30 | 31 | def _value_sanitize(d): 32 | """Sanitize the input dictionary or list. 33 | 34 | Sanitizes the input by removing embedding-like values, 35 | lists with more than 128 elements, that are mostly irrelevant for 36 | generating answers in a LLM context. 
These properties, if left in 37 | results, can occupy significant context space and detract from 38 | the LLM's performance by introducing unnecessary noise and cost. 39 | 40 | Args: 41 | d (Any): The input dictionary or list to sanitize. 42 | 43 | Returns: 44 | Any: The sanitized dictionary or list. 45 | """ 46 | if isinstance(d, dict): 47 | new_dict = {} 48 | for key, value in d.items(): 49 | if isinstance(value, dict): 50 | sanitized_value = _value_sanitize(value) 51 | if ( 52 | sanitized_value is not None 53 | ): # Check if the sanitized value is not None 54 | new_dict[key] = sanitized_value 55 | elif isinstance(value, list): 56 | if len(value) < 56: 57 | sanitized_value = _value_sanitize(value) 58 | if ( 59 | sanitized_value is not None 60 | ): # Check if the sanitized value is not None 61 | new_dict[key] = sanitized_value 62 | # Do not include the key if the list is oversized 63 | else: 64 | new_dict[key] = value 65 | return new_dict 66 | elif isinstance(d, list): 67 | if len(d) < 56: 68 | return [ 69 | _value_sanitize(item) for item in d if _value_sanitize(item) is not None 70 | ] 71 | else: 72 | return None 73 | else: 74 | return d 75 | 76 | def extract_json_from_markdown(text: str): 77 | """ 78 | Extracts and parses JSON content wrapped between ```json and ``` markers. 79 | 80 | Parameters: 81 | text (str): Input string containing JSON in markdown code block format. 82 | 83 | Returns: 84 | object or None: Parsed JSON object if found and valid, else None. 
85 | """ 86 | match = re.search(r"```json\s*(.*?)\s*```", text, re.DOTALL) 87 | if match: 88 | json_str = match.group(1) 89 | try: 90 | return json_repair.loads(json_str) 91 | except json.JSONDecodeError as e: 92 | print("JSON decode error:", e) 93 | return None 94 | else: 95 | try: 96 | return json_repair.loads(text) 97 | except json.JSONDecodeError as e: 98 | print("JSON decode error:", e) 99 | return None 100 | 101 | sampling_query = """// Collect sample nodes with different labels for diversity 102 | MATCH (n) 103 | WITH labels(n) as labelSet, collect(n)[0] as sampleNode 104 | ORDER BY size(labelSet) DESC, rand() 105 | WITH collect(sampleNode) as sampleNodes 106 | 107 | // Generate paths from diverse starting points 108 | UNWIND sampleNodes[0..15] as startNode 109 | CALL (startNode) { 110 | // Get shorter paths 111 | MATCH p1=(startNode)-[*2..3]-() 112 | RETURN p1 as p 113 | LIMIT 2 114 | UNION ALL 115 | // Get longer paths 116 | WITH startNode 117 | MATCH p2=()-[*3..4]->(startNode) 118 | RETURN p2 as p 119 | LIMIT 2 120 | } 121 | // Deduplicate and enrich path information 122 | WITH DISTINCT p 123 | WITH p, 124 | nodes(p) as pathNodes, 125 | relationships(p) as pathRels, 126 | length(p) as pathLength 127 | 128 | RETURN 129 | pathLength, 130 | [node in pathNodes | {labels: labels(node), props: properties(node)}] as nodesInfo, 131 | [rel in pathRels | {type: type(rel), props: properties(rel)}] as relsInfo, 132 | // Create a human-readable path signature 133 | reduce(s = "", i in range(0, length(p) + 1) | 134 | s + 135 | CASE 136 | WHEN i % 2 = 0 137 | THEN "(:" + labels(nodes(p)[i/2])[0] + ")" 138 | ELSE "-[:" + type(relationships(p)[(i-1)/2]) + "]->" 139 | END 140 | ) + "(:" + labels(nodes(p)[-1])[0] + ")" as pathSignature 141 | ORDER BY pathLength DESC, rand() 142 | LIMIT 25""" 143 | 144 | def validate_cypher(schema_validator, query, database_name): 145 | schema_score, schema_metadata = schema_validator.validate(query, database_name=database_name) 146 | if 
schema_score == 1.0: 147 | return True 148 | else: 149 | return False 150 | 151 | 152 | def create_graph_connection(credential: str, db_url: str) -> Neo4jGraph: 153 | """Create and return a Neo4j graph connection.""" 154 | return Neo4jGraph( 155 | url=db_url, 156 | username=credential, 157 | password=credential, 158 | database=credential, 159 | timeout=90 160 | ) 161 | 162 | 163 | def generate_qa_pairs(graph: Neo4jGraph, model: ChatAnthropic, system_prompt: str) -> list: 164 | """Generate question-answer pairs using the LLM and graph data.""" 165 | paths = _value_sanitize(graph.query(sampling_query)) 166 | messages = [ 167 | ("system", system_prompt), 168 | ("human", user_prompt.format(schema=graph.schema, paths=paths)), 169 | ] 170 | response = model.invoke(messages, max_tokens=25000) 171 | return extract_json_from_markdown(response.content) 172 | 173 | 174 | def validate_and_execute_record(record: dict, schema_validator: SchemaValidator, 175 | graph: Neo4jGraph, credential: str) -> dict: 176 | """Validate Cypher query and execute it, updating the record with results.""" 177 | # Validate against schema 178 | record["validated"] = validate_cypher(schema_validator, record["cypher"], credential) 179 | 180 | if not record["validated"]: 181 | return record 182 | 183 | # Execute query and handle exceptions 184 | try: 185 | response = graph.query(record["cypher"]) 186 | record["result"] = response 187 | 188 | # Check if result meets criteria (single non-empty, non-zero value) 189 | if not response or len(response) > 1: 190 | record["validated"] = False 191 | 192 | except Exception: 193 | record["validated"] = False 194 | 195 | return record 196 | 197 | 198 | def process_database(credential: str, db_url: str, model: Any, 199 | iterations_per_database: int, 200 | system_prompt: str = system_prompt) -> list: 201 | """Process a single database and return all generated records.""" 202 | graph = create_graph_connection(credential, db_url) 203 | schema_validator = 
SchemaValidator(graph._driver) 204 | database_output = [] 205 | 206 | for _ in tqdm(range(iterations_per_database), 207 | desc=f"Iterations for {credential}", 208 | leave=False): 209 | try: 210 | # Generate QA pairs 211 | data = generate_qa_pairs(graph, model, system_prompt) 212 | # Validate and execute each record 213 | for record in data: 214 | # Add model name 215 | record["model"] = model._llm_type 216 | # Add database name 217 | record["database"] = credential 218 | validated_record = validate_and_execute_record( 219 | record, schema_validator, graph, credential 220 | ) 221 | database_output.append(validated_record) 222 | except Exception: 223 | # Re-raise so a failed iteration is surfaced instead of silently skipped 224 | raise 225 | 226 | return database_output 227 | 228 | async def process_example(example, qa_model): 229 | """Process a single example asynchronously""" 230 | if not example.get('validated'): 231 | return # Skip unvalidated examples 232 | 233 | qa_messages = [ 234 | ("system", qa_system_prompt), 235 | ("human", qa_user_prompt.format( 236 | question=example['question'], 237 | cypher_query=example['cypher'], 238 | result=example['result'] 239 | )), 240 | ] 241 | 242 | answer = await qa_model.ainvoke(qa_messages) 243 | example['answer'] = answer.content # Modify original dict in-place 244 | 245 | # Main concurrent processing 246 | async def process_all_examples(data, qa_model): 247 | """Process all examples concurrently with progress bar""" 248 | tasks = [ 249 | process_example(example, qa_model) 250 | for example in data 251 | ] 252 | 253 | # Execute all tasks concurrently with progress bar 254 | await tqdm_asyncio.gather(*tasks, desc="Generating text answers") 255 | 256 | # Alternative with semaphore for rate limiting 257 | async def process_all_examples_with_limit(data, qa_model, max_concurrent=10): 258 | """Process all examples concurrently with a limit on concurrent requests""" 259 | semaphore = asyncio.Semaphore(max_concurrent) 260 | 261 | async def process_with_semaphore(example): 262 | async with semaphore: 263 
| await process_example(example, qa_model) 264 | 265 | tasks = [process_with_semaphore(example) for example in data] 266 | await tqdm_asyncio.gather(*tasks, desc="Processing examples") 267 | -------------------------------------------------------------------------------- /custom_database/utils.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import json 3 | 4 | import json_repair 5 | import re 6 | from datetime import datetime 7 | 8 | import pandas as pd 9 | import neo4j 10 | 11 | from tqdm.asyncio import tqdm_asyncio 12 | from tqdm import tqdm 13 | from langchain_anthropic import ChatAnthropic 14 | from langchain_neo4j import Neo4jGraph 15 | from CyVer import SchemaValidator 16 | 17 | from typing import Any 18 | 19 | from prompts import ( 20 | system_prompt, 21 | user_prompt, 22 | qa_system_prompt, 23 | qa_user_prompt 24 | ) 25 | 26 | def convert_datetime(obj): 27 | if isinstance(obj, (pd.Timestamp, datetime, neo4j.time.DateTime, neo4j.time.Date)): 28 | return obj.isoformat() 29 | raise TypeError(f"Object of type {type(obj)} is not JSON serializable") 30 | 31 | def _value_sanitize(d): 32 | """Sanitize the input dictionary or list. 33 | 34 | Sanitizes the input by removing embedding-like values, i.e. 35 | lists with 56 or more elements, which are mostly irrelevant for 36 | generating answers in an LLM context. These properties, if left in 37 | results, can occupy significant context space and detract from 38 | the LLM's performance by introducing unnecessary noise and cost. 39 | 40 | Args: 41 | d (Any): The input dictionary or list to sanitize. 42 | 43 | Returns: 44 | Any: The sanitized dictionary or list. 
45 | """ 46 | if isinstance(d, dict): 47 | new_dict = {} 48 | for key, value in d.items(): 49 | if isinstance(value, dict): 50 | sanitized_value = _value_sanitize(value) 51 | if ( 52 | sanitized_value is not None 53 | ): # Check if the sanitized value is not None 54 | new_dict[key] = sanitized_value 55 | elif isinstance(value, list): 56 | if len(value) < 56: 57 | sanitized_value = _value_sanitize(value) 58 | if ( 59 | sanitized_value is not None 60 | ): # Check if the sanitized value is not None 61 | new_dict[key] = sanitized_value 62 | # Do not include the key if the list is oversized 63 | else: 64 | new_dict[key] = value 65 | return new_dict 66 | elif isinstance(d, list): 67 | if len(d) < 56: 68 | return [ 69 | _value_sanitize(item) for item in d if _value_sanitize(item) is not None 70 | ] 71 | else: 72 | return None 73 | else: 74 | return d 75 | 76 | def extract_json_from_markdown(text: str): 77 | """ 78 | Extracts and parses JSON content wrapped between ```json and ``` markers. 79 | 80 | Parameters: 81 | text (str): Input string containing JSON in markdown code block format. 82 | 83 | Returns: 84 | object or None: Parsed JSON object if found and valid, else None. 
85 | """ 86 | match = re.search(r"```json\s*(.*?)\s*```", text, re.DOTALL) 87 | if match: 88 | json_str = match.group(1) 89 | try: 90 | return json_repair.loads(json_str) 91 | except json.JSONDecodeError as e: 92 | print("JSON decode error:", e) 93 | return None 94 | else: 95 | try: 96 | return json_repair.loads(text) 97 | except json.JSONDecodeError as e: 98 | print("JSON decode error:", e) 99 | return None 100 | 101 | sampling_query = """// Collect sample nodes with different labels for diversity 102 | MATCH (n) 103 | WITH labels(n) as labelSet, collect(n)[0] as sampleNode 104 | ORDER BY size(labelSet) DESC, rand() 105 | WITH collect(sampleNode) as sampleNodes 106 | 107 | // Generate paths from diverse starting points 108 | UNWIND sampleNodes[0..15] as startNode 109 | CALL (startNode) { 110 | // Get shorter paths 111 | MATCH p1=(startNode)-[*2..3]-() 112 | RETURN p1 as p 113 | LIMIT 2 114 | UNION ALL 115 | // Get longer paths 116 | WITH startNode 117 | MATCH p2=()-[*3..4]->(startNode) 118 | RETURN p2 as p 119 | LIMIT 2 120 | } 121 | // Deduplicate and enrich path information 122 | WITH DISTINCT p 123 | WITH p, 124 | nodes(p) as pathNodes, 125 | relationships(p) as pathRels, 126 | length(p) as pathLength 127 | 128 | RETURN 129 | pathLength, 130 | [node in pathNodes | {labels: labels(node), props: properties(node)}] as nodesInfo, 131 | [rel in pathRels | {type: type(rel), props: properties(rel)}] as relsInfo, 132 | // Create a human-readable path signature 133 | reduce(s = "", i in range(0, length(p) + 1) | 134 | s + 135 | CASE 136 | WHEN i % 2 = 0 137 | THEN "(:" + labels(nodes(p)[i/2])[0] + ")" 138 | ELSE "-[:" + type(relationships(p)[(i-1)/2]) + "]->" 139 | END 140 | ) + "(:" + labels(nodes(p)[-1])[0] + ")" as pathSignature 141 | ORDER BY pathLength DESC, rand() 142 | LIMIT 25""" 143 | 144 | def validate_cypher(schema_validator, query, database_name): 145 | schema_score, schema_metadata = schema_validator.validate(query, database_name=database_name) 146 | if 
schema_score == 1.0: 147 | return True 148 | else: 149 | return False 150 | 151 | 152 | def create_graph_connection(database_name: str, username: str, password: str, db_url: str) -> Neo4jGraph: 153 | """Create and return a Neo4j graph connection.""" 154 | return Neo4jGraph( 155 | url=db_url, 156 | username=username, 157 | password=password, 158 | database=database_name, 159 | timeout=90 160 | ) 161 | 162 | 163 | def generate_qa_pairs(graph: Neo4jGraph, model: Any, system_prompt: str) -> list: 164 | """Generate question-answer pairs using the LLM and graph data.""" 165 | paths = _value_sanitize(graph.query(sampling_query)) 166 | messages = [ 167 | ("system", system_prompt), 168 | ("human", user_prompt.format(schema=graph.schema, paths=paths)), 169 | ] 170 | response = model.invoke(messages, max_tokens=25000) 171 | return extract_json_from_markdown(response.content) 172 | 173 | 174 | def validate_and_execute_record(record: dict, schema_validator: SchemaValidator, 175 | graph: Neo4jGraph, database_name: str) -> dict: 176 | """Validate Cypher query and execute it, updating the record with results.""" 177 | # Validate against schema 178 | record["validated"] = validate_cypher(schema_validator, record["cypher"], database_name) 179 | 180 | if not record["validated"]: 181 | return record 182 | 183 | # Execute query and handle exceptions 184 | try: 185 | response = graph.query(record["cypher"]) 186 | record["result"] = response 187 | 188 | # Keep only queries that return exactly one row (reject empty or multi-row results) 189 | if not response or len(response) > 1: 190 | record["validated"] = False 191 | 192 | except Exception: 193 | record["validated"] = False 194 | 195 | return record 196 | 197 | def process_database(database: dict, model: Any, 198 | iterations_per_database: int, 199 | system_prompt: str = system_prompt) -> list: 200 | """Process a single database and return all generated records.""" 201 | if 'database' not in database: 202 | database_name = 'neo4j' 203 | else: 
204 | database_name = database['database'] 205 | graph = create_graph_connection(database_name, database['username'], database['password'], database['uri']) 206 | schema_validator = SchemaValidator(graph._driver) 207 | database_output = [] 208 | 209 | for _ in tqdm(range(iterations_per_database), 210 | desc=f"Iterations for {database_name}", 211 | leave=False): 212 | try: 213 | # Generate QA pairs 214 | data = generate_qa_pairs(graph, model, system_prompt) 215 | # Validate and execute each record 216 | for record in data: 217 | # Add model name 218 | record["model"] = model._llm_type 219 | # Add database name 220 | record["database"] = database_name 221 | validated_record = validate_and_execute_record( 222 | record, schema_validator, graph, database_name 223 | ) 224 | database_output.append(validated_record) 225 | except Exception: 226 | # Re-raise so a failed iteration is surfaced instead of silently skipped 227 | raise 228 | 229 | return database_output 230 | 231 | async def process_example(example, qa_model): 232 | """Process a single example asynchronously""" 233 | if not example.get('validated'): 234 | return # Skip unvalidated examples 235 | 236 | qa_messages = [ 237 | ("system", qa_system_prompt), 238 | ("human", qa_user_prompt.format( 239 | question=example['question'], 240 | cypher_query=example['cypher'], 241 | result=example['result'] 242 | )), 243 | ] 244 | 245 | answer = await qa_model.ainvoke(qa_messages) 246 | example['answer'] = answer.content # Modify original dict in-place 247 | 248 | # Main concurrent processing 249 | async def process_all_examples(data, qa_model): 250 | """Process all examples concurrently with progress bar""" 251 | tasks = [ 252 | process_example(example, qa_model) 253 | for example in data 254 | ] 255 | 256 | # Execute all tasks concurrently with progress bar 257 | await tqdm_asyncio.gather(*tasks, desc="Generating text answers") 258 | 259 | # Alternative with semaphore for rate limiting 260 | async def process_all_examples_with_limit(data, qa_model, max_concurrent=10): 261 | """Process all examples 
concurrently with a limit on concurrent requests""" 262 | semaphore = asyncio.Semaphore(max_concurrent) 263 | 264 | async def process_with_semaphore(example): 265 | async with semaphore: 266 | await process_example(example, qa_model) 267 | 268 | tasks = [process_with_semaphore(example) for example in data] 269 | await tqdm_asyncio.gather(*tasks, desc="Processing examples") 270 | -------------------------------------------------------------------------------- /generate_eval_dataset/dataset_generation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "id": "d3c6c6c0-0edc-467c-a6c2-8113dbf4c99e", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "#!pip install --quiet langchain-anthropic langchain-neo4j cyVer langchain-google-genai json-repair \"numpy<2\"" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "id": "def56e63-f956-4460-bde1-b1c230c3230e", 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/plain": [ 22 | "True" 23 | ] 24 | }, 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "output_type": "execute_result" 28 | } 29 | ], 30 | "source": [ 31 | "from dotenv import load_dotenv\n", 32 | "\n", 33 | "load_dotenv()" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 2, 39 | "id": "ab36fa35-f5b9-483a-8d3f-93deae83a287", 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "from tqdm import tqdm\n", 44 | "import pandas as pd\n", 45 | "import json\n", 46 | "\n", 47 | "from langchain_anthropic import ChatAnthropic\n", 48 | "from langchain_google_genai import ChatGoogleGenerativeAI\n", 49 | "\n", 50 | "from utils import (\n", 51 | " _value_sanitize,\n", 52 | " extract_json_from_markdown,\n", 53 | " sampling_query,\n", 54 | " validate_cypher,\n", 55 | " process_database,\n", 56 | " process_all_examples_with_limit,\n", 57 | " convert_datetime\n", 58 | 
")\n", 59 | "from prompts import (\n", 60 | " system_prompt,\n", 61 | " simple_system_prompt,\n", 62 | ")" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "8eb8a21e-12fe-47b7-9895-6648b42045a5", 68 | "metadata": {}, 69 | "source": [ 70 | "# Generate dataset" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "id": "b3dfc7f0-86fb-424e-8a41-1a2eb4677dc7", 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "# LLM selection\n", 81 | "models = [ChatAnthropic(model='claude-opus-4-20250514')] #, ChatGoogleGenerativeAI(model=\"gemini-2.5-pro\")]\n", 82 | "\n", 83 | "# Database selection (for demo database)\n", 84 | "db_url = \"neo4j+s://demo.neo4jlabs.com\"\n", 85 | "databases = [\n", 86 | " \"companies\",\n", 87 | " \"twitch\", \n", 88 | " \"network\",\n", 89 | " \"northwind\",\n", 90 | " \"ClinicalKnowledgeGraph\"\n", 91 | "]\n" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 4, 97 | "id": "474866e2-b4a9-4b98-962d-11754df7fde1", 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "claude-opus-4-20250514\n" 105 | ] 106 | }, 107 | { 108 | "name": "stderr", 109 | "output_type": "stream", 110 | "text": [ 111 | "Processing databases: 0%| | 0/5 [00:00\n", 251 | "\n", 264 | "\n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " 
\n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | "
| | question | cypher | query_type | complexity | noise_applied | model | database | validated | result | answer | noise_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What's Microsoft Corporation's revenue? | MATCH (o:Organization {name: 'Microsoft Corpor... | Direct Property Access | 0-hop | False | anthropic-chat | companies | True | [{'company_revenue': 198270000000.0}] | According to the data, Microsoft Corporation's... | NaN |
| 1 | How many employees does SAP have? | MATCH (o:Organization {name: 'SAP'}) RETURN o.... | Direct Property Access | 0-hop | False | anthropic-chat | companies | True | [{'employee_count': 111961}] | According to the data, SAP has 111,961 employees. | NaN |
| 2 | Tell me Mircosoft Corporation's motto | MATCH (o:Organization {name: 'Microsoft Corpor... | Direct Property Access | 0-hop | True | anthropic-chat | companies | True | [{'company_motto': 'We’re on a mission to empo... | Microsoft Corporation's motto is: \"We're on a ... | typo |
| 3 | Is Cloud Sherpas still active? | MATCH (o:Organization {name: 'Cloud Sherpas'})... | Direct Property Access | 0-hop | False | anthropic-chat | companies | True | [{'is_dissolved': True}] | No, Cloud Sherpas is not still active. The que... | NaN |
| 4 | What's the revenue of New Energy Group? | MATCH (o:Organization {name: 'New Energy Group... | Direct Property Access | 0-hop | False | anthropic-chat | companies | True | [{'company_revenue': 120000000.0}] | The revenue of New Energy Group is $120,000,000. | NaN |
\n", 354 | "" 355 | ], 356 | "text/plain": [ 357 | " question \\\n", 358 | "0 What's Microsoft Corporation's revenue? \n", 359 | "1 How many employees does SAP have? \n", 360 | "2 Tell me Mircosoft Corporation's motto \n", 361 | "3 Is Cloud Sherpas still active? \n", 362 | "4 What's the revenue of New Energy Group? \n", 363 | "\n", 364 | " cypher query_type \\\n", 365 | "0 MATCH (o:Organization {name: 'Microsoft Corpor... Direct Property Access \n", 366 | "1 MATCH (o:Organization {name: 'SAP'}) RETURN o.... Direct Property Access \n", 367 | "2 MATCH (o:Organization {name: 'Microsoft Corpor... Direct Property Access \n", 368 | "3 MATCH (o:Organization {name: 'Cloud Sherpas'})... Direct Property Access \n", 369 | "4 MATCH (o:Organization {name: 'New Energy Group... Direct Property Access \n", 370 | "\n", 371 | " complexity noise_applied model database validated \\\n", 372 | "0 0-hop False anthropic-chat companies True \n", 373 | "1 0-hop False anthropic-chat companies True \n", 374 | "2 0-hop True anthropic-chat companies True \n", 375 | "3 0-hop False anthropic-chat companies True \n", 376 | "4 0-hop False anthropic-chat companies True \n", 377 | "\n", 378 | " result \\\n", 379 | "0 [{'company_revenue': 198270000000.0}] \n", 380 | "1 [{'employee_count': 111961}] \n", 381 | "2 [{'company_motto': 'We’re on a mission to empo... \n", 382 | "3 [{'is_dissolved': True}] \n", 383 | "4 [{'company_revenue': 120000000.0}] \n", 384 | "\n", 385 | " answer noise_type \n", 386 | "0 According to the data, Microsoft Corporation's... NaN \n", 387 | "1 According to the data, SAP has 111,961 employees. NaN \n", 388 | "2 Microsoft Corporation's motto is: \"We're on a ... typo \n", 389 | "3 No, Cloud Sherpas is not still active. The que... NaN \n", 390 | "4 The revenue of New Energy Group is $120,000,000. 
NaN " 391 | ] 392 | }, 393 | "execution_count": 11, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "# Drop records whose answer is UNKNOWN (the question could not be answered from the query result)\n", 400 | "validated = [el for el in validated if \"UNKNOWN\" not in el['answer']]\n", 401 | "\n", 402 | "df = pd.DataFrame.from_records(validated)\n", 403 | "print(f\"Total QA pairs: {len(df)}\")\n", 404 | "df.head(5)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 12, 410 | "id": "7938bb90-115c-4060-ba5e-d1bbbb61e7ac", 411 | "metadata": {}, 412 | "outputs": [], 413 | "source": [ 414 | "# Save the validated QA pairs; convert_datetime handles values json cannot serialize natively\n", 415 | "with open(\"generated_dataset.json\", \"w\") as f:\n", 416 | " json.dump(validated, f, indent=2, default=convert_datetime)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "code", 421 | "execution_count": null, 422 | "id": "438a0ce3-6f5e-4b6e-865d-c8afb34c5045", 423 | "metadata": {}, 424 | "outputs": [], 425 | "source": [] 426 | } 427 | ], 428 | "metadata": { 429 | "kernelspec": { 430 | "display_name": "Python 3 (ipykernel)", 431 | "language": "python", 432 | "name": "python3" 433 | }, 434 | "language_info": { 435 | "codemirror_mode": { 436 | "name": "ipython", 437 | "version": 3 438 | }, 439 | "file_extension": ".py", 440 | "mimetype": "text/x-python", 441 | "name": "python", 442 | "nbconvert_exporter": "python", 443 | "pygments_lexer": "ipython3", 444 | "version": "3.12.7" 445 | } 446 | }, 447 | "nbformat": 4, 448 | "nbformat_minor": 5 449 | } 450 | -------------------------------------------------------------------------------- /custom_database/prompts.py: -------------------------------------------------------------------------------- 1 | system_prompt = """ 2 | # Text2Cypher Dataset Generation System Prompt 3 | 4 | You are a synthetic dataset generator for text2cypher evaluation. 
Your task is to create **25 question-answer pairs** where questions are in natural language and answers are Cypher queries that return **single values only** (not result sets). 5 | 6 | ## Input Format 7 | You will receive: 8 | 1. **Graph schema** in this format: 9 | ``` 10 | Node properties are the following: 11 | {NODE_TYPE} {property1: TYPE, property2: TYPE, ...} 12 | 13 | Relationship properties are the following: 14 | {RELATIONSHIP_TYPE} {property1: TYPE, property2: TYPE, ...} 15 | 16 | The relationships are the following: 17 | (:NodeType)-[:RELATIONSHIP_TYPE]->(:NodeType) 18 | ``` 19 | 20 | 2. **Sample paths** from the actual graph data showing real nodes and relationships: 21 | ``` 22 | [{'pathLength': 2, 23 | 'nodesInfo': [{'labels': ['Person'], 'props': {'name': 'Tom Hanks'}}, 24 | {'labels': ['Movie'], 'props': {'title': 'Forrest Gump', 'imdbRating': 8.8}}], 25 | 'relsInfo': [{'type': 'ACTED_IN', 'props': {'role': 'Forrest'}}], 26 | 'pathSignature': '(:Person)-[:ACTED_IN]->(:Movie)'}] 27 | ``` 28 | 29 | ## Core Requirements 30 | 31 | ### 1. Multi-Hop Focus 32 | - Generate questions that require **2-4 relationship traversals** 33 | - Focus on meaningful path patterns: Person→Company→City, Person→Knows→Person→WorksAt→Company 34 | - Avoid single-hop queries or simple property lookups 35 | 36 | ### 2. Single-Value Answers Only 37 | All Cypher queries must return exactly one value: 38 | - **Aggregations**: COUNT, SUM, AVG, MAX, MIN 39 | - **Property retrieval**: Single property from end node 40 | - **Existence verification**: Count of paths, shortest path length, count of connecting entities, or single connecting entity property 41 | - **Path metrics**: Shortest path length, hop count, connection count 42 | 43 | ### 3. 
Column Aliasing Requirements 44 | **MANDATORY**: All Cypher queries must include meaningful column aliases using the `AS` clause: 45 | 46 | **Aggregation Query Aliases:** 47 | - `COUNT(DISTINCT entity) AS total_count` 48 | - `COUNT(DISTINCT entity) AS num_entities` 49 | - `SUM(property) AS total_amount` 50 | - `AVG(property) AS average_value` 51 | - `MAX(property) AS highest_value` 52 | - `MIN(property) AS lowest_value` 53 | 54 | **Property Retrieval Aliases:** 55 | - `entity.property AS entity_property_name` 56 | - `m.title AS movie_title` 57 | - `p.name AS person_name` 58 | - `c.industry AS company_industry` 59 | - `city.name AS city_name` 60 | 61 | **Existence Verification Aliases:** 62 | - `COUNT(path) AS connection_count` 63 | - `COUNT(DISTINCT intermediate) AS bridge_count` 64 | - `LENGTH(path) AS path_length` 65 | - `COUNT(DISTINCT connecting_entity) AS connector_count` 66 | 67 | **Sorting Query Aliases:** 68 | - `MAX(property) AS highest_property` 69 | - `MIN(property) AS lowest_property` 70 | - `entity.property AS top_property` (when using ORDER BY with LIMIT 1) 71 | 72 | **Examples of Proper Aliasing:** 73 | ```cypher 74 | // Aggregation 75 | MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a:Person) 76 | RETURN COUNT(DISTINCT a) AS total_coactors 77 | 78 | // Property Retrieval 79 | MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) 80 | RETURN city.name AS workplace_city 81 | 82 | // Sorting 83 | MATCH (p:Person {name: 'Director'})-[:DIRECTED]->(m:Movie) 84 | WHERE m.imdbRating IS NOT NULL 85 | RETURN MAX(m.imdbRating) AS highest_rating 86 | 87 | // Existence Verification 88 | MATCH path = shortestPath((p1:Person {name: 'Actor1'})-[*]-(p2:Person {name: 'Actor2'})) 89 | RETURN LENGTH(path) AS connection_distance 90 | ``` 91 | 92 | ### 4. 
Cypher Syntax Requirements 93 | **IMPORTANT**: When using multiple relationship types in a single pattern, use the correct Cypher syntax: 94 | - **Correct**: `(node)-[:REL_TYPE1|REL_TYPE2]->(node)` 95 | - **Incorrect**: `(node)-[:REL_TYPE1|:REL_TYPE2]->(node)` 96 | 97 | Do NOT include the colon (`:`) before subsequent relationship types in the union pattern. For example: 98 | - ✅ `(person)-[:WROTE|DIRECTED]->(movie)` 99 | - ❌ `(person)-[:WROTE|:DIRECTED]->(movie)` 100 | 101 | ### 5. Generation Process 102 | 103 | 104 | 105 | **Step 1: Schema and Path Analysis** 106 | - Parse the provided schema to identify all node types and relationships 107 | - Analyze the sample paths to understand real entity names and property values 108 | - Focus on 2-4 hop paths that represent realistic analytical questions 109 | 110 | **Step 2: Question Template Types** 111 | Create diverse question types using domain-appropriate vocabulary. **You have creative freedom to vary phrasing, structure, and vocabulary beyond these templates.** Adapt terminology to match your target domain (e.g., movies: "films", "actors", "directors"; business: "companies", "employees", "revenue"; social: "people", "friends", "connections"). 112 | 113 | **Aggregation Templates (COUNT, SUM, AVG, MIN, MAX):** 114 | - COUNT variations: "How many...", "What's the total number of...", "Count the...", "How many different...", "What's the count of..." 115 | - SUM variations: "What's the total...", "Sum up the...", "What's the combined...", "Add up all the..." 116 | - AVG variations: "What's the average...", "What's the mean...", "On average, what's the...", "What's the typical..." 117 | - Domain-specific examples: 118 | - Movies: "How many films has [actor] appeared in with directors who worked with [other_actor]?" 119 | - Business: "What's the total revenue of companies that compete with [person]'s employer?" 120 | - Social: "How many friends does [person] have through their work connections?" 
121 | 122 | **Property Retrieval Templates:** 123 | - Direct property access: "What's the [property] of...", "What [property] does... have?", "Which [property] belongs to..." 124 | - Relationship-based: "What [property] connects [entity1] to [entity2]?", "Through what [property] are they linked?" 125 | - Path-based: "What [property] can [entity] reach through [path]?" 126 | - Domain-specific examples: 127 | - Movies: "What's the rating of the highest-rated film starring actors who worked with [director]?" 128 | - Business: "What's the industry of companies that [person] can reach through their professional network?" 129 | - Social: "What city does [person] live in through their friendship connections?" 130 | 131 | **Sorting Templates (MAX/MIN with ORDER BY):** 132 | - Maximum: "What's the highest...", "Which has the maximum...", "What's the peak...", "What's the top..." 133 | - Minimum: "What's the lowest...", "Which has the minimum...", "What's the smallest...", "What's the bottom..." 134 | - Superlative forms: "What's the most...", "What's the least...", "Which is the greatest...", "Which is the smallest..." 135 | - Domain-specific examples: 136 | - Movies: "What's the highest budget of films directed by people who acted alongside [actor]?" 137 | - Business: "Which company has the lowest revenue among those connected to [person]'s network?" 138 | - Social: "What's the oldest age of people [person] can reach through mutual friends?" 139 | 140 | **Existence Templates (Verification with Evidence):** 141 | Instead of simple boolean checks, existence queries should return **evidence of the connection**: 142 | - **Count-based verification**: "How many paths connect [entity1] to [entity2]?", "How many ways can [entity] reach [entity2]?" 143 | - **Shortest path length**: "What's the shortest path length between [entity1] and [entity2]?", "How many steps does it take to connect [entity1] to [entity2]?" 
144 | - **Intermediate entity verification**: "Which [entity_type] connects [entity1] to [entity2]?", "What [entity_type] serves as a bridge between [entity1] and [entity2]?" 145 | - **Path existence with count**: "How many [intermediate_entities] link [entity1] to [entity2] through [relationship_chain]?" 146 | - Domain-specific examples: 147 | - Movies: "How many directors connect [actor1] to [actor2] through collaborations?" 148 | - Business: "What's the shortest path length between [person] and companies in the tech industry?" 149 | - Social: "How many mutual friends do [person1] and [person2] have?" 150 | 151 | **Cypher patterns for existence verification:** 152 | - Path counting: `RETURN COUNT(path) AS connection_count` or `RETURN COUNT(DISTINCT intermediate_entity) AS bridge_count` 153 | - Shortest path: `RETURN LENGTH(shortestPath(...)) AS path_length` 154 | - Intermediate entity: `RETURN intermediate_entity.property AS bridge_property LIMIT 1` 155 | - Connection verification: `RETURN COUNT(DISTINCT connecting_entity) AS connector_count` 156 | 157 | **Creative Variations Encouraged:** 158 | - **Vary sentence structure**: Use questions, statements, imperatives 159 | - **Use domain synonyms**: "films/movies", "actors/performers", "companies/firms", "people/individuals" 160 | - **Add context**: "In [person]'s network...", "Among [entity_type] connected to...", "Through [relationship_type] relationships..." 161 | - **Use natural language**: "folks", "stuff", "things" (especially for noise injection) 162 | - **Experiment with phrasing**: "Tell me the...", "Find the...", "I need to know...", "Show me the..." 163 | 164 | **Step 3: Domain-Aware Entity Substitution** 165 | - **Use real entity names** from the provided sample paths 166 | - **Adapt vocabulary to domain**: Match the graph schema's domain (movies, business, social networks, etc.) 
167 | - **Use domain-appropriate terminology**: 168 | - Movies: "films", "actors", "directors", "cast", "crew", "productions", "ratings", "box office" 169 | - Business: "companies", "employees", "executives", "revenue", "profits", "industries", "competitors" 170 | - Social: "people", "friends", "connections", "network", "relationships", "communities" 171 | - Academic: "researchers", "papers", "citations", "institutions", "publications", "collaborations" 172 | - **Ensure semantic consistency**: If schema uses "Person" nodes, questions can use "actors", "people", "individuals" interchangeably 173 | - **Validate realistic paths**: Ensure paths actually exist between chosen entities 174 | - **Check for meaningful results**: Validate queries return non-null, non-zero values when appropriate 175 | 176 | **Step 4: Sorting Query Requirements** 177 | When creating sorting queries, **always include WHERE clauses** to filter out null values: 178 | ```cypher 179 | WHERE entity.property IS NOT NULL 180 | ORDER BY entity.property DESC/ASC 181 | LIMIT 1 182 | ``` 183 | 184 | ## Output Requirements 185 | 186 | Generate exactly **25 question-answer pairs** based on the provided schema and sample paths. 
**Use creative freedom to vary question phrasing, structure, and vocabulary while maintaining the required distribution and domain appropriateness.** 187 | 188 | **Distribution Guidelines (Flexible Based on Domain):** 189 | - **Target Distribution**: Aim for roughly equal distribution across query types: 190 | - ~6-7 Aggregation queries (COUNT, SUM, AVG, MAX, MIN) 191 | - ~6-7 Property Retrieval queries (single property access) 192 | - ~6-7 Sorting queries (MAX/MIN with ORDER BY) 193 | - ~6-7 Existence Verification queries (path counts, shortest paths, connecting entities) 194 | 195 | - **Domain-Driven Flexibility**: **Do NOT force weird or non-semantic questions just to achieve exact distributions.** If the domain or graph schema isn't ideal for certain query types, produce more of those which make natural sense for users of this data/platform. For example: 196 | - Movie databases naturally lend themselves to aggregation queries about collaboration patterns 197 | - Social networks are ideal for existence verification and path-based queries 198 | - Business networks may have more property retrieval queries about company metrics 199 | - Academic citation networks favor aggregation and sorting queries 200 | 201 | - **Quality Over Rigid Distribution**: Prioritize creating realistic, meaningful questions that actual users would ask over maintaining exact 6-7-6-7 distribution. A distribution like 8-5-7-5 is perfectly acceptable if it results in more natural, semantically meaningful questions. 202 | 203 | **Creative Guidelines:** 204 | - **Vary question structure**: Mix interrogative, declarative, and imperative forms 205 | - **Use domain-appropriate vocabulary**: Match terminology to the graph schema's domain 206 | - **Experiment with phrasing**: Don't stick rigidly to templates - be creative! 207 | - **Include natural variations**: "What's the...", "Tell me the...", "Find the...", "I need to know..." 
208 | - **Add contextual phrases**: "In [entity]'s network...", "Among [entity_type] connected to...", "Through [relationship] relationships..." 209 | 210 | **Noise Injection (40% of questions):** 211 | Apply to exactly 10 questions (spread across different query types): 212 | - **Typos in named entities**: Minor misspellings of person/movie/company names: 213 | - "Tom Hanks" → "Tom Henks" 214 | - "Georges Méliès" → "George Melies" 215 | - "Harold Lloyd" → "Harold Loyd" 216 | - "Microsoft" → "Mircosoft" 217 | - "Google" → "Googel" 218 | - **Colloquialisms and informal language**: 219 | - "people" → "folks", "things" → "stuff", "movies" → "flicks", "picture" → "pic" 220 | - "company" → "firm", "biz", "corp" 221 | - "person" → "guy", "individual", "someone" 222 | - "work with" → "collaborate with", "team up with" 223 | - **Grammatical variations**: 224 | - "companies" → "companys", "actors" → "actorss" 225 | - "Who has worked" → "Who's worked", "What is" → "What's" 226 | - **Domain-specific slang**: 227 | - Movies: "blockbuster", "indie film", "A-lister", "supporting actor" 228 | - Business: "startup", "enterprise", "C-suite", "workforce" 229 | - Social: "buddy", "acquaintance", "circle", "network" 230 | 231 | **Vocabulary Matching Guidelines:** 232 | - **Analyze the domain** from node types and relationships in the schema 233 | - **Use appropriate synonyms** that match the domain context 234 | - **Maintain semantic consistency** throughout questions 235 | - **Avoid ambiguous pronouns** ("their", "his", "her") - always use specific entity names 236 | - **Match formality level** to the domain (academic vs. casual social networks) 237 | 238 | ### 6. 
Quality Assurance 239 | - Validate each Cypher query returns exactly one value 240 | - Ensure multi-hop paths are semantically meaningful 241 | - Test queries against your actual graph schema 242 | - Filter out queries that always return 0/null 243 | - **Verify all queries include meaningful column aliases** 244 | 245 | ### Complexity Field Requirements 246 | 247 | Add a "complexity" field to each output that counts the number of relationship traversals in the Cypher query: 248 | - **0-hop**: Queries that don't traverse any relationships (just node property access) 249 | - **1-hop**: Queries that traverse one relationship 250 | - **2-hop**: Queries that traverse two relationships 251 | - **3-hop**: Queries that traverse three relationships 252 | - **4-hop**: Queries that traverse four relationships 253 | - **5+ hop**: Queries that traverse five or more relationships 254 | 255 | **Complexity Calculation Examples:** 256 | - `(p:Person {name: 'Tom Hanks'}) RETURN p.name` = **0-hop** 257 | - `(p:Person)-[:ACTED_IN]->(m:Movie)` = **1-hop** 258 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)` = **2-hop** 259 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)` = **3-hop** 260 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person)` = **4-hop** 261 | 262 | Count each relationship traversal (both `-[:REL]->` and `<-[:REL]-`) as one hop, regardless of direction. 263 | 264 | ## Output Format 265 | 266 | First, provide your analysis and reasoning in `` tags, then return exactly 25 JSON objects in the specified format. 267 | 268 | 269 | [Explain your analysis of the provided schema and sample paths, your approach to generating domain-appropriate questions, the distribution of query types you chose, and any specific considerations for the domain. Include your reasoning for entity selection, path complexity, and noise injection choices.] 
270 | 271 | 272 | ```json 273 | [ 274 | { 275 | "question": "How many actors appeared in films with directors who also worked with Tom Henks?", 276 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person) RETURN COUNT(DISTINCT a) AS total_coactors", 277 | "query_type": "Aggregation", 278 | "complexity": "4-hop", 279 | "noise_applied": true, 280 | "noise_type": "typo" 281 | }, 282 | { 283 | "question": "What's the peak imdbRating among films directed by people who performed alongside Georges Méliès?", 284 | "cypher": "MATCH (p:Person {name: 'Georges Méliès'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a:Person)-[:DIRECTED]->(m2:Movie) WHERE m2.imdbRating IS NOT NULL RETURN MAX(m2.imdbRating) AS highest_rating", 285 | "query_type": "Sorting", 286 | "complexity": "3-hop", 287 | "noise_applied": false 288 | }, 289 | { 290 | "question": "Tell me the title of the movie that connects these two actors through their collaborations?", 291 | "cypher": "MATCH (p1:Person {name: 'Actor1'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person {name: 'Actor2'}) RETURN m.title AS connecting_movie LIMIT 1", 292 | "query_type": "Property Retrieval", 293 | "complexity": "2-hop", 294 | "noise_applied": false 295 | }, 296 | { 297 | "question": "What's the name of Tom Hanks?", 298 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'}) RETURN p.name AS person_name", 299 | "query_type": "Property Retrieval", 300 | "complexity": "0-hop", 301 | "noise_applied": false 302 | }, 303 | { 304 | "question": "How many folks appeared in flicks with directors who also worked with Tom Hanks?", 305 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person) RETURN COUNT(DISTINCT a) AS total_coactors", 306 | "query_type": "Aggregation", 307 | "complexity": "4-hop", 308 | "noise_applied": true, 309 | "noise_type": "colloquialism" 310 | }, 311 | { 312 
| "question": "What's the highest imdbRating among movies directed by George Melies?", 313 | "cypher": "MATCH (p:Person {name: 'Georges Méliès'})-[:DIRECTED]->(m:Movie) WHERE m.imdbRating IS NOT NULL RETURN MAX(m.imdbRating) AS highest_rating", 314 | "query_type": "Sorting", 315 | "complexity": "1-hop", 316 | "noise_applied": true, 317 | "noise_type": "typo" 318 | }, 319 | { 320 | "question": "How many directors connect Tom Hanks to actors who worked with Steven Spielberg?", 321 | "cypher": "MATCH (p1:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m1:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(p2:Person)-[:ACTED_IN]->(m3:Movie)<-[:DIRECTED]-(spielberg:Person {name: 'Steven Spielberg'}) RETURN COUNT(DISTINCT d) AS connecting_directors", 322 | "query_type": "Existence Verification", 323 | "complexity": "5+ hop", 324 | "noise_applied": false 325 | }, 326 | { 327 | "question": "What's the shortest path length between Alice and any executive in the tech industry?", 328 | "cypher": "MATCH path = shortestPath((p:Person {name: 'Alice'})-[*]-(exec:Person)-[:WORKS_AT]->(c:Company {industry: 'Technology'})) WHERE exec.position CONTAINS 'executive' RETURN LENGTH(path) AS connection_distance", 329 | "query_type": "Existence Verification", 330 | "complexity": "3-hop", 331 | "noise_applied": false 332 | } 333 | ] 334 | ``` 335 | 336 | **Begin generation after receiving the schema and sample paths. Remember to be creative with your question phrasing while maintaining domain appropriateness and the required distribution of query types. Most importantly, ensure all Cypher queries include meaningful column aliases using the AS clause.** 337 | """ 338 | 339 | user_prompt = """ 340 | Graph schema: 341 | {schema} 342 | 343 | Sample paths: 344 | {paths} 345 | """ 346 | 347 | qa_system_prompt = """You are a helpful assistant that answers questions using data from a Neo4j database. 
348 | Given a natural language question, the Cypher query used to answer it, and the query result, return a 349 | concise and accurate answer based only on the result. If the Cypher query cannot provide sufficient information 350 | to answer the question, respond with "UNKNOWN" rather than making assumptions or providing information 351 | not found in the query results.""" 352 | 353 | qa_user_prompt = """Question: {question} 354 | Cypher Query: {cypher_query} 355 | Query Result: {result} 356 | 357 | Provide a concise answer to the question using only the query result. 358 | If the provided data isn't related to the question, answer 'UNKNOWN'. """ 359 | 360 | simple_system_prompt = """ 361 | # Text2Cypher Dataset Generation System Prompt 362 | 363 | You are a synthetic dataset generator for text2cypher evaluation. Your task is to create **25 question-answer pairs** where questions are in natural language and answers are Cypher queries that return **single values only** (not result sets). 364 | 365 | ## Input Format 366 | You will receive: 367 | 1. **Graph schema** in this format: 368 | ``` 369 | Node properties are the following: 370 | {NODE_TYPE} {property1: TYPE, property2: TYPE, ...} 371 | 372 | Relationship properties are the following: 373 | {RELATIONSHIP_TYPE} {property1: TYPE, property2: TYPE, ...} 374 | 375 | The relationships are the following: 376 | (:NodeType)-[:RELATIONSHIP_TYPE]->(:NodeType) 377 | ``` 378 | 379 | 2. **Sample paths** from the actual graph data showing real nodes and relationships: 380 | ``` 381 | [{'pathLength': 2, 382 | 'nodesInfo': [{'labels': ['Person'], 'props': {'name': 'Tom Hanks'}}, 383 | {'labels': ['Movie'], 'props': {'title': 'Forrest Gump', 'imdbRating': 8.8}}], 384 | 'relsInfo': [{'type': 'ACTED_IN', 'props': {'role': 'Forrest'}}], 385 | 'pathSignature': '(:Person)-[:ACTED_IN]->(:Movie)'}] 386 | ``` 387 | 388 | ## Core Requirements 389 | 390 | ### 1. 
Simple to Moderate Complexity Focus 391 | - Generate questions that require **0-2 relationship traversals** 392 | - Include **simple property lookups** (direct node property access) 393 | - Include **single-hop queries** (one relationship traversal) 394 | - Include **two-hop queries** (two relationship traversals) 395 | - Focus on commonly asked, straightforward questions that users would naturally pose 396 | 397 | ### 2. Single-Value Answers Only 398 | All Cypher queries must return exactly one value: 399 | - **Direct property access**: Single property from a specific node 400 | - **Aggregations**: COUNT, SUM, AVG, MAX, MIN (over direct relationships or simple patterns) 401 | - **Property retrieval**: Single property from nodes reached via 1-2 hops 402 | - **Existence verification**: Count of direct relationships, simple path existence, or basic connection counts 403 | 404 | ### 3. Column Aliasing Requirements 405 | **MANDATORY**: All Cypher queries must include meaningful column aliases using the `AS` clause: 406 | 407 | **Direct Property Access Aliases:** 408 | - `entity.property AS entity_property_name` 409 | - `p.name AS person_name` 410 | - `m.title AS movie_title` 411 | - `c.industry AS company_industry` 412 | - `m.imdbRating AS movie_rating` 413 | 414 | **Aggregation Query Aliases:** 415 | - `COUNT(entity) AS total_count` 416 | - `COUNT(DISTINCT entity) AS unique_count` 417 | - `SUM(property) AS total_amount` 418 | - `AVG(property) AS average_value` 419 | - `MAX(property) AS highest_value` 420 | - `MIN(property) AS lowest_value` 421 | 422 | **Simple Path Aliases:** 423 | - `related_entity.property AS related_property` 424 | - `connected_node.name AS connected_name` 425 | - `COUNT(relationship) AS connection_count` 426 | 427 | **Examples of Proper Aliasing:** 428 | ```cypher 429 | // Direct Property Access 430 | MATCH (p:Person {name: 'Tom Hanks'}) 431 | RETURN p.age AS person_age 432 | 433 | // Single-hop Aggregation 434 | MATCH (p:Person {name: 'Tom 
Hanks'})-[:ACTED_IN]->(m:Movie) 435 | RETURN COUNT(m) AS total_movies 436 | 437 | // Two-hop Property Retrieval 438 | MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) 439 | RETURN city.name AS workplace_city 440 | 441 | // Single-hop Aggregation (maximum) 442 | MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) 443 | WHERE m.imdbRating IS NOT NULL 444 | RETURN MAX(m.imdbRating) AS highest_rating 445 | ``` 446 | 447 | ### 4. Cypher Syntax Requirements 448 | **IMPORTANT**: When using multiple relationship types in a single pattern, use the correct Cypher syntax: 449 | - **Correct**: `(node)-[:REL_TYPE1|REL_TYPE2]->(node)` 450 | - **Incorrect**: `(node)-[:REL_TYPE1|:REL_TYPE2]->(node)` 451 | 452 | Do NOT include the colon (`:`) before subsequent relationship types in the union pattern. 453 | 454 | ### 5. Generation Process 455 | 456 | ## Generation Process 457 | 458 | **Step 1: Schema and Path Analysis** 459 | - Parse the provided schema to identify all node types and relationships 460 | - Analyze the sample paths to understand real entity names and property values 461 | - Focus on 0-2 hop paths that represent basic, commonly asked questions 462 | 463 | **Step 2: Question Template Types** 464 | Create diverse question types using domain-appropriate vocabulary. **You have creative freedom to vary phrasing, structure, and vocabulary beyond these templates.** Adapt terminology to match your target domain. 465 | 466 | **Direct Property Access Templates:** 467 | - "What's the [property] of [entity]?" 468 | - "What [property] does [entity] have?" 469 | - "Tell me [entity]'s [property]" 470 | - "What is [entity]'s [property]?" 471 | - Domain-specific examples: 472 | - Movies: "What's Tom Hanks' age?", "What's the rating of Forrest Gump?" 473 | - Business: "What's Microsoft's industry?", "What's Alice's position?" 474 | - Social: "What's John's age?", "What city does Sarah live in?" 
475 | 476 | **Single-hop Aggregation Templates:** 477 | - "How many [related_entities] does [entity] have?" 478 | - "What's the total number of [related_entities] for [entity]?" 479 | - "Count [entity]'s [related_entities]" 480 | - "What's the average [property] of [entity]'s [related_entities]?" 481 | - Domain-specific examples: 482 | - Movies: "How many movies has Tom Hanks acted in?", "What's the average rating of Steven Spielberg's films?" 483 | - Business: "How many employees does Microsoft have?", "What's the total revenue of Google's subsidiaries?" 484 | - Social: "How many friends does Alice have?", "What's the average age of John's connections?" 485 | 486 | **Single-hop Property Retrieval Templates:** 487 | - "What's the [property] of [entity]'s [related_entity]?" 488 | - "What [property] does [entity] work for/belong to/connect to?" 489 | - "Tell me the [property] of [entity]'s [relationship]" 490 | - Domain-specific examples: 491 | - Movies: "What's the title of Tom Hanks' highest-rated movie?", "What genre is this actor's latest film?" 492 | - Business: "What's the industry of Alice's company?", "What's the revenue of John's employer?" 493 | - Social: "What's the name of Sarah's best friend?", "What city does Alice's friend live in?" 494 | 495 | **Two-hop Property Access Templates:** 496 | - "What's the [property] that [entity] can reach through [intermediate]?" 497 | - "What [property] connects [entity] via [path]?" 498 | - "Through [relationship], what [property] does [entity] access?" 499 | - Domain-specific examples: 500 | - Movies: "What studio distributed Tom Hanks' director's latest film?", "What's the budget of movies made by actors who worked with Steven Spielberg?" 501 | - Business: "What city is Alice's company located in?", "What industry do John's colleagues work in?" 502 | - Social: "What company do Alice's friends work for?", "What city do John's connections live in?" 
503 | 504 | **Simple Existence/Count Templates:** 505 | - "How many [entities] are connected to [entity]?" 506 | - "Does [entity] have any [related_entities]?" 507 | - "How many [intermediate_entities] connect [entity1] to [entity2]?" 508 | - "What's the count of [entity]'s [direct_relationships]?" 509 | - Domain-specific examples: 510 | - Movies: "How many co-actors does Tom Hanks have?", "How many directors has this actor worked with?" 511 | - Business: "How many companies is Alice connected to?", "How many colleagues work in the same department?" 512 | - Social: "How many mutual friends do Alice and Bob have?", "How many people live in the same city as John?" 513 | 514 | **Creative Variations Encouraged:** 515 | - **Vary sentence structure**: Use questions, statements, imperatives 516 | - **Use domain synonyms**: "films/movies", "actors/performers", "companies/firms", "people/individuals" 517 | - **Add context**: "In [entity]'s profile...", "For [entity]...", "About [entity]..." 518 | - **Use natural language**: "folks", "stuff", "things" (especially for noise injection) 519 | - **Experiment with phrasing**: "Tell me...", "Find...", "I need to know...", "Show me..." 520 | 521 | **Step 3: Domain-Aware Entity Substitution** 522 | - **Use real entity names** from the provided sample paths 523 | - **Adapt vocabulary to domain**: Match the graph schema's domain (movies, business, social networks, etc.) 
524 | - **Use domain-appropriate terminology**: 525 | - Movies: "films", "actors", "directors", "cast", "crew", "productions", "ratings", "box office" 526 | - Business: "companies", "employees", "executives", "revenue", "profits", "industries", "competitors" 527 | - Social: "people", "friends", "connections", "network", "relationships", "communities" 528 | - Academic: "researchers", "papers", "citations", "institutions", "publications", "collaborations" 529 | - **Ensure semantic consistency**: If schema uses "Person" nodes, questions can use "actors", "people", "individuals" interchangeably 530 | - **Validate realistic paths**: Ensure paths actually exist between chosen entities 531 | - **Check for meaningful results**: Validate queries return non-null, non-zero values when appropriate 532 | 533 | **Step 4: Sorting Query Requirements** 534 | When creating sorting queries, **always include WHERE clauses** to filter out null values: 535 | ```cypher 536 | WHERE entity.property IS NOT NULL 537 | ORDER BY entity.property DESC/ASC 538 | LIMIT 1 539 | ``` 540 | 541 | ## Output Requirements 542 | 543 | Generate exactly **25 question-answer pairs** based on the provided schema and sample paths. 
**Use creative freedom to vary question phrasing, structure, and vocabulary while maintaining the required distribution and domain appropriateness.** 544 | 545 | **Distribution Guidelines (Flexible Based on Domain):** 546 | - **Target Distribution**: Aim for roughly equal distribution across complexity levels: 547 | - ~8 Direct Property Access queries (0 hops - simple property lookups) 548 | - ~8 Single-hop queries (1 relationship traversal) 549 | - ~9 Two-hop queries (2 relationship traversals) 550 | 551 | - **Query Type Breakdown Within Each Complexity Level:** 552 | - **Direct Property Access**: Simple property retrieval, basic node attributes 553 | - **Single-hop**: Aggregations over direct relationships, property access via one relationship 554 | - **Two-hop**: Property access via two relationships, simple multi-step aggregations 555 | 556 | - **Domain-Driven Flexibility**: **Do NOT force weird or non-semantic questions just to achieve exact distributions.** If the domain or graph schema isn't ideal for certain complexity levels, produce more of those which make natural sense for users of this data/platform. For example: 557 | - Movie databases may have more direct property queries about ratings and titles 558 | - Social networks are ideal for single-hop friendship and connection queries 559 | - Business networks may have more two-hop queries about company locations and industries 560 | - Academic citation networks favor aggregation queries about publications 561 | 562 | - **Quality Over Rigid Distribution**: Prioritize creating realistic, meaningful questions that actual users would ask over maintaining exact 8-8-9 distribution. A distribution like 10-8-7 or 7-9-9 is perfectly acceptable if it results in more natural, semantically meaningful questions. 
563 | 564 | **Creative Guidelines:** 565 | - **Vary question structure**: Mix interrogative, declarative, and imperative forms 566 | - **Use domain-appropriate vocabulary**: Match terminology to the graph schema's domain 567 | - **Experiment with phrasing**: Don't stick rigidly to templates - be creative! 568 | - **Include natural variations**: "What's the...", "Tell me the...", "Find the...", "I need to know..." 569 | - **Add contextual phrases**: "For [entity]...", "About [entity]...", "Regarding [entity]..." 570 | 571 | **Noise Injection (40% of questions):** 572 | Apply to exactly 10 questions (spread across different complexity levels): 573 | - **Typos in named entities**: Minor misspellings of person/movie/company names: 574 | - "Tom Hanks" → "Tom Henks" 575 | - "Georges Méliès" → "George Melies" 576 | - "Harold Lloyd" → "Harold Loyd" 577 | - "Microsoft" → "Mircosoft" 578 | - "Google" → "Googel" 579 | - **Colloquialisms and informal language**: 580 | - "people" → "folks", "things" → "stuff", "movies" → "flicks", "picture" → "pic" 581 | - "company" → "firm", "biz", "corp" 582 | - "person" → "guy", "individual", "someone" 583 | - "work with" → "collaborate with", "team up with" 584 | - **Grammatical variations**: 585 | - "companies" → "companys", "actors" → "actorss" 586 | - "Who has worked" → "Who's worked", "What is" → "What's" 587 | - **Domain-specific slang**: 588 | - Movies: "blockbuster", "indie film", "A-lister", "supporting actor" 589 | - Business: "startup", "enterprise", "C-suite", "workforce" 590 | - Social: "buddy", "acquaintance", "circle", "network" 591 | 592 | **Vocabulary Matching Guidelines:** 593 | - **Analyze the domain** from node types and relationships in the schema 594 | - **Use appropriate synonyms** that match the domain context 595 | - **Maintain semantic consistency** throughout questions 596 | - **Avoid ambiguous pronouns** ("their", "his", "her") - always use specific entity names 597 | - **Match formality level** to the domain 
(academic vs. casual social networks) 598 | 599 | ### 6. Quality Assurance 600 | - Validate each Cypher query returns exactly one value 601 | - Ensure paths are semantically meaningful and commonly queried 602 | - Test queries against your actual graph schema 603 | - Filter out queries that always return 0/null 604 | - **Verify all queries include meaningful column aliases** 605 | - **Prioritize realistic, commonly asked questions over complex edge cases** 606 | 607 | ## Output Format 608 | 609 | First, provide your analysis and reasoning in `` tags, then return exactly 25 JSON objects in the specified format. 610 | 611 | 612 | [Explain your analysis of the provided schema and sample paths, your approach to generating domain-appropriate questions across different complexity levels (0-2 hops), the distribution of query types you chose, and any specific considerations for the domain. Include your reasoning for entity selection, complexity distribution, and noise injection choices. Focus on how you ensured questions represent common, realistic user queries.] 
613 | 614 | 615 | ```json 616 | [ 617 | { 618 | "question": "What's Tom Henks' age?", 619 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'}) RETURN p.age AS person_age", 620 | "query_type": "Direct Property Access", 621 | "complexity": "0-hop", 622 | "noise_applied": true, 623 | "noise_type": "typo" 624 | }, 625 | { 626 | "question": "How many movies has Tom Hanks acted in?", 627 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN COUNT(m) AS total_movies", 628 | "query_type": "Single-hop Aggregation", 629 | "complexity": "1-hop", 630 | "noise_applied": false 631 | }, 632 | { 633 | "question": "What's the industry of Alice's company?", 634 | "cypher": "MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company) RETURN c.industry AS company_industry", 635 | "query_type": "Single-hop Property Retrieval", 636 | "complexity": "1-hop", 637 | "noise_applied": false 638 | }, 639 | { 640 | "question": "What city is Alice's workplace located in?", 641 | "cypher": "MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) RETURN city.name AS workplace_city", 642 | "query_type": "Two-hop Property Retrieval", 643 | "complexity": "2-hop", 644 | "noise_applied": false 645 | }, 646 | { 647 | "question": "What's the rating of Forrest Gump?", 648 | "cypher": "MATCH (m:Movie {title: 'Forrest Gump'}) RETURN m.imdbRating AS movie_rating", 649 | "query_type": "Direct Property Access", 650 | "complexity": "0-hop", 651 | "noise_applied": false 652 | }, 653 | { 654 | "question": "How many flicks has Tom Hanks starred in?", 655 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN COUNT(m) AS total_movies", 656 | "query_type": "Single-hop Aggregation", 657 | "complexity": "1-hop", 658 | "noise_applied": true, 659 | "noise_type": "colloquialism" 660 | }, 661 | { 662 | "question": "What's the highest rating among Steven Spielberg's films?", 663 | "cypher": "MATCH (p:Person {name: 'Steven 
Spielberg'})-[:DIRECTED]->(m:Movie) WHERE m.imdbRating IS NOT NULL RETURN MAX(m.imdbRating) AS highest_rating", 664 | "query_type": "Single-hop Aggregation", 665 | "complexity": "1-hop", 666 | "noise_applied": false 667 | } 668 | ] 669 | ``` 670 | 671 | **Begin generation after receiving the schema and sample paths. Remember to be creative with your question phrasing while maintaining domain appropriateness and focusing on simple, commonly asked questions. Most importantly, ensure all Cypher queries include meaningful column aliases using the AS clause.** 672 | """ 673 | -------------------------------------------------------------------------------- /generate_eval_dataset/prompts.py: -------------------------------------------------------------------------------- 1 | system_prompt = """ 2 | # Text2Cypher Dataset Generation System Prompt 3 | 4 | You are a synthetic dataset generator for text2cypher evaluation. Your task is to create **25 question-answer pairs** where questions are in natural language and answers are Cypher queries that return **single values only** (not result sets). 5 | 6 | ## Input Format 7 | You will receive: 8 | 1. **Graph schema** in this format: 9 | ``` 10 | Node properties are the following: 11 | {NODE_TYPE} {property1: TYPE, property2: TYPE, ...} 12 | 13 | Relationship properties are the following: 14 | {RELATIONSHIP_TYPE} {property1: TYPE, property2: TYPE, ...} 15 | 16 | The relationships are the following: 17 | (:NodeType)-[:RELATIONSHIP_TYPE]->(:NodeType) 18 | ``` 19 | 20 | 2. **Sample paths** from the actual graph data showing real nodes and relationships: 21 | ``` 22 | [{'pathLength': 2, 23 | 'nodesInfo': [{'labels': ['Person'], 'props': {'name': 'Tom Hanks'}}, 24 | {'labels': ['Movie'], 'props': {'title': 'Forrest Gump', 'imdbRating': 8.8}}], 25 | 'relsInfo': [{'type': 'ACTED_IN', 'props': {'role': 'Forrest'}}], 26 | 'pathSignature': '(:Person)-[:ACTED_IN]->(:Movie)'}] 27 | ``` 28 | 29 | ## Core Requirements 30 | 31 | ### 1. 
Multi-Hop Focus 32 | - Generate questions that require **2-4 relationship traversals** 33 | - Focus on meaningful path patterns: Person→Company→City, Person→Knows→Person→WorksAt→Company 34 | - Avoid single-hop queries or simple property lookups 35 | 36 | ### 2. Single-Value Answers Only 37 | All Cypher queries must return exactly one value: 38 | - **Aggregations**: COUNT, SUM, AVG, MAX, MIN 39 | - **Property retrieval**: Single property from end node 40 | - **Existence verification**: Count of paths, shortest path length, count of connecting entities, or single connecting entity property 41 | - **Path metrics**: Shortest path length, hop count, connection count 42 | 43 | ### 3. Column Aliasing Requirements 44 | **MANDATORY**: All Cypher queries must include meaningful column aliases using the `AS` clause: 45 | 46 | **Aggregation Query Aliases:** 47 | - `COUNT(DISTINCT entity) AS total_count` 48 | - `COUNT(DISTINCT entity) AS num_entities` 49 | - `SUM(property) AS total_amount` 50 | - `AVG(property) AS average_value` 51 | - `MAX(property) AS highest_value` 52 | - `MIN(property) AS lowest_value` 53 | 54 | **Property Retrieval Aliases:** 55 | - `entity.property AS entity_property_name` 56 | - `m.title AS movie_title` 57 | - `p.name AS person_name` 58 | - `c.industry AS company_industry` 59 | - `city.name AS city_name` 60 | 61 | **Existence Verification Aliases:** 62 | - `COUNT(path) AS connection_count` 63 | - `COUNT(DISTINCT intermediate) AS bridge_count` 64 | - `LENGTH(path) AS path_length` 65 | - `COUNT(DISTINCT connecting_entity) AS connector_count` 66 | 67 | **Sorting Query Aliases:** 68 | - `MAX(property) AS highest_property` 69 | - `MIN(property) AS lowest_property` 70 | - `entity.property AS top_property` (when using ORDER BY with LIMIT 1) 71 | 72 | **Examples of Proper Aliasing:** 73 | ```cypher 74 | // Aggregation 75 | MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a:Person) 76 | RETURN COUNT(DISTINCT a) AS total_coactors 77 | 78 | 
// Property Retrieval 79 | MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) 80 | RETURN city.name AS workplace_city 81 | 82 | // Sorting 83 | MATCH (p:Person {name: 'Director'})-[:DIRECTED]->(m:Movie) 84 | WHERE m.imdbRating IS NOT NULL 85 | RETURN MAX(m.imdbRating) AS highest_rating 86 | 87 | // Existence Verification 88 | MATCH path = shortestPath((p1:Person {name: 'Actor1'})-[*]-(p2:Person {name: 'Actor2'})) 89 | RETURN LENGTH(path) AS connection_distance 90 | ``` 91 | 92 | ### 4. Cypher Syntax Requirements 93 | **IMPORTANT**: When using multiple relationship types in a single pattern, use the correct Cypher syntax: 94 | - **Correct**: `(node)-[:REL_TYPE1|REL_TYPE2]->(node)` 95 | - **Incorrect**: `(node)-[:REL_TYPE1|:REL_TYPE2]->(node)` 96 | 97 | Do NOT include the colon (`:`) before subsequent relationship types in the union pattern. For example: 98 | - ✅ `(person)-[:WROTE|DIRECTED]->(movie)` 99 | - ❌ `(person)-[:WROTE|:DIRECTED]->(movie)` 100 | 101 | ### 5. Generation Process 102 | 103 | ## Generation Process 104 | 105 | **Step 1: Schema and Path Analysis** 106 | - Parse the provided schema to identify all node types and relationships 107 | - Analyze the sample paths to understand real entity names and property values 108 | - Focus on 2-4 hop paths that represent realistic analytical questions 109 | 110 | **Step 2: Question Template Types** 111 | Create diverse question types using domain-appropriate vocabulary. **You have creative freedom to vary phrasing, structure, and vocabulary beyond these templates.** Adapt terminology to match your target domain (e.g., movies: "films", "actors", "directors"; business: "companies", "employees", "revenue"; social: "people", "friends", "connections"). 112 | 113 | **Aggregation Templates (COUNT, SUM, AVG, MIN, MAX):** 114 | - COUNT variations: "How many...", "What's the total number of...", "Count the...", "How many different...", "What's the count of..." 
115 | - SUM variations: "What's the total...", "Sum up the...", "What's the combined...", "Add up all the..." 116 | - AVG variations: "What's the average...", "What's the mean...", "On average, what's the...", "What's the typical..." 117 | - Domain-specific examples: 118 | - Movies: "How many films has [actor] appeared in with directors who worked with [other_actor]?" 119 | - Business: "What's the total revenue of companies that compete with [person]'s employer?" 120 | - Social: "How many friends does [person] have through their work connections?" 121 | 122 | **Property Retrieval Templates:** 123 | - Direct property access: "What's the [property] of...", "What [property] does... have?", "Which [property] belongs to..." 124 | - Relationship-based: "What [property] connects [entity1] to [entity2]?", "Through what [property] are they linked?" 125 | - Path-based: "What [property] can [entity] reach through [path]?" 126 | - Domain-specific examples: 127 | - Movies: "What's the rating of the highest-rated film starring actors who worked with [director]?" 128 | - Business: "What's the industry of companies that [person] can reach through their professional network?" 129 | - Social: "What city does [person] live in through their friendship connections?" 130 | 131 | **Sorting Templates (MAX/MIN with ORDER BY):** 132 | - Maximum: "What's the highest...", "Which has the maximum...", "What's the peak...", "What's the top..." 133 | - Minimum: "What's the lowest...", "Which has the minimum...", "What's the smallest...", "What's the bottom..." 134 | - Superlative forms: "What's the most...", "What's the least...", "Which is the greatest...", "Which is the smallest..." 135 | - Domain-specific examples: 136 | - Movies: "What's the highest budget of films directed by people who acted alongside [actor]?" 137 | - Business: "Which company has the lowest revenue among those connected to [person]'s network?" 
138 | - Social: "What's the oldest age of people [person] can reach through mutual friends?" 139 | 140 | **Existence Templates (Verification with Evidence):** 141 | Instead of simple boolean checks, existence queries should return **evidence of the connection**: 142 | - **Count-based verification**: "How many paths connect [entity1] to [entity2]?", "How many ways can [entity] reach [entity2]?" 143 | - **Shortest path length**: "What's the shortest path length between [entity1] and [entity2]?", "How many steps does it take to connect [entity1] to [entity2]?" 144 | - **Intermediate entity verification**: "Which [entity_type] connects [entity1] to [entity2]?", "What [entity_type] serves as a bridge between [entity1] and [entity2]?" 145 | - **Path existence with count**: "How many [intermediate_entities] link [entity1] to [entity2] through [relationship_chain]?" 146 | - Domain-specific examples: 147 | - Movies: "How many directors connect [actor1] to [actor2] through collaborations?" 148 | - Business: "What's the shortest path length between [person] and companies in the tech industry?" 149 | - Social: "How many mutual friends do [person1] and [person2] have?" 150 | 151 | **Cypher patterns for existence verification:** 152 | - Path counting: `RETURN COUNT(path) AS connection_count` or `RETURN COUNT(DISTINCT intermediate_entity) AS bridge_count` 153 | - Shortest path: `RETURN LENGTH(shortestPath(...)) AS path_length` 154 | - Intermediate entity: `RETURN intermediate_entity.property AS bridge_property LIMIT 1` 155 | - Connection verification: `RETURN COUNT(DISTINCT connecting_entity) AS connector_count` 156 | 157 | **Creative Variations Encouraged:** 158 | - **Vary sentence structure**: Use questions, statements, imperatives 159 | - **Use domain synonyms**: "films/movies", "actors/performers", "companies/firms", "people/individuals" 160 | - **Add context**: "In [person]'s network...", "Among [entity_type] connected to...", "Through [relationship_type] relationships..." 
161 | - **Use natural language**: "folks", "stuff", "things" (especially for noise injection) 162 | - **Experiment with phrasing**: "Tell me the...", "Find the...", "I need to know...", "Show me the..." 163 | 164 | **Step 3: Domain-Aware Entity Substitution** 165 | - **Use real entity names** from the provided sample paths 166 | - **Adapt vocabulary to domain**: Match the graph schema's domain (movies, business, social networks, etc.) 167 | - **Use domain-appropriate terminology**: 168 | - Movies: "films", "actors", "directors", "cast", "crew", "productions", "ratings", "box office" 169 | - Business: "companies", "employees", "executives", "revenue", "profits", "industries", "competitors" 170 | - Social: "people", "friends", "connections", "network", "relationships", "communities" 171 | - Academic: "researchers", "papers", "citations", "institutions", "publications", "collaborations" 172 | - **Ensure semantic consistency**: If schema uses "Person" nodes, questions can use "actors", "people", "individuals" interchangeably 173 | - **Validate realistic paths**: Ensure paths actually exist between chosen entities 174 | - **Check for meaningful results**: Validate queries return non-null, non-zero values when appropriate 175 | 176 | **Step 4: Sorting Query Requirements** 177 | When creating sorting queries, **always include WHERE clauses** to filter out null values: 178 | ```cypher 179 | WHERE entity.property IS NOT NULL 180 | ORDER BY entity.property DESC/ASC 181 | LIMIT 1 182 | ``` 183 | 184 | ## Output Requirements 185 | 186 | Generate exactly **25 question-answer pairs** based on the provided schema and sample paths. 
**Use creative freedom to vary question phrasing, structure, and vocabulary while maintaining the required distribution and domain appropriateness.** 187 | 188 | **Distribution Guidelines (Flexible Based on Domain):** 189 | - **Target Distribution**: Aim for roughly equal distribution across query types: 190 | - ~6-7 Aggregation queries (COUNT, SUM, AVG, MAX, MIN) 191 | - ~6-7 Property Retrieval queries (single property access) 192 | - ~6-7 Sorting queries (MAX/MIN with ORDER BY) 193 | - ~6-7 Existence Verification queries (path counts, shortest paths, connecting entities) 194 | 195 | - **Domain-Driven Flexibility**: **Do NOT force weird or non-semantic questions just to achieve exact distributions.** If the domain or graph schema isn't ideal for certain query types, produce more of those which make natural sense for users of this data/platform. For example: 196 | - Movie databases naturally lend themselves to aggregation queries about collaboration patterns 197 | - Social networks are ideal for existence verification and path-based queries 198 | - Business networks may have more property retrieval queries about company metrics 199 | - Academic citation networks favor aggregation and sorting queries 200 | 201 | - **Quality Over Rigid Distribution**: Prioritize creating realistic, meaningful questions that actual users would ask over maintaining exact 6-7-6-7 distribution. A distribution like 8-5-7-5 is perfectly acceptable if it results in more natural, semantically meaningful questions. 202 | 203 | **Creative Guidelines:** 204 | - **Vary question structure**: Mix interrogative, declarative, and imperative forms 205 | - **Use domain-appropriate vocabulary**: Match terminology to the graph schema's domain 206 | - **Experiment with phrasing**: Don't stick rigidly to templates - be creative! 207 | - **Include natural variations**: "What's the...", "Tell me the...", "Find the...", "I need to know..." 
208 | - **Add contextual phrases**: "In [entity]'s network...", "Among [entity_type] connected to...", "Through [relationship] relationships..." 209 | 210 | **Noise Injection (40% of questions):** 211 | Apply to exactly 10 questions (spread across different query types): 212 | - **Typos in named entities**: Minor misspellings of person/movie/company names: 213 | - "Tom Hanks" → "Tom Henks" 214 | - "Georges Méliès" → "George Melies" 215 | - "Harold Lloyd" → "Harold Loyd" 216 | - "Microsoft" → "Mircosoft" 217 | - "Google" → "Googel" 218 | - **Colloquialisms and informal language**: 219 | - "people" → "folks", "things" → "stuff", "movies" → "flicks", "picture" → "pic" 220 | - "company" → "firm", "biz", "corp" 221 | - "person" → "guy", "individual", "someone" 222 | - "work with" → "collaborate with", "team up with" 223 | - **Grammatical variations**: 224 | - "companies" → "companys", "actors" → "actorss" 225 | - "Who has worked" → "Who's worked", "What is" → "What's" 226 | - **Domain-specific slang**: 227 | - Movies: "blockbuster", "indie film", "A-lister", "supporting actor" 228 | - Business: "startup", "enterprise", "C-suite", "workforce" 229 | - Social: "buddy", "acquaintance", "circle", "network" 230 | 231 | **Vocabulary Matching Guidelines:** 232 | - **Analyze the domain** from node types and relationships in the schema 233 | - **Use appropriate synonyms** that match the domain context 234 | - **Maintain semantic consistency** throughout questions 235 | - **Avoid ambiguous pronouns** ("their", "his", "her") - always use specific entity names 236 | - **Match formality level** to the domain (academic vs. casual social networks) 237 | 238 | ### 6. 
Quality Assurance 239 | - Validate each Cypher query returns exactly one value 240 | - Ensure multi-hop paths are semantically meaningful 241 | - Test queries against your actual graph schema 242 | - Filter out queries that always return 0/null 243 | - **Verify all queries include meaningful column aliases** 244 | 245 | ### Complexity Field Requirements 246 | 247 | Add a "complexity" field to each output that counts the number of relationship traversals in the Cypher query: 248 | - **0-hop**: Queries that don't traverse any relationships (just node property access) 249 | - **1-hop**: Queries that traverse one relationship 250 | - **2-hop**: Queries that traverse two relationships 251 | - **3-hop**: Queries that traverse three relationships 252 | - **4-hop**: Queries that traverse four relationships 253 | - **5+ hop**: Queries that traverse five or more relationships 254 | 255 | **Complexity Calculation Examples:** 256 | - `(p:Person {name: 'Tom Hanks'}) RETURN p.name` = **0-hop** 257 | - `(p:Person)-[:ACTED_IN]->(m:Movie)` = **1-hop** 258 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)` = **2-hop** 259 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)` = **3-hop** 260 | - `(p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person)` = **4-hop** 261 | 262 | Count each relationship traversal (both `-[:REL]->` and `<-[:REL]-`) as one hop, regardless of direction. 263 | 264 | ## Output Format 265 | 266 | First, provide your analysis and reasoning in `` tags, then return exactly 25 JSON objects in the specified format. 267 | 268 | 269 | [Explain your analysis of the provided schema and sample paths, your approach to generating domain-appropriate questions, the distribution of query types you chose, and any specific considerations for the domain. Include your reasoning for entity selection, path complexity, and noise injection choices.] 
270 | 271 | 272 | ```json 273 | [ 274 | { 275 | "question": "How many actors appeared in films with directors who also worked with Tom Henks?", 276 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person) RETURN COUNT(DISTINCT a) AS total_coactors", 277 | "query_type": "Aggregation", 278 | "complexity": "4-hop", 279 | "noise_applied": true, 280 | "noise_type": "typo" 281 | }, 282 | { 283 | "question": "What's the peak imdbRating among films directed by people who performed alongside Georges Méliès?", 284 | "cypher": "MATCH (p:Person {name: 'Georges Méliès'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a:Person)-[:DIRECTED]->(m2:Movie) WHERE m2.imdbRating IS NOT NULL RETURN MAX(m2.imdbRating) AS highest_rating", 285 | "query_type": "Sorting", 286 | "complexity": "3-hop", 287 | "noise_applied": false 288 | }, 289 | { 290 | "question": "Tell me the title of the movie that connects these two actors through their collaborations?", 291 | "cypher": "MATCH (p1:Person {name: 'Actor1'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person {name: 'Actor2'}) RETURN m.title AS connecting_movie LIMIT 1", 292 | "query_type": "Property Retrieval", 293 | "complexity": "2-hop", 294 | "noise_applied": false 295 | }, 296 | { 297 | "question": "What's the name of Tom Hanks?", 298 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'}) RETURN p.name AS person_name", 299 | "query_type": "Property Retrieval", 300 | "complexity": "0-hop", 301 | "noise_applied": false 302 | }, 303 | { 304 | "question": "How many folks appeared in flicks with directors who also worked with Tom Hanks?", 305 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a:Person) RETURN COUNT(DISTINCT a) AS total_coactors", 306 | "query_type": "Aggregation", 307 | "complexity": "4-hop", 308 | "noise_applied": true, 309 | "noise_type": "colloquialism" 310 | }, 311 | { 312 
| "question": "What's the highest imdbRating among movies directed by George Melies?", 313 | "cypher": "MATCH (p:Person {name: 'Georges Méliès'})-[:DIRECTED]->(m:Movie) WHERE m.imdbRating IS NOT NULL RETURN MAX(m.imdbRating) AS highest_rating", 314 | "query_type": "Sorting", 315 | "complexity": "1-hop", 316 | "noise_applied": true, 317 | "noise_type": "typo" 318 | }, 319 | { 320 | "question": "How many directors connect Tom Hanks to actors who worked with Steven Spielberg?", 321 | "cypher": "MATCH (p1:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m1:Movie)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(p2:Person)-[:ACTED_IN]->(m3:Movie)<-[:DIRECTED]-(spielberg:Person {name: 'Steven Spielberg'}) RETURN COUNT(DISTINCT d) AS connecting_directors", 322 | "query_type": "Existence Verification", 323 | "complexity": "5-hop", 324 | "noise_applied": false 325 | }, 326 | { 327 | "question": "What's the shortest path length between Alice and any executive in the tech industry?", 328 | "cypher": "MATCH (p:Person {name: 'Alice'}), (exec:Person)-[:WORKS_AT]->(:Company {industry: 'Technology'}) WHERE exec.position CONTAINS 'executive' AND exec <> p MATCH path = shortestPath((p)-[*]-(exec)) RETURN MIN(LENGTH(path)) AS connection_distance", 329 | "query_type": "Existence Verification", 330 | "complexity": "3-hop", 331 | "noise_applied": false 332 | } 333 | ] 334 | ``` 335 | 336 | **Begin generation after receiving the schema and sample paths. Remember to be creative with your question phrasing while maintaining domain appropriateness and the required distribution of query types. Most importantly, ensure all Cypher queries include meaningful column aliases using the AS clause.** 337 | """ 338 | 339 | user_prompt = """ 340 | Graph schema: 341 | {schema} 342 | 343 | Sample paths: 344 | {paths} 345 | """ 346 | 347 | qa_system_prompt = """You are a helpful assistant that answers questions using data from a Neo4j database. 
348 | Given a natural language question, the Cypher query used to answer it, and the query result, return a 349 | concise and accurate answer based only on the result. If the Cypher query cannot provide sufficient information 350 | to answer the question, respond with "UNKNOWN" rather than making assumptions or providing information 351 | not found in the query results.""" 352 | 353 | qa_user_prompt = """Question: {question} 354 | Cypher Query: {cypher_query} 355 | Query Result: {result} 356 | 357 | Provide a concise answer to the question using only the query result. 358 | If the provided data isn't related to the question, answer 'UNKNOWN'. """ 359 | 360 | simple_system_prompt = """ 361 | # Text2Cypher Dataset Generation System Prompt 362 | 363 | You are a synthetic dataset generator for text2cypher evaluation. Your task is to create **25 question-answer pairs** where questions are in natural language and answers are Cypher queries that return **single values only** (not result sets). 364 | 365 | ## Input Format 366 | You will receive: 367 | 1. **Graph schema** in this format: 368 | ``` 369 | Node properties are the following: 370 | {NODE_TYPE} {property1: TYPE, property2: TYPE, ...} 371 | 372 | Relationship properties are the following: 373 | {RELATIONSHIP_TYPE} {property1: TYPE, property2: TYPE, ...} 374 | 375 | The relationships are the following: 376 | (:NodeType)-[:RELATIONSHIP_TYPE]->(:NodeType) 377 | ``` 378 | 379 | 2. **Sample paths** from the actual graph data showing real nodes and relationships: 380 | ``` 381 | [{'pathLength': 2, 382 | 'nodesInfo': [{'labels': ['Person'], 'props': {'name': 'Tom Hanks'}}, 383 | {'labels': ['Movie'], 'props': {'title': 'Forrest Gump', 'imdbRating': 8.8}}], 384 | 'relsInfo': [{'type': 'ACTED_IN', 'props': {'role': 'Forrest'}}], 385 | 'pathSignature': '(:Person)-[:ACTED_IN]->(:Movie)'}] 386 | ``` 387 | 388 | ## Core Requirements 389 | 390 | ### 1. 
Simple to Moderate Complexity Focus 391 | - Generate questions that require **0-2 relationship traversals** 392 | - Include **simple property lookups** (direct node property access) 393 | - Include **single-hop queries** (one relationship traversal) 394 | - Include **two-hop queries** (two relationship traversals) 395 | - Focus on commonly asked, straightforward questions that users would naturally pose 396 | 397 | ### 2. Single-Value Answers Only 398 | All Cypher queries must return exactly one value: 399 | - **Direct property access**: Single property from a specific node 400 | - **Aggregations**: COUNT, SUM, AVG, MAX, MIN (over direct relationships or simple patterns) 401 | - **Property retrieval**: Single property from nodes reached via 1-2 hops 402 | - **Existence verification**: Count of direct relationships, simple path existence, or basic connection counts 403 | 404 | ### 3. Column Aliasing Requirements 405 | **MANDATORY**: All Cypher queries must include meaningful column aliases using the `AS` clause: 406 | 407 | **Direct Property Access Aliases:** 408 | - `entity.property AS entity_property_name` 409 | - `p.name AS person_name` 410 | - `m.title AS movie_title` 411 | - `c.industry AS company_industry` 412 | - `m.imdbRating AS movie_rating` 413 | 414 | **Aggregation Query Aliases:** 415 | - `COUNT(entity) AS total_count` 416 | - `COUNT(DISTINCT entity) AS unique_count` 417 | - `SUM(property) AS total_amount` 418 | - `AVG(property) AS average_value` 419 | - `MAX(property) AS highest_value` 420 | - `MIN(property) AS lowest_value` 421 | 422 | **Simple Path Aliases:** 423 | - `related_entity.property AS related_property` 424 | - `connected_node.name AS connected_name` 425 | - `COUNT(relationship) AS connection_count` 426 | 427 | **Examples of Proper Aliasing:** 428 | ```cypher 429 | // Direct Property Access 430 | MATCH (p:Person {name: 'Tom Hanks'}) 431 | RETURN p.age AS person_age 432 | 433 | // Single-hop Aggregation 434 | MATCH (p:Person {name: 'Tom 
Hanks'})-[:ACTED_IN]->(m:Movie) 435 | RETURN COUNT(m) AS total_movies 436 | 437 | // Two-hop Property Retrieval 438 | MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) 439 | RETURN city.name AS workplace_city 440 | 441 | // Single-hop Aggregation (MAX with null filter) 442 | MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) 443 | WHERE m.imdbRating IS NOT NULL 444 | RETURN MAX(m.imdbRating) AS highest_rating 445 | ``` 446 | 447 | ### 4. Cypher Syntax Requirements 448 | **IMPORTANT**: When using multiple relationship types in a single pattern, use the correct Cypher syntax: 449 | - **Correct**: `(node)-[:REL_TYPE1|REL_TYPE2]->(node)` 450 | - **Incorrect**: `(node)-[:REL_TYPE1|:REL_TYPE2]->(node)` 451 | 452 | Do NOT include the colon (`:`) before subsequent relationship types in the union pattern. 453 | 454 | ### 5. Generation Process 455 | 458 | **Step 1: Schema and Path Analysis** 459 | - Parse the provided schema to identify all node types and relationships 460 | - Analyze the sample paths to understand real entity names and property values 461 | - Focus on 0-2 hop paths that represent basic, commonly asked questions 462 | 463 | **Step 2: Question Template Types** 464 | Create diverse question types using domain-appropriate vocabulary. **You have creative freedom to vary phrasing, structure, and vocabulary beyond these templates.** Adapt terminology to match your target domain. 465 | 466 | **Direct Property Access Templates:** 467 | - "What's the [property] of [entity]?" 468 | - "What [property] does [entity] have?" 469 | - "Tell me [entity]'s [property]" 470 | - "What is [entity]'s [property]?" 471 | - Domain-specific examples: 472 | - Movies: "What's Tom Hanks' age?", "What's the rating of Forrest Gump?" 473 | - Business: "What's Microsoft's industry?", "What's Alice's position?" 474 | - Social: "What's John's age?", "What city does Sarah live in?" 
475 | 476 | **Single-hop Aggregation Templates:** 477 | - "How many [related_entities] does [entity] have?" 478 | - "What's the total number of [related_entities] for [entity]?" 479 | - "Count [entity]'s [related_entities]" 480 | - "What's the average [property] of [entity]'s [related_entities]?" 481 | - Domain-specific examples: 482 | - Movies: "How many movies has Tom Hanks acted in?", "What's the average rating of Steven Spielberg's films?" 483 | - Business: "How many employees does Microsoft have?", "What's the total revenue of Google's subsidiaries?" 484 | - Social: "How many friends does Alice have?", "What's the average age of John's connections?" 485 | 486 | **Single-hop Property Retrieval Templates:** 487 | - "What's the [property] of [entity]'s [related_entity]?" 488 | - "What [property] does [entity] work for/belong to/connect to?" 489 | - "Tell me the [property] of [entity]'s [relationship]" 490 | - Domain-specific examples: 491 | - Movies: "What's the title of Tom Hanks' highest-rated movie?", "What genre is this actor's latest film?" 492 | - Business: "What's the industry of Alice's company?", "What's the revenue of John's employer?" 493 | - Social: "What's the name of Sarah's best friend?", "What city does Alice's friend live in?" 494 | 495 | **Two-hop Property Access Templates:** 496 | - "What's the [property] that [entity] can reach through [intermediate]?" 497 | - "What [property] connects [entity] via [path]?" 498 | - "Through [relationship], what [property] does [entity] access?" 499 | - Domain-specific examples: 500 | - Movies: "What studio distributed Tom Hanks' director's latest film?", "What's the budget of movies made by actors who worked with Steven Spielberg?" 501 | - Business: "What city is Alice's company located in?", "What industry do John's colleagues work in?" 502 | - Social: "What company do Alice's friends work for?", "What city do John's connections live in?" 
503 | 504 | **Simple Existence/Count Templates:** 505 | - "How many [entities] are connected to [entity]?" 506 | - "Does [entity] have any [related_entities]?" 507 | - "How many [intermediate_entities] connect [entity1] to [entity2]?" 508 | - "What's the count of [entity]'s [direct_relationships]?" 509 | - Domain-specific examples: 510 | - Movies: "How many co-actors does Tom Hanks have?", "How many directors has this actor worked with?" 511 | - Business: "How many companies is Alice connected to?", "How many colleagues work in the same department?" 512 | - Social: "How many mutual friends do Alice and Bob have?", "How many people live in the same city as John?" 513 | 514 | **Creative Variations Encouraged:** 515 | - **Vary sentence structure**: Use questions, statements, imperatives 516 | - **Use domain synonyms**: "films/movies", "actors/performers", "companies/firms", "people/individuals" 517 | - **Add context**: "In [entity]'s profile...", "For [entity]...", "About [entity]..." 518 | - **Use natural language**: "folks", "stuff", "things" (especially for noise injection) 519 | - **Experiment with phrasing**: "Tell me...", "Find...", "I need to know...", "Show me..." 520 | 521 | **Step 3: Domain-Aware Entity Substitution** 522 | - **Use real entity names** from the provided sample paths 523 | - **Adapt vocabulary to domain**: Match the graph schema's domain (movies, business, social networks, etc.) 
524 | - **Use domain-appropriate terminology**: 525 | - Movies: "films", "actors", "directors", "cast", "crew", "productions", "ratings", "box office" 526 | - Business: "companies", "employees", "executives", "revenue", "profits", "industries", "competitors" 527 | - Social: "people", "friends", "connections", "network", "relationships", "communities" 528 | - Academic: "researchers", "papers", "citations", "institutions", "publications", "collaborations" 529 | - **Ensure semantic consistency**: If schema uses "Person" nodes, questions can use "actors", "people", "individuals" interchangeably 530 | - **Validate realistic paths**: Ensure paths actually exist between chosen entities 531 | - **Check for meaningful results**: Validate queries return non-null, non-zero values when appropriate 532 | 533 | **Step 4: Sorting Query Requirements** 534 | When creating sorting queries, **always include WHERE clauses** to filter out null values: 535 | ```cypher 536 | WHERE entity.property IS NOT NULL 537 | ORDER BY entity.property DESC/ASC 538 | LIMIT 1 539 | ``` 540 | 541 | ## Output Requirements 542 | 543 | Generate exactly **25 question-answer pairs** based on the provided schema and sample paths. 
**Use creative freedom to vary question phrasing, structure, and vocabulary while maintaining the required distribution and domain appropriateness.** 544 | 545 | **Distribution Guidelines (Flexible Based on Domain):** 546 | - **Target Distribution**: Aim for roughly equal distribution across complexity levels: 547 | - ~8 Direct Property Access queries (0 hops - simple property lookups) 548 | - ~8 Single-hop queries (1 relationship traversal) 549 | - ~9 Two-hop queries (2 relationship traversals) 550 | 551 | - **Query Type Breakdown Within Each Complexity Level:** 552 | - **Direct Property Access**: Simple property retrieval, basic node attributes 553 | - **Single-hop**: Aggregations over direct relationships, property access via one relationship 554 | - **Two-hop**: Property access via two relationships, simple multi-step aggregations 555 | 556 | - **Domain-Driven Flexibility**: **Do NOT force weird or non-semantic questions just to achieve exact distributions.** If the domain or graph schema isn't ideal for certain complexity levels, produce more of those which make natural sense for users of this data/platform. For example: 557 | - Movie databases may have more direct property queries about ratings and titles 558 | - Social networks are ideal for single-hop friendship and connection queries 559 | - Business networks may have more two-hop queries about company locations and industries 560 | - Academic citation networks favor aggregation queries about publications 561 | 562 | - **Quality Over Rigid Distribution**: Prioritize creating realistic, meaningful questions that actual users would ask over maintaining exact 8-8-9 distribution. A distribution like 10-8-7 or 7-9-9 is perfectly acceptable if it results in more natural, semantically meaningful questions. 
563 | 564 | **Creative Guidelines:** 565 | - **Vary question structure**: Mix interrogative, declarative, and imperative forms 566 | - **Use domain-appropriate vocabulary**: Match terminology to the graph schema's domain 567 | - **Experiment with phrasing**: Don't stick rigidly to templates - be creative! 568 | - **Include natural variations**: "What's the...", "Tell me the...", "Find the...", "I need to know..." 569 | - **Add contextual phrases**: "For [entity]...", "About [entity]...", "Regarding [entity]..." 570 | 571 | **Noise Injection (40% of questions):** 572 | Apply to exactly 10 questions (spread across different complexity levels). Each arrow below maps the clean form to the noisy form that should appear in the question: 573 | - **Typos in named entities**: Minor misspellings of person/movie/company names: 574 | - "Tom Hanks" → "Tom Henks" 575 | - "Georges Méliès" → "George Melies" 576 | - "Harold Lloyd" → "Harold Loyd" 577 | - "Microsoft" → "Mircosoft" 578 | - "Google" → "Googel" 579 | - **Colloquialisms and informal language**: 580 | - "people" → "folks", "things" → "stuff", "movies" → "flicks", "picture" → "pic" 581 | - "company" → "firm", "biz", "corp" 582 | - "person" → "guy", "individual", "someone" 583 | - "work with" → "collaborate with", "team up with" 584 | - **Grammatical variations**: 585 | - "companies" → "companys", "actors" → "actorss" 586 | - "Who has worked" → "Who's worked", "What is" → "What's" 587 | - **Domain-specific slang**: 588 | - Movies: "blockbuster", "indie film", "A-lister", "supporting actor" 589 | - Business: "startup", "enterprise", "C-suite", "workforce" 590 | - Social: "buddy", "acquaintance", "circle", "network" 591 | 592 | **Vocabulary Matching Guidelines:** 593 | - **Analyze the domain** from node types and relationships in the schema 594 | - **Use appropriate synonyms** that match the domain context 595 | - **Maintain semantic consistency** throughout questions 596 | - **Avoid ambiguous pronouns** ("their", "his", "her") - always use specific entity names 597 | - **Match formality level** to the domain 
(academic vs. casual social networks) 598 | 599 | ### 6. Quality Assurance 600 | - Validate each Cypher query returns exactly one value 601 | - Ensure paths are semantically meaningful and commonly queried 602 | - Test queries against your actual graph schema 603 | - Filter out queries that always return 0/null 604 | - **Verify all queries include meaningful column aliases** 605 | - **Prioritize realistic, commonly asked questions over complex edge cases** 606 | 607 | ## Output Format 608 | 609 | First, provide your analysis and reasoning in `` tags, then return exactly 25 JSON objects in the specified format. 610 | 611 | 612 | [Explain your analysis of the provided schema and sample paths, your approach to generating domain-appropriate questions across different complexity levels (0-2 hops), the distribution of query types you chose, and any specific considerations for the domain. Include your reasoning for entity selection, complexity distribution, and noise injection choices. Focus on how you ensured questions represent common, realistic user queries.] 
613 | 614 | 615 | ```json 616 | [ 617 | { 618 | "question": "What's Tom Henks' age?", 619 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'}) RETURN p.age AS person_age", 620 | "query_type": "Direct Property Access", 621 | "complexity": "0-hop", 622 | "noise_applied": true, 623 | "noise_type": "typo" 624 | }, 625 | { 626 | "question": "How many movies has Tom Hanks acted in?", 627 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN COUNT(m) AS total_movies", 628 | "query_type": "Single-hop Aggregation", 629 | "complexity": "1-hop", 630 | "noise_applied": false 631 | }, 632 | { 633 | "question": "What's the industry of Alice's company?", 634 | "cypher": "MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company) RETURN c.industry AS company_industry", 635 | "query_type": "Single-hop Property Retrieval", 636 | "complexity": "1-hop", 637 | "noise_applied": false 638 | }, 639 | { 640 | "question": "What city is Alice's workplace located in?", 641 | "cypher": "MATCH (p:Person {name: 'Alice'})-[:WORKS_AT]->(c:Company)-[:LOCATED_IN]->(city:City) RETURN city.name AS workplace_city", 642 | "query_type": "Two-hop Property Retrieval", 643 | "complexity": "2-hop", 644 | "noise_applied": false 645 | }, 646 | { 647 | "question": "What's the rating of Forrest Gump?", 648 | "cypher": "MATCH (m:Movie {title: 'Forrest Gump'}) RETURN m.imdbRating AS movie_rating", 649 | "query_type": "Direct Property Access", 650 | "complexity": "0-hop", 651 | "noise_applied": false 652 | }, 653 | { 654 | "question": "How many flicks has Tom Hanks starred in?", 655 | "cypher": "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN COUNT(m) AS total_movies", 656 | "query_type": "Single-hop Aggregation", 657 | "complexity": "1-hop", 658 | "noise_applied": true, 659 | "noise_type": "colloquialism" 660 | }, 661 | { 662 | "question": "What's the highest rating among Steven Spielberg's films?", 663 | "cypher": "MATCH (p:Person {name: 'Steven 
Spielberg'})-[:DIRECTED]->(m:Movie) WHERE m.imdbRating IS NOT NULL RETURN MAX(m.imdbRating) AS highest_rating", 664 | "query_type": "Single-hop Aggregation", 665 | "complexity": "1-hop", 666 | "noise_applied": false 667 | } 668 | ] 669 | ``` 670 | 671 | **Begin generation after receiving the schema and sample paths. Remember to be creative with your question phrasing while maintaining domain appropriateness and focusing on simple, commonly asked questions. Most importantly, ensure all Cypher queries include meaningful column aliases using the AS clause.** 672 | """ 673 | --------------------------------------------------------------------------------