├── README.md
├── data
    ├── cake.json
    └── schema.json
└── jsonQueryRAG.ipynb

/README.md:
--------------------------------------------------------------------------------
# jsonQueryRAG
LLM query engine to retrieve augmented responses from JSON files.

# JSON Query RAG (jsonQueryRAG)

This repository contains a Colab notebook that demonstrates how to set up a system to query JSON data using natural language and obtain responses. The implementation utilizes the `jsonpath-ng`, `llama-index`, `openai`, `transformers`, and `accelerate` libraries.

## Getting Started

1. Open the notebook in Google Colab by clicking the badge below:
   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mickymult/jsonQueryRAG/blob/main/jsonQueryRAG.ipynb)

2. Install the necessary libraries by running the first few cells in the notebook.

## Usage

- The notebook contains cells for installing necessary libraries, setting up logging, specifying your OpenAI API key, and importing required modules.
- It also includes cells for specifying the paths to your JSON and schema files, reading these files, and setting up the JSON query engines.
- To query the JSON data, modify the query text in the cell with `nl_query_engine.query()` and run the cell.

## Example Query

The notebook includes an example query that asks for the types of toppings available for a donut cake, and displays the natural language response.

```python
nl_response = nl_query_engine.query(
    "what type of toppings are available for donut cake?",
)
display(Markdown(f"<h1>Natural language Response</h1><br><b>{nl_response}</b>"))
```

The expected output for this query is:

```
The available toppings for the donut cake are None, Glazed, Sugar, Powdered Sugar, Chocolate with Sprinkles, Chocolate, and Maple.
```

## Dependencies

- jsonpath-ng
- llama-index
- openai
- transformers
- accelerate

## License

MIT

--------------------------------------------------------------------------------
/data/cake.json:
--------------------------------------------------------------------------------
{
  "items": {
    "item": [
      {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" },
            { "id": "1002", "type": "Chocolate" },
            { "id": "1003", "type": "Blueberry" },
            { "id": "1004", "type": "Devil's Food" }
          ]
        },
        "topping": [
          { "id": "5001", "type": "None" },
          { "id": "5002", "type": "Glazed" },
          { "id": "5005", "type": "Sugar" },
          { "id": "5007", "type": "Powdered Sugar" },
          { "id": "5006", "type": "Chocolate with Sprinkles" },
          { "id": "5003", "type": "Chocolate" },
          { "id": "5004", "type": "Maple" }
        ]
      },
      {
        "id": "0002",
        "type": "donut",
        "name": "Raised",
        "ppu": 0.55,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" }
          ]
        },
        "topping": [
          { "id": "5001", "type": "None" },
          { "id": "5002", "type": "Glazed" },
          { "id": "5005", "type": "Sugar" },
          { "id": "5003", "type": "Chocolate" },
          { "id": "5004", "type": "Maple" }
        ]
      },
      {
        "id": "0003",
        "type": "donut",
        "name": "Old Fashioned",
        "ppu": 0.55,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" },
            { "id": "1002", "type": "Chocolate" }
          ]
        },
        "topping": [
          { "id": "5001", "type": "None" },
          { "id": "5002", "type": "Glazed" },
          { "id": "5003", "type": "Chocolate" },
          { "id": "5004", "type": "Maple" }
        ]
      },
      {
        "id": "0004",
        "type": "bar",
        "name": "Bar",
        "ppu": 0.75,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" }
          ]
        },
        "topping": [
          { "id": "5003", "type": "Chocolate" },
          { "id": "5004", "type": "Maple" }
        ],
        "fillings": {
          "filling": [
            { "id": "7001", "name": "None", "addcost": 0 },
            { "id": "7002", "name": "Custard", "addcost": 0.25 },
            { "id": "7003", "name": "Whipped Cream", "addcost": 0.25 }
          ]
        }
      },
      {
        "id": "0005",
        "type": "twist",
        "name": "Twist",
        "ppu": 0.65,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" }
          ]
        },
        "topping": [
          { "id": "5002", "type": "Glazed" },
          { "id": "5005", "type": "Sugar" }
        ]
      },
      {
        "id": "0006",
        "type": "filled",
        "name": "Filled",
        "ppu": 0.75,
        "batters": {
          "batter": [
            { "id": "1001", "type": "Regular" }
          ]
        },
        "topping": [
          { "id": "5002", "type": "Glazed" },
          { "id": "5007", "type": "Powdered Sugar" },
          { "id": "5003", "type": "Chocolate" },
          { "id": "5004", "type": "Maple" }
        ],
        "fillings": {
          "filling": [
            { "id": "7002", "name": "Custard", "addcost": 0 },
            { "id": "7003", "name": "Whipped Cream", "addcost": 0 },
            { "id": "7004", "name": "Strawberry Jelly", "addcost": 0 },
            { "id": "7005", "name": "Rasberry Jelly", "addcost": 0 }
          ]
        }
      }
    ]
  }
}
--------------------------------------------------------------------------------
/data/schema.json:
--------------------------------------------------------------------------------
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "items": {
      "type": "object",
      "properties": {
        "item": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "id": { "type": "string" },
              "type": { "type": "string" },
              "name": { "type": "string" },
              "ppu": { "type": "number" },
              "batters": {
                "type": "object",
                "properties": {
                  "batter": {
                    "type": "array",
                    "items": {
                      "type": "object",
                      "properties": {
                        "id": { "type": "string" },
                        "type": { "type": "string" }
                      },
                      "required": ["id", "type"]
                    }
                  }
                },
                "required": ["batter"]
              },
              "topping": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "id": { "type": "string" },
                    "type": { "type": "string" }
                  },
                  "required": ["id", "type"]
                }
              },
              "fillings": {
                "type": "object",
                "properties": {
                  "filling": {
                    "type": "array",
                    "items": {
                      "type": "object",
                      "properties": {
                        "id": { "type": "string" },
                        "name": { "type": "string" },
                        "addcost": { "type": "number" }
                      },
                      "required": ["id", "name", "addcost"]
                    }
                  }
                },
                "required": ["filling"]
              }
            },
            "required": ["id", "type", "name", "ppu", "batters", "topping"],
            "dependencies": {
              "fillings": ["type"]
            }
          }
        }
      },
      "required": ["item"]
    }
  },
  "required": ["items"]
}
--------------------------------------------------------------------------------
/jsonQueryRAG.ipynb:
--------------------------------------------------------------------------------
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyObK/A4SR5nUAdNmM5Vzppe",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/mickymult/jsonQueryRAG/blob/main/jsonQueryRAG.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eiU7Mfixy3YM",
        "outputId": "8e08ebd3-a23b-4ebb-8515-a58fc26ca14f"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting jsonpath-ng\n",
            " Downloading jsonpath_ng-1.6.0-py3-none-any.whl (29 kB)\n",
            "Collecting ply (from jsonpath-ng)\n",
            " Downloading ply-3.11-py2.py3-none-any.whl (49 kB)\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.6/49.6 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hInstalling collected packages: ply, jsonpath-ng\n",
            "Successfully installed jsonpath-ng-1.6.0 ply-3.11\n"
          ]
        }
      ],
      "source": [
        "# First, install the jsonpath-ng package which is used by default to parse & execute the JSONPath queries.\n",
        "!pip install jsonpath-ng"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -q llama-index\n",
        "!pip install -q openai\n",
        "!pip install -q transformers\n",
        "!pip install -q accelerate"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JM6FUfFCznad",
        "outputId": "67cfb59c-71e3-4cdf-9f67-ade2ac70d404"
      },
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m744.1/744.1 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.8/1.8 MB\u001b[0m \u001b[31m13.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.0/77.0 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.0/2.0 MB\u001b[0m \u001b[31m17.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m143.4/143.4 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.4/49.4 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.0/40.0 kB\u001b[0m \u001b[31m4.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.7/7.7 MB\u001b[0m \u001b[31m21.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m302.0/302.0 kB\u001b[0m \u001b[31m25.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.8/3.8 MB\u001b[0m \u001b[31m57.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m56.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m31.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m258.1/258.1 kB\u001b[0m \u001b[31m5.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import logging\n",
        "import sys\n",
        "\n",
        "logging.basicConfig(stream=sys.stdout, level=logging.INFO)\n",
        "logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))"
      ],
      "metadata": {
        "id": "uzZ0RuBmzBtv"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "import openai\n",
        "\n",
        "os.environ[\"OPENAI_API_KEY\"] = \"YOUR_OPENAI_KEY_HERE\"\n",
        "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
      ],
      "metadata": {
        "id": "O4QWDOUCzIvY"
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from IPython.display import Markdown, display"
      ],
      "metadata": {
        "id": "Q79X4l05z1bb"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import json\n",
        "\n",
        "# Specify the folders containing the JSON and schema files\n",
        "json_folder = 'data'\n",
        "schema_folder = 'data'"
      ],
      "metadata": {
        "id": "QE4kB3380Gte"
      },
      "execution_count": 19,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Specify the filenames of the JSON and schema files\n",
        "json_filename = 'cake.json'\n",
        "schema_filename = 'schema.json'"
      ],
      "metadata": {
        "id": "ciWksToR0as8"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Construct the paths to the JSON and schema files\n",
        "json_filepath = os.path.join(json_folder, json_filename)\n",
        "schema_filepath = os.path.join(schema_folder, schema_filename)"
      ],
      "metadata": {
        "id": "NGlR4WRk0iHb"
      },
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Read the JSON file\n",
        "with open(json_filepath, 'r') as json_file:\n",
        "    json_value = json.load(json_file)"
      ],
      "metadata": {
        "id": "Ctz7hB4d0mNw"
      },
      "execution_count": 22,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Read the schema file\n",
        "with open(schema_filepath, 'r') as schema_file:\n",
        "    json_schema = json.load(schema_file)"
      ],
      "metadata": {
        "id": "VwsjqQFg0pKL"
      },
      "execution_count": 23,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "from llama_index.indices.service_context import ServiceContext\n",
        "from llama_index.llms import OpenAI\n",
        "from llama_index.indices.struct_store import JSONQueryEngine\n",
        "\n",
        "llm = OpenAI(model=\"gpt-4\")\n",
        "service_context = ServiceContext.from_defaults(llm=llm)\n",
        "nl_query_engine = JSONQueryEngine(\n",
        "    json_value=json_value, json_schema=json_schema, service_context=service_context\n",
        ")\n",
        "raw_query_engine = JSONQueryEngine(\n",
        "    json_value=json_value,\n",
        "    json_schema=json_schema,\n",
        "    service_context=service_context,\n",
        "    synthesize_response=False,\n",
        ")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zw6X6Y3M0xgI",
        "outputId": "ce72282a-92ad-4b79-e9a6-208efbb0d655"
      },
      "execution_count": 24,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "[nltk_data] Downloading package punkt to /tmp/llama_index...\n",
            "[nltk_data] Unzipping tokenizers/punkt.zip.\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "nl_response = nl_query_engine.query(\n",
        "    \"what type of toppings are available for donut cake?\",\n",
        ")\n"
      ],
      "metadata": {
        "id": "2kCk5rae1ABk"
      },
      "execution_count": 25,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "display(Markdown(f\"<h1>Natural language Response</h1><br><b>{nl_response}</b>\"))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 109
        },
        "id": "ji5joIz71PBZ",
        "outputId": "f33799c0-ae07-45f3-ed0d-375aee7d484c"
      },
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.Markdown object>"
            ],
            "text/markdown": "<h1>Natural language Response</h1><br><b>The available toppings for the donut cake are None, Glazed, Sugar, Powdered Sugar, Chocolate with Sprinkles, Chocolate, and Maple.</b>"
          },
          "metadata": {}
        }
      ]
    }
  ]
}
--------------------------------------------------------------------------------
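
The README's example answer can be cross-checked against `data/cake.json` without calling an LLM. The following is a minimal sketch and is not part of the repository: the helper name `toppings_for` and the inline `CAKE_JSON_SUBSET` string (a trimmed copy of the first two items, with `ppu` and `batters` omitted) are assumptions made for illustration. It uses only the standard library, so it does not reproduce the JSONPath query that the notebook's query engine generates.

```python
import json

# Trimmed copy of the first two items of data/cake.json (illustrative subset only).
CAKE_JSON_SUBSET = """
{
  "items": {
    "item": [
      {
        "id": "0001", "type": "donut", "name": "Cake",
        "topping": [
          {"id": "5001", "type": "None"},
          {"id": "5002", "type": "Glazed"},
          {"id": "5005", "type": "Sugar"},
          {"id": "5007", "type": "Powdered Sugar"},
          {"id": "5006", "type": "Chocolate with Sprinkles"},
          {"id": "5003", "type": "Chocolate"},
          {"id": "5004", "type": "Maple"}
        ]
      },
      {
        "id": "0002", "type": "donut", "name": "Raised",
        "topping": [
          {"id": "5001", "type": "None"},
          {"id": "5002", "type": "Glazed"},
          {"id": "5005", "type": "Sugar"},
          {"id": "5003", "type": "Chocolate"},
          {"id": "5004", "type": "Maple"}
        ]
      }
    ]
  }
}
"""


def toppings_for(data, item_type, item_name):
    """Return the topping type names of every item matching the given type and name."""
    return [
        topping["type"]
        for item in data["items"]["item"]
        if item["type"] == item_type and item["name"] == item_name
        for topping in item["topping"]
    ]


data = json.loads(CAKE_JSON_SUBSET)
cake_toppings = toppings_for(data, "donut", "Cake")
print(cake_toppings)
```

For the item with type `donut` and name `Cake`, the resulting list matches the seven toppings named in the natural language response above. To run the same check against the full file, replace `json.loads(CAKE_JSON_SUBSET)` with `json.load` on `data/cake.json`.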