├── readme.md └── text2cql_astradb_demo.ipynb /readme.md: -------------------------------------------------------------------------------- 1 | # README 2 | 3 | Text-to-CQL demo with Astra DB. 4 | 5 | Run it in [Colab](https://colab.research.google.com/github/datastaxdevs/demo-astradb-text2cql/blob/main/text2cql_astradb_demo.ipynb). 6 | 7 | Alternatively, you can run it locally after installing Jupyter. Requires Python 3.9+. 8 | 9 | Visit the [documentation page](https://docs.datastax.com/en/astra-db-serverless/get-started/examples.html). 10 | -------------------------------------------------------------------------------- /text2cql_astradb_demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b69b5b27-f4dc-439b-8ae5-b3f776fa3745", 6 | "metadata": { 7 | "id": "b69b5b27-f4dc-439b-8ae5-b3f776fa3745" 8 | }, 9 | "source": [ 10 | "# Using LLMs to Generate CQL" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "107770b0-abda-43ee-af5d-eae2100b65ad", 16 | "metadata": { 17 | "id": "107770b0-abda-43ee-af5d-eae2100b65ad" 18 | }, 19 | "source": [ 20 | "This demo shows how to use LLMs to generate CQL (Cassandra Query Language) queries as part of a RAG pipeline.\n", 21 | "\n", 22 | "Starting from a natural-language question:\n", 23 | "- an LLM is asked to produce a CQL query;\n", 24 | "- the query is then executed on the database;\n", 25 | "- the resulting rows are presented to the LLM within a prompt instructing it to provide a final answer;\n", 26 | "- that answer is the final output, completing a RAG process enriched by CQL generation.\n", 27 | "\n", 28 | "Each question-answering run therefore uses an LLM in two distinct phases.\n", 29 | "\n", 30 | "The database is an [Astra DB](https://docs.datastax.com/en/astra-db-serverless/index.html) instance, populated with fictitious sample data. 
Astra DB can be used via its Data API or, as in this case, using the CQL protocol.\n", 31 | "\n", 32 | "This example is derived from the [SQL-PaLM](https://arxiv.org/abs/2306.00739) paper. The paper describes the general strategy for SQL: first show the LLM the DB schema in a standardized format, then ask it to produce a query that answers the user's question." 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "id": "1ffeccb0-d70c-4f15-b924-6e8cd7a5b30e", 38 | "metadata": { 39 | "id": "1ffeccb0-d70c-4f15-b924-6e8cd7a5b30e" 40 | }, 41 | "source": [ 42 | "## Setup" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "b71640c3-3495-4459-837e-08d6e80ed410", 48 | "metadata": { 49 | "id": "b71640c3-3495-4459-837e-08d6e80ed410" 50 | }, 51 | "source": [ 52 | "#### Requirements" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 1, 58 | "id": "b43e0d9c-c760-485f-877b-df577f2cfacf", 59 | "metadata": { 60 | "id": "b43e0d9c-c760-485f-877b-df577f2cfacf" 61 | }, 62 | "outputs": [ 63 | { 64 | "name": "stdout", 65 | "output_type": "stream", 66 | "text": [ 67 | "\n", 68 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", 69 | "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "# Install requirements, if not already installed\n", 75 | "!pip install -q \"openai>=1.73\" \"cassio>=0.1.10\"" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 2, 81 | "id": "2288af41-77a4-4163-8753-5c7cd93aeabb", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "import os\n", 86 | "from getpass import getpass\n", 87 | "\n", 88 | "import cassio\n", 89 | "import openai" 90 | ] 91 | }, 92 | { 
93 | "cell_type": "markdown", 94 | "id": "ae7063e7-80c4-49ae-84f1-3345aa3bbef1", 95 | "metadata": { 96 | "id": "ae7063e7-80c4-49ae-84f1-3345aa3bbef1" 97 | }, 98 | "source": [ 99 | "#### Connect to Services" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 3, 105 | "id": "79e94c9e-3c81-4170-bd37-6326b3875c60", 106 | "metadata": {}, 107 | "outputs": [ 108 | { 109 | "name": "stdin", 110 | "output_type": "stream", 111 | "text": [ 112 | "OpenAI API Key: ········\n" 113 | ] 114 | } 115 | ], 116 | "source": [ 117 | "# OpenAI secrets\n", 118 | "if \"OPENAI_API_KEY\" not in os.environ:\n", 119 | " os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API Key: \")" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 4, 125 | "id": "350ffbca-6b42-44dd-95cd-768f2c15e027", 126 | "metadata": {}, 127 | "outputs": [ 128 | { 129 | "name": "stdin", 130 | "output_type": "stream", 131 | "text": [ 132 | "Astra DB Token: ········\n", 133 | "Astra DB API endpoint: https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com\n", 134 | "Astra DB Keyspace (empty for default): \n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "# Secrets and connection parameters for Astra DB\n", 140 | "if \"ASTRA_DB_APPLICATION_TOKEN\" not in os.environ:\n", 141 | " os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"] = getpass(\"Astra DB Token: \").strip()\n", 142 | "\n", 143 | "if \"ASTRA_DB_API_ENDPOINT\" not in os.environ:\n", 144 | " os.environ[\"ASTRA_DB_API_ENDPOINT\"] = input(\"Astra DB API endpoint: \").strip()\n", 145 | "\n", 146 | "if not os.getenv(\"ASTRA_DB_KEYSPACE\"):\n", 147 | " os.environ[\"ASTRA_DB_KEYSPACE\"] = input(\"Astra DB Keyspace (empty for default): \").strip()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 5, 153 | "id": "9767fc7a-a9d8-4e89-b32c-805243700348", 154 | "metadata": { 155 | "colab": { 156 | "base_uri": "https://localhost:8080/" 157 | }, 158 | "id": 
"9767fc7a-a9d8-4e89-b32c-805243700348", 159 | "outputId": "c6d2e944-ea84-4f1a-d551-6a3583c5f5a4" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "# Initialize the OpenAI Client\n", 164 | "client = openai.OpenAI()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 6, 170 | "id": "13722daf-bc68-4fef-93b8-a18d17482948", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "Connected to Astra DB with session=, keyspace=default_keyspace.\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "# Connect to Astra DB with a CQL session (the database ID is the 36-character UUID that follows the https:// prefix of the API endpoint):\n", 183 | "database_id = os.environ[\"ASTRA_DB_API_ENDPOINT\"][8 : 8 + 36]\n", 184 | "\n", 185 | "cassio.init(\n", 186 | " database_id=database_id,\n", 187 | " token=os.environ[\"ASTRA_DB_APPLICATION_TOKEN\"],\n", 188 | " keyspace=os.getenv(\"ASTRA_DB_KEYSPACE\") or None,\n", 189 | ")\n", 190 | "\n", 191 | "session = cassio.config.resolve_session()\n", 192 | "keyspace = cassio.config.resolve_keyspace()\n", 193 | "session.execute(f\"USE {keyspace};\")\n", 194 | "\n", 195 | "print(f\"Connected to Astra DB with session={session}, keyspace={keyspace}.\")" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "id": "6a47bad8-9e13-41fa-af6d-724b72f702cd", 202 | "metadata": { 203 | "id": "6a47bad8-9e13-41fa-af6d-724b72f702cd" 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "# Tool to run CQL statements\n", 208 | "def execute_statement(statement: str):\n", 209 | " # Simple wrapper that runs a CQL statement on our session and returns all rows;\n", 210 | " # on failure it prints the statement and the error, then re-raises\n", 211 | " try:\n", 212 | " rows = session.execute(statement)\n", 213 | " return rows.all()\n", 214 | " except Exception as e:\n", 215 | " print(f\"Query '{statement}' failed with error {e}.\")\n", 216 | " raise" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "id": 
"890b2537-f585-4f31-bb20-ed71910eb586", 222 | "metadata": { 223 | "id": "890b2537-f585-4f31-bb20-ed71910eb586" 224 | }, 225 | "source": [ 226 | "#### (Optional) Dummy DB Setup" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "id": "db89558d-f24b-4517-bf46-97c65c2071ac", 232 | "metadata": { 233 | "id": "db89558d-f24b-4517-bf46-97c65c2071ac" 234 | }, 235 | "source": [ 236 | "Feel free to skip this section if you are adapting the notebook to an existing Cassandra database. Here, we use the CQL session opened above (CassIO builds on the Python `cassandra-driver` package) to create some sample tables. The schema is taken from [this DataStax example](https://www.datastax.com/learn/data-modeling-by-example/digital-library-data-model) on creating a data model for a digital music library." 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 8, 242 | "id": "df4e69c2-0cda-42e3-8718-8f7ac6da230f", 243 | "metadata": { 244 | "id": "df4e69c2-0cda-42e3-8718-8f7ac6da230f" 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "# Create all necessary tables\n", 249 | "create_tables_cql = \"\"\"CREATE TABLE performers (\n", 250 | " name TEXT PRIMARY KEY,\n", 251 | " type TEXT,\n", 252 | " country TEXT,\n", 253 | " born INT,\n", 254 | " died INT,\n", 255 | " founded INT\n", 256 | ");\n", 257 | "\n", 258 | "CREATE TABLE albums_by_performer (\n", 259 | " performer TEXT,\n", 260 | " year INT,\n", 261 | " title TEXT,\n", 262 | " genre TEXT,\n", 263 | " PRIMARY KEY (performer, year, title)\n", 264 | ") WITH CLUSTERING ORDER BY (year DESC, title ASC);\n", 265 | "\n", 266 | "CREATE TABLE albums_by_title (\n", 267 | " title TEXT,\n", 268 | " year INT,\n", 269 | " performer TEXT,\n", 270 | " genre TEXT,\n", 271 | " PRIMARY KEY (title, year)\n", 272 | ") WITH CLUSTERING ORDER BY (year DESC);\n", 273 | "\n", 274 | "CREATE TABLE albums_by_genre (\n", 275 | " genre TEXT,\n", 276 | " year INT,\n", 277 | " performer TEXT,\n", 278 | " title TEXT,\n", 279 | " PRIMARY 
KEY (genre, year, performer, title)\n", 280 | ") WITH CLUSTERING ORDER BY (year DESC);\n", 281 | "\n", 282 | "CREATE TABLE tracks_by_title (\n", 283 | " title TEXT,\n", 284 | " album_title TEXT,\n", 285 | " album_year INT,\n", 286 | " number INT,\n", 287 | " length INT,\n", 288 | " genre TEXT,\n", 289 | " PRIMARY KEY (title, album_title, album_year, number)\n", 290 | ") WITH CLUSTERING ORDER BY (album_title ASC, album_year DESC, number ASC);\n", 291 | "\n", 292 | "CREATE TABLE tracks_by_album (\n", 293 | " album_title TEXT,\n", 294 | " album_year INT,\n", 295 | " number INT,\n", 296 | " title TEXT,\n", 297 | " length INT,\n", 298 | " genre TEXT STATIC,\n", 299 | " PRIMARY KEY (album_title, album_year, number)\n", 300 | ") WITH CLUSTERING ORDER BY (album_year DESC, number ASC);\n", 301 | "\n", 302 | "CREATE TABLE users (\n", 303 | " id UUID PRIMARY KEY,\n", 304 | " name TEXT\n", 305 | ");\n", 306 | "\n", 307 | "CREATE TABLE tracks_by_user (\n", 308 | " id UUID,\n", 309 | " month DATE,\n", 310 | " timestamp TIMESTAMP,\n", 311 | " album_title TEXT,\n", 312 | " album_year INT,\n", 313 | " number INT,\n", 314 | " title TEXT,\n", 315 | " length INT,\n", 316 | " PRIMARY KEY (id, timestamp)\n", 317 | ") WITH CLUSTERING ORDER BY (timestamp DESC);\"\"\"" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 9, 323 | "id": "03471e5c-47f0-4abd-b742-673a2524f0e2", 324 | "metadata": { 325 | "id": "03471e5c-47f0-4abd-b742-673a2524f0e2" 326 | }, 327 | "outputs": [ 328 | { 329 | "name": "stdout", 330 | "output_type": "stream", 331 | "text": [ 332 | "Running statement 'CREATE TABLE performers ( name TEXT ...' ... done.\n", 333 | "Running statement 'CREATE TABLE albums_by_performer ( p ...' ... done.\n", 334 | "Running statement 'CREATE TABLE albums_by_title ( title ...' ... done.\n", 335 | "Running statement 'CREATE TABLE albums_by_genre ( genre ...' ... done.\n", 336 | "Running statement 'CREATE TABLE tracks_by_title ( title ...' ... 
done.\n", 337 | "Running statement 'CREATE TABLE tracks_by_album ( album ...' ... done.\n", 338 | "Running statement 'CREATE TABLE users ( id UUID PRIMARY ...' ... done.\n", 339 | "Running statement 'CREATE TABLE tracks_by_user ( id UUI ...' ... done.\n" 340 | ] 341 | } 342 | ], 343 | "source": [ 344 | "# This parses the text above into executable strings by the driver\n", 345 | "for statement in create_tables_cql.split(\";\"):\n", 346 | " _statement = statement.strip().replace(\"\\n\", \" \")\n", 347 | " if _statement:\n", 348 | " print(f\"Running statement '{_statement[:40]} ...' ...\", end=\"\")\n", 349 | " execute_statement(_statement)\n", 350 | " print(\" done.\")" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 10, 356 | "id": "4698c114-1b77-4613-8555-1122b5a60295", 357 | "metadata": { 358 | "id": "4698c114-1b77-4613-8555-1122b5a60295" 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "# Now populate with some fake data\n", 363 | "insert_fake_data_cql = \"\"\"\n", 364 | "-- Insert data into performers\n", 365 | "INSERT INTO performers (name, type, country, born, died, founded) VALUES ('The Beatles', 'Band', 'UK', 1960, NULL, 1960);\n", 366 | "INSERT INTO performers (name, type, country, born, died, founded) VALUES ('Adele', 'Solo', 'UK', 1988, NULL, NULL);\n", 367 | "INSERT INTO performers (name, type, country, born, died, founded) VALUES ('Elton John', 'Solo', 'UK', 1947, NULL, NULL);\n", 368 | "INSERT INTO performers (name, type, country, born, died, founded) VALUES ('Queen', 'Band', 'UK', 1970, NULL, 1970);\n", 369 | "INSERT INTO performers (name, type, country, born, died, founded) VALUES ('Taylor Swift', 'Solo', 'USA', 1989, NULL, NULL);\n", 370 | "\n", 371 | "-- Insert data into albums by performer, title, and genre\n", 372 | "-- Assuming 'Pop' as a genre for all for simplicity\n", 373 | "INSERT INTO albums_by_performer (performer, year, title, genre) VALUES ('The Beatles', 1967, 'Sgt. 
Pepper''s Lonely Hearts Club Band', 'Pop');\n", 374 | "INSERT INTO albums_by_performer (performer, year, title, genre) VALUES ('Adele', 2015, '25', 'Pop');\n", 375 | "INSERT INTO albums_by_performer (performer, year, title, genre) VALUES ('Elton John', 1973, 'Goodbye Yellow Brick Road', 'Pop');\n", 376 | "INSERT INTO albums_by_performer (performer, year, title, genre) VALUES ('Queen', 1975, 'A Night at the Opera', 'Pop');\n", 377 | "INSERT INTO albums_by_performer (performer, year, title, genre) VALUES ('Taylor Swift', 2014, '1989', 'Pop');\n", 378 | "\n", 379 | "-- Repeat for albums_by_title\n", 380 | "INSERT INTO albums_by_title (title, year, performer, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 1967, 'The Beatles', 'Pop');\n", 381 | "INSERT INTO albums_by_title (title, year, performer, genre) VALUES ('25', 2015, 'Adele', 'Pop');\n", 382 | "INSERT INTO albums_by_title (title, year, performer, genre) VALUES ('Goodbye Yellow Brick Road', 1973, 'Elton John', 'Pop');\n", 383 | "INSERT INTO albums_by_title (title, year, performer, genre) VALUES ('A Night at the Opera', 1975, 'Queen', 'Pop');\n", 384 | "INSERT INTO albums_by_title (title, year, performer, genre) VALUES ('1989', 2014, 'Taylor Swift', 'Pop');\n", 385 | "\n", 386 | "-- Repeat for albums_by_genre\n", 387 | "INSERT INTO albums_by_genre (genre, year, performer, title) VALUES ('Pop', 1967, 'The Beatles', 'Sgt. 
Pepper''s Lonely Hearts Club Band');\n", 388 | "INSERT INTO albums_by_genre (genre, year, performer, title) VALUES ('Pop', 2015, 'Adele', '25');\n", 389 | "INSERT INTO albums_by_genre (genre, year, performer, title) VALUES ('Pop', 1973, 'Elton John', 'Goodbye Yellow Brick Road');\n", 390 | "INSERT INTO albums_by_genre (genre, year, performer, title) VALUES ('Pop', 1975, 'Queen', 'A Night at the Opera');\n", 391 | "INSERT INTO albums_by_genre (genre, year, performer, title) VALUES ('Pop', 2014, 'Taylor Swift', '1989');\n", 392 | "\n", 393 | "-- Insert data into tracks_by_title and tracks_by_album\n", 394 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Lucy in the Sky with Diamonds', 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 1, 208, 'Pop');\n", 395 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('With a Little Help from My Friends', 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 2, 163, 'Pop');\n", 396 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 3, 122, 'Pop');\n", 397 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Getting Better', 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 4, 174, 'Pop');\n", 398 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Fixing a Hole', 'Sgt. 
Pepper''s Lonely Hearts Club Band', 1967, 5, 139, 'Pop');\n", 399 | "\n", 400 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Hello', '25', 2015, 1, 295, 'Pop');\n", 401 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Send My Love', '25', 2015, 2, 223, 'Pop');\n", 402 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('I Miss You', '25', 2015, 3, 350, 'Pop');\n", 403 | "\n", 404 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Candle in the Wind', 'Goodbye Yellow Brick Road', 1973, 1, 219, 'Pop');\n", 405 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Bennie and the Jets', 'Goodbye Yellow Brick Road', 1973, 2, 323, 'Pop');\n", 406 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Goodbye Yellow Brick Road', 'Goodbye Yellow Brick Road', 1973, 3, 193, 'Pop');\n", 407 | "\n", 408 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Bohemian Rhapsody', 'A Night at the Opera', 1975, 1, 354, 'Pop');\n", 409 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Love of My Life', 'A Night at the Opera', 1975, 2, 220, 'Pop');\n", 410 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Youre My Best Friend', 'A Night at the Opera', 1975, 3, 178, 'Pop');\n", 411 | "\n", 412 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Welcome to New York', '1989', 2014, 1, 212, 'Pop');\n", 413 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES ('Blank Space', '1989', 2014, 2, 231, 'Pop');\n", 414 | "INSERT INTO tracks_by_title (title, album_title, album_year, number, length, genre) VALUES 
('Style', '1989', 2014, 3, 230, 'Pop');\n", 415 | "\n", 416 | "-- Repeat for tracks_by_album with corresponding track numbers\n", 417 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 1967, 1, 'Lucy in the Sky with Diamonds', 208, 'Pop');\n", 418 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 1967, 2, 'With a Little Help from My Friends', 163, 'Pop');\n", 419 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 1967, 3, 'Sgt. Pepper''s Lonely Hearts Club Band', 122, 'Pop');\n", 420 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Sgt. Pepper''s Lonely Hearts Club Band', 1967, 4, 'Getting Better', 174, 'Pop');\n", 421 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Sgt. 
Pepper''s Lonely Hearts Club Band', 1967, 5, 'Fixing a Hole', 139, 'Pop');\n", 422 | "\n", 423 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('25', 2015, 1, 'Hello', 295, 'Pop');\n", 424 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('25', 2015, 2, 'Send My Love', 223, 'Pop');\n", 425 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('25', 2015, 3, 'I Miss You', 350, 'Pop');\n", 426 | "\n", 427 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Goodbye Yellow Brick Road', 1973, 1, 'Candle in the Wind', 219, 'Pop');\n", 428 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Goodbye Yellow Brick Road', 1973, 2, 'Bennie and the Jets', 323, 'Pop');\n", 429 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('Goodbye Yellow Brick Road', 1973, 3, 'Goodbye Yellow Brick Road', 193, 'Pop');\n", 430 | "\n", 431 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('A Night at the Opera', 1975, 1, 'Bohemian Rhapsody', 354, 'Pop');\n", 432 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('A Night at the Opera', 1975, 2, 'Love of My Life', 220, 'Pop');\n", 433 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('A Night at the Opera', 1975, 3, 'Youre My Best Friend', 178, 'Pop');\n", 434 | "\n", 435 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('1989', 2014, 1, 'Welcome to New York', 212, 'Pop');\n", 436 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, genre) VALUES ('1989', 2014, 2, 'Blank Space', 231, 'Pop');\n", 437 | "INSERT INTO tracks_by_album (album_title, album_year, number, title, length, 
genre) VALUES ('1989', 2014, 3, 'Style', 230, 'Pop');\n", 438 | "\n", 439 | "-- Insert data into users\n", 440 | "INSERT INTO users (id, name) VALUES (uuid(), 'John Doe');\n", 441 | "INSERT INTO users (id, name) VALUES (uuid(), 'Jane Smith');\n", 442 | "INSERT INTO users (id, name) VALUES (uuid(), 'Emily Johnson');\n", 443 | "INSERT INTO users (id, name) VALUES (uuid(), 'Michael Brown');\n", 444 | "INSERT INTO users (id, name) VALUES (uuid(), 'Jessica Davis');\n", 445 | "\n", 446 | "-- Insert data into tracks_by_user\n", 447 | "-- User ids should be copied from the users insert statements once generated\n", 448 | "-- The following are placeholders and should be replaced with actual UUIDs\n", 449 | "INSERT INTO tracks_by_user (id, month, timestamp, album_title, album_year, number, title, length) VALUES (uuid(), '2024-01-01', toTimestamp(now()), 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 1, 'Lucy in the Sky with Diamonds', 208);\n", 450 | "INSERT INTO tracks_by_user (id, month, timestamp, album_title, album_year, number, title, length) VALUES (uuid(), '2024-01-01', toTimestamp(now()), 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 2, 'With a Little Help from My Friends', 163);\n", 451 | "INSERT INTO tracks_by_user (id, month, timestamp, album_title, album_year, number, title, length) VALUES (uuid(), '2024-01-01', toTimestamp(now()), 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 3, 'Sgt. Pepper''s Lonely Hearts Club Band', 122);\n", 452 | "INSERT INTO tracks_by_user (id, month, timestamp, album_title, album_year, number, title, length) VALUES (uuid(), '2024-01-01', toTimestamp(now()), 'Sgt. Pepper''s Lonely Hearts Club Band', 1967, 4, 'Getting Better', 174);\n", 453 | "INSERT INTO tracks_by_user (id, month, timestamp, album_title, album_year, number, title, length) VALUES (uuid(), '2024-01-01', toTimestamp(now()), 'Sgt. 
Pepper''s Lonely Hearts Club Band', 1967, 5, 'Fixing a Hole', 139);\n", 454 | "\"\"\"" 455 | ] 456 | }, 457 | { 458 | "cell_type": "code", 459 | "execution_count": 11, 460 | "id": "71fe8d93-322d-407f-a1ed-406366a9ddbf", 461 | "metadata": { 462 | "id": "71fe8d93-322d-407f-a1ed-406366a9ddbf" 463 | }, 464 | "outputs": [ 465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "Running statement 'INSERT INTO performers (name, type, coun ...' ... done.\n", 470 | "Running statement 'INSERT INTO performers (name, type, coun ...' ... done.\n", 471 | "Running statement 'INSERT INTO performers (name, type, coun ...' ... done.\n", 472 | "Running statement 'INSERT INTO performers (name, type, coun ...' ... done.\n", 473 | "Running statement 'INSERT INTO performers (name, type, coun ...' ... done.\n", 474 | "Running statement 'INSERT INTO albums_by_performer (perform ...' ... done.\n", 475 | "Running statement 'INSERT INTO albums_by_performer (perform ...' ... done.\n", 476 | "Running statement 'INSERT INTO albums_by_performer (perform ...' ... done.\n", 477 | "Running statement 'INSERT INTO albums_by_performer (perform ...' ... done.\n", 478 | "Running statement 'INSERT INTO albums_by_performer (perform ...' ... done.\n", 479 | "Running statement 'INSERT INTO albums_by_title (title, year ...' ... done.\n", 480 | "Running statement 'INSERT INTO albums_by_title (title, year ...' ... done.\n", 481 | "Running statement 'INSERT INTO albums_by_title (title, year ...' ... done.\n", 482 | "Running statement 'INSERT INTO albums_by_title (title, year ...' ... done.\n", 483 | "Running statement 'INSERT INTO albums_by_title (title, year ...' ... done.\n", 484 | "Running statement 'INSERT INTO albums_by_genre (genre, year ...' ... done.\n", 485 | "Running statement 'INSERT INTO albums_by_genre (genre, year ...' ... done.\n", 486 | "Running statement 'INSERT INTO albums_by_genre (genre, year ...' ... 
done.\n", 487 | "Running statement 'INSERT INTO albums_by_genre (genre, year ...' ... done.\n", 488 | "Running statement 'INSERT INTO albums_by_genre (genre, year ...' ... done.\n", 489 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 490 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 491 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 492 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 493 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 494 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 495 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 496 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 497 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 498 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 499 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 500 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 501 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 502 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 503 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 504 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 505 | "Running statement 'INSERT INTO tracks_by_title (title, albu ...' ... done.\n", 506 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 507 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 508 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 509 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... 
done.\n", 510 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 511 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 512 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 513 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 514 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 515 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 516 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 517 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 518 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 519 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 520 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 521 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 522 | "Running statement 'INSERT INTO tracks_by_album (album_title ...' ... done.\n", 523 | "Running statement 'INSERT INTO users (id, name) VALUES (uui ...' ... done.\n", 524 | "Running statement 'INSERT INTO users (id, name) VALUES (uui ...' ... done.\n", 525 | "Running statement 'INSERT INTO users (id, name) VALUES (uui ...' ... done.\n", 526 | "Running statement 'INSERT INTO users (id, name) VALUES (uui ...' ... done.\n", 527 | "Running statement 'INSERT INTO users (id, name) VALUES (uui ...' ... done.\n", 528 | "Running statement 'INSERT INTO tracks_by_user (id, month, t ...' ... done.\n", 529 | "Running statement 'INSERT INTO tracks_by_user (id, month, t ...' ... done.\n", 530 | "Running statement 'INSERT INTO tracks_by_user (id, month, t ...' ... done.\n", 531 | "Running statement 'INSERT INTO tracks_by_user (id, month, t ...' ... done.\n", 532 | "Running statement 'INSERT INTO tracks_by_user (id, month, t ...' ... 
done.\n" 533 | ] 534 | } 535 | ], 536 | "source": [ 537 | "# This parses the text above into executable strings by the driver\n", 538 | "for line in insert_fake_data_cql.split(\"\\n\"):\n", 539 | " if \";\" in line:\n", 540 | " _statement = line.replace(\"\\n\", \" \")\n", 541 | " print(f\"Running statement '{_statement[:40]} ...' ...\", end=\"\")\n", 542 | " execute_statement(_statement)\n", 543 | " print(\" done.\")" 544 | ] 545 | }, 546 | { 547 | "cell_type": "markdown", 548 | "id": "2495b975-8342-4386-b92d-5ae05b783853", 549 | "metadata": { 550 | "id": "2495b975-8342-4386-b92d-5ae05b783853" 551 | }, 552 | "source": [ 553 | "## (Optional) Give the LLM Additional Context with the Built-in `comment` Property" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "id": "23820cb9-d683-48bb-9440-b398446df4c9", 559 | "metadata": { 560 | "id": "23820cb9-d683-48bb-9440-b398446df4c9" 561 | }, 562 | "source": [ 563 | "The quality of an LLM's response depends greatly on the context it is given: the more concise descriptions it has access to, the better. We can choose to augment the DB schema we pass to the model by using the built-in `comment` property of CQL tables.\n", 564 | "\n", 565 | 
"NOTE: You can also set these comments at table creation time by adding `comment = '...'` to the table's `WITH` options (multiple options are chained with `AND`)." 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 12, 571 | "id": "04d67d6c-6938-46ac-a5b5-8baaea54272a", 572 | "metadata": { 573 | "id": "04d67d6c-6938-46ac-a5b5-8baaea54272a" 574 | }, 575 | "outputs": [], 576 | "source": [ 577 | "add_comments_cql = \"\"\"\n", 578 | "ALTER TABLE albums_by_genre WITH comment = 'Albums partitioned by musical genre';\n", 579 | "ALTER TABLE albums_by_performer WITH comment = 'Albums partitioned by name of performer/artist';\n", 580 | "ALTER TABLE albums_by_title WITH comment = 'Albums partitioned by album title';\n", 581 | "ALTER TABLE performers WITH comment = 'Performers/artists partitioned by performer name';\n", 582 | "ALTER TABLE tracks_by_album WITH comment = 'Tracks/songs partitioned by album title';\n", 583 | "ALTER TABLE tracks_by_title WITH comment = 'Tracks/songs partitioned by song title';\n", 584 | "ALTER TABLE tracks_by_user WITH comment = 'Tracks/songs users listened to partitioned by user ID and time of listen';\n", 585 | "ALTER TABLE users WITH comment = 'Users partitioned by user ID';\n", 586 | "\"\"\"" 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 13, 592 | "id": "db29bea2-7600-495b-b57b-e2937fb0753f", 593 | "metadata": { 594 | "id": "db29bea2-7600-495b-b57b-e2937fb0753f" 595 | }, 596 | "outputs": [ 597 | { 598 | "name": "stdout", 599 | "output_type": "stream", 600 | "text": [ 601 | "Running statement 'ALTER TABLE albums_by_genre WITH comment ...' ... done.\n", 602 | "Running statement 'ALTER TABLE albums_by_performer WITH com ...' ... done.\n", 603 | "Running statement 'ALTER TABLE albums_by_title WITH comment ...' ... done.\n", 604 | "Running statement 'ALTER TABLE performers WITH comment = 'P ...' ... done.\n", 605 | "Running statement 'ALTER TABLE tracks_by_album WITH comment ...' ... done.\n", 606 | "Running statement 'ALTER TABLE tracks_by_title WITH comment ...' ... 
done.\n", 607 | "Running statement 'ALTER TABLE tracks_by_user WITH comment ...' ... done.\n", 608 | "Running statement 'ALTER TABLE users WITH comment = 'Users ...' ... done.\n" 609 | ] 610 | } 611 | ], 612 | "source": [ 613 | "# Split the text above into single statements and run each one through the driver\n", 614 | "for line in add_comments_cql.split(\"\\n\"):\n", 615 | " if \";\" in line:\n", 616 | " _statement = line.replace(\"\\n\", \" \")\n", 617 | " print(f\"Running statement '{_statement[:40]} ...' ...\", end=\"\")\n", 618 | " execute_statement(_statement)\n", 619 | " print(\" done.\")" 620 | ] 621 | }, 622 | { 623 | "cell_type": "markdown", 624 | "id": "1a07cc49-7ba3-45ee-b2da-7fd297d8a86d", 625 | "metadata": { 626 | "id": "1a07cc49-7ba3-45ee-b2da-7fd297d8a86d" 627 | }, 628 | "source": [ 629 | "## Run Queries from User Questions" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "id": "942613bc-25aa-41cc-80e9-78ca15bcee06", 635 | "metadata": { 636 | "id": "942613bc-25aa-41cc-80e9-78ca15bcee06" 637 | }, 638 | "source": [ 639 | "#### Generating & Executing CQL" 640 | ] 641 | }, 642 | { 643 | "cell_type": "markdown", 644 | "id": "8542b137-2fd1-4105-a402-17358647f815", 645 | "metadata": { 646 | "id": "8542b137-2fd1-4105-a402-17358647f815" 647 | }, 648 | "source": [ 649 | "Now we can ask the LLM (here, OpenAI's `gpt-3.5-turbo`) to generate queries that answer our questions. The prompt template is taken from [SQL-PaLM](https://arxiv.org/abs/2306.00739) and adapted to the CQL use case. To fill it in, we first need to retrieve the schema from our DB."
 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": 14, 655 | "id": "7f442866-55e8-4284-83e1-705bb320250b", 656 | "metadata": { 657 | "id": "7f442866-55e8-4284-83e1-705bb320250b" 658 | }, 659 | "outputs": [], 660 | "source": [ 661 | "TEXT2CQL_PROMPT = \"\"\"Convert the question to CQL (Cassandra Query Language)\n", 662 | "that can retrieve an appropriate answer, or answer saying that the data model\n", 663 | "does not support answering such a question in a performant way:\n", 664 | "\n", 665 | "[Schema : values (type)]\n", 666 | "{schema}\n", 667 | "\n", 668 | "[Partition Keys]\n", 669 | "{partition_keys}\n", 670 | "\n", 671 | "[Clustering Keys]\n", 672 | "{clustering_keys}\n", 673 | "\n", 674 | "[Q]\n", 675 | "{question}\n", 676 | "\n", 677 | "[CQL]\n", 678 | "\"\"\"\n", 679 | "\n", 680 | "\n", 681 | "def generate_schema_partition_clustering_keys(keyspace: str = keyspace) -> tuple[str, str, str]:\n", 682 | " \"\"\"Generates a TEXT2CQL_PROMPT compatible schema, partition keys, and clustering keys for a keyspace\"\"\"\n", 683 | " # Get all table names in our keyspace\n", 684 | " table_names = execute_statement(\n", 685 | " f\"SELECT table_name, comment FROM system_schema.tables WHERE keyspace_name = '{keyspace}'\"\n", 686 | " )\n", 687 | " tn_str = \", \".join([\"'\" + tn.table_name + \"'\" for tn in table_names])\n", 688 | "\n", 689 | " # Now get all the column names corresponding to those tables\n", 690 | " columns = execute_statement(\n", 691 | " f\"SELECT * FROM system_schema.columns WHERE table_name IN ({tn_str}) AND keyspace_name = '{keyspace}' ALLOW FILTERING\"\n", 692 | " )\n", 693 | "\n", 694 | " # Now, we construct our prompt template formatted schema, partition_keys, and clustering keys\n", 695 | " # from the table and column objects returned from the DB\n", 696 | " schema = \" | \".join([\n", 697 | " f\"{table.table_name} '{table.comment}' : \" + \" , \".join([\n", 698 | " f\"{col.column_name} ({col.type})\"\n", 699 | " for col in columns\n", 700 | " if col.table_name == 
table.table_name\n", 701 | " ])\n", 702 | " for table in table_names\n", 703 | " ])\n", 704 | " partition_keys = \" | \".join([\n", 705 | " f\"{table.table_name} : \" + \" , \".join([\n", 706 | " col.column_name for col in columns\n", 707 | " if col.table_name == table.table_name\n", 708 | " and col.kind == \"partition_key\"\n", 709 | " ])\n", 710 | " for table in table_names\n", 711 | " ])\n", 712 | " clustering_keys = \" | \".join([\n", 713 | " f\"{table.table_name} : \" + \" , \".join([\n", 714 | " f\"{col.column_name} ({col.clustering_order})\" for col in columns\n", 715 | " if col.table_name == table.table_name\n", 716 | " and col.kind == \"clustering\"\n", 717 | " ])\n", 718 | " for table in table_names\n", 719 | " ])\n", 720 | " return schema, partition_keys, clustering_keys\n", 721 | "\n", 722 | "\n", 723 | "def execute_query_from_question(question: str, debug_cql: bool = True, debug_prompt: bool = False, return_cql: bool = False):\n", 724 | " \"\"\"Generates and executes CQL from a user question based on LLM output\"\"\"\n", 725 | " # Get all of the variables necessary to fill out the prompt\n", 726 | " schema, partition_keys, clustering_keys = generate_schema_partition_clustering_keys()\n", 727 | " prompt = TEXT2CQL_PROMPT.format(\n", 728 | " schema=schema,\n", 729 | " partition_keys=partition_keys,\n", 730 | " clustering_keys=clustering_keys,\n", 731 | " question=question,\n", 732 | " )\n", 733 | "\n", 734 | " if debug_prompt:\n", 735 | " print(f\"Prompting model with:\\n{prompt}\")\n", 736 | "\n", 737 | " # Get generated CQL from the LLM (in this case gpt-3.5-turbo)\n", 738 | " completion = client.chat.completions.create(\n", 739 | " messages=[{\n", 740 | " \"role\": \"user\",\n", 741 | " \"content\": prompt,\n", 742 | " }],\n", 743 | " model=\"gpt-3.5-turbo\",\n", 744 | " ).choices[0].message.content\n", 745 | "\n", 746 | " if debug_cql:\n", 747 | " print(f\"Question: {question}\\nGenerated Query: {completion}\\n\")\n", 748 | "\n", 749 | " # Need to 
trim trailing ';' if present to work with cassandra-driver\n", 750 | " if completion.find(\";\") > -1:\n", 751 | " completion = completion[:completion.find(\";\")]\n", 752 | "\n", 753 | " results = execute_statement(completion)\n", 754 | "\n", 755 | " if return_cql:\n", 756 | " return (results, completion)\n", 757 | " else:\n", 758 | " return results" 759 | ] 760 | }, 761 | { 762 | "cell_type": "code", 763 | "execution_count": 15, 764 | "id": "dd55a635-d1f5-4463-afff-b98b7bf23b56", 765 | "metadata": { 766 | "colab": { 767 | "base_uri": "https://localhost:8080/" 768 | }, 769 | "id": "dd55a635-d1f5-4463-afff-b98b7bf23b56", 770 | "outputId": "66f548a0-c268-4d3a-cba6-5fb3fa3dc755" 771 | }, 772 | "outputs": [ 773 | { 774 | "name": "stdout", 775 | "output_type": "stream", 776 | "text": [ 777 | "Prompting model with:\n", 778 | "Convert the question to CQL (Cassandra Query Language)\n", 779 | "that can retrieve an appropriate answer, or answer saying that the data model\n", 780 | "does not support answering such a question in a performant way:\n", 781 | "\n", 782 | "[Schema : values (type)]\n", 783 | "albums_by_genre 'Albums partitioned by musical genre' : genre (text) , performer (text) , title (text) , year (int) | albums_by_performer 'Albums partitioned by name of performer/artist' : genre (text) , performer (text) , title (text) , year (int) | albums_by_title 'Albums partitioned by album title' : genre (text) , performer (text) , title (text) , year (int) | performers 'Performers/artists partitioned by performer name' : born (int) , country (text) , died (int) , founded (int) , name (text) , type (text) | tracks_by_album 'Tracks/songs partitioned by album title' : album_title (text) , album_year (int) , genre (text) , length (int) , number (int) , title (text) | tracks_by_title 'Tracks/songs partitioned by song title' : album_title (text) , album_year (int) , genre (text) , length (int) , number (int) , title (text) | tracks_by_user 'Tracks/songs users listened to 
partitioned by user ID and time of listen' : album_title (text) , album_year (int) , id (uuid) , length (int) , month (date) , number (int) , timestamp (timestamp) , title (text) | users 'Users partitioned by user ID' : id (uuid) , name (text)\n", 784 | "\n", 785 | "[Partition Keys]\n", 786 | "albums_by_genre : genre | albums_by_performer : performer | albums_by_title : title | performers : name | tracks_by_album : album_title | tracks_by_title : title | tracks_by_user : id | users : id\n", 787 | "\n", 788 | "[Clustering Keys]\n", 789 | "albums_by_genre : performer (asc) , title (asc) , year (desc) | albums_by_performer : title (asc) , year (desc) | albums_by_title : year (desc) | performers : | tracks_by_album : album_year (desc) , number (asc) | tracks_by_title : album_title (asc) , album_year (desc) , number (asc) | tracks_by_user : timestamp (desc) | users : \n", 790 | "\n", 791 | "[Q]\n", 792 | "What songs are on A Night at the Opera?\n", 793 | "\n", 794 | "[CQL]\n", 795 | "\n", 796 | "Question: What songs are on A Night at the Opera?\n", 797 | "Generated Query: SELECT title FROM tracks_by_album WHERE album_title = 'A Night at the Opera';\n", 798 | "\n" 799 | ] 800 | }, 801 | { 802 | "data": { 803 | "text/plain": [ 804 | "[Row(title='Bohemian Rhapsody'),\n", 805 | " Row(title='Love of My Life'),\n", 806 | " Row(title='Youre My Best Friend')]" 807 | ] 808 | }, 809 | "execution_count": 15, 810 | "metadata": {}, 811 | "output_type": "execute_result" 812 | } 813 | ], 814 | "source": [ 815 | "# Show full prompting trace\n", 816 | "execute_query_from_question(\"What songs are on A Night at the Opera?\", debug_prompt=True)" 817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "execution_count": 16, 822 | "id": "b3ae4ade-dc9c-4cae-824f-98edc5bc5208", 823 | "metadata": { 824 | "colab": { 825 | "base_uri": "https://localhost:8080/" 826 | }, 827 | "id": "b3ae4ade-dc9c-4cae-824f-98edc5bc5208", 828 | "outputId": "bb3c1b4b-dd61-48c9-99bf-8d2db543bbe0" 829 | }, 830 | 
"outputs": [ 831 | { 832 | "name": "stdout", 833 | "output_type": "stream", 834 | "text": [ 835 | "Question: What are some of the most recent Pop albums in the last decade?\n", 836 | "Generated Query: SELECT * FROM albums_by_genre WHERE genre = 'Pop' AND year > 2010 ALLOW FILTERING;\n", 837 | "\n" 838 | ] 839 | }, 840 | { 841 | "data": { 842 | "text/plain": [ 843 | "[Row(genre='Pop', year=2015, performer='Adele', title='25'),\n", 844 | " Row(genre='Pop', year=2014, performer='Taylor Swift', title='1989')]" 845 | ] 846 | }, 847 | "execution_count": 16, 848 | "metadata": {}, 849 | "output_type": "execute_result" 850 | } 851 | ], 852 | "source": [ 853 | "execute_query_from_question(\"What are some of the most recent Pop albums in the last decade?\")" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": 17, 859 | "id": "51e97f95-5f44-4de5-9342-1c00b30918c4", 860 | "metadata": { 861 | "colab": { 862 | "base_uri": "https://localhost:8080/" 863 | }, 864 | "id": "51e97f95-5f44-4de5-9342-1c00b30918c4", 865 | "outputId": "4231b990-e0f7-4824-e3f4-46450bbcc3ec" 866 | }, 867 | "outputs": [ 868 | { 869 | "name": "stdout", 870 | "output_type": "stream", 871 | "text": [ 872 | "Question: How many albums has Taylor Swift made?\n", 873 | "Generated Query: SELECT COUNT(*) FROM albums_by_performer WHERE performer = 'Taylor Swift';\n", 874 | "\n" 875 | ] 876 | }, 877 | { 878 | "data": { 879 | "text/plain": [ 880 | "[Row(count=1)]" 881 | ] 882 | }, 883 | "execution_count": 17, 884 | "metadata": {}, 885 | "output_type": "execute_result" 886 | } 887 | ], 888 | "source": [ 889 | "execute_query_from_question(\"How many albums has Taylor Swift made?\")" 890 | ] 891 | }, 892 | { 893 | "cell_type": "markdown", 894 | "id": "7adb085f-48af-422b-9e9e-15c2d02a4f58", 895 | "metadata": { 896 | "id": "7adb085f-48af-422b-9e9e-15c2d02a4f58" 897 | }, 898 | "source": [ 899 | "Pretty cool that it can find the data to answer our questions! 
Let's see if we can take this one step further, and actually generate coherent responses using this data:" 900 | ] 901 | }, 902 | { 903 | "cell_type": "markdown", 904 | "id": "d8afb916-f30b-4bed-9e00-5e86b261d788", 905 | "metadata": { 906 | "id": "d8afb916-f30b-4bed-9e00-5e86b261d788" 907 | }, 908 | "source": [ 909 | "#### End to End Question Answering" 910 | ] 911 | }, 912 | { 913 | "cell_type": "markdown", 914 | "id": "09aa4a19-a284-4800-abb0-8d0e24185c38", 915 | "metadata": { 916 | "id": "09aa4a19-a284-4800-abb0-8d0e24185c38" 917 | }, 918 | "source": [ 919 | "Now, let's wrap up by showing how we can make a subsequent LLM call to answer the user's question with natural language. This completes a full \"RAG\" style pipeline!" 920 | ] 921 | }, 922 | { 923 | "cell_type": "code", 924 | "execution_count": 18, 925 | "id": "dbb6002e-d5f1-408c-9bd3-e7ca4a329483", 926 | "metadata": { 927 | "id": "dbb6002e-d5f1-408c-9bd3-e7ca4a329483" 928 | }, 929 | "outputs": [], 930 | "source": [ 931 | "ANSWER_PROMPT = \"\"\"Query:\n", 932 | "```\n", 933 | "{cql}\n", 934 | "```\n", 935 | "\n", 936 | "Output:\n", 937 | "```\n", 938 | "{results_repr}\n", 939 | "```\n", 940 | "===\n", 941 | "\n", 942 | "Given the above results from querying the DB, answer the following user question:\n", 943 | "\n", 944 | "{question}\n", 945 | "\"\"\"\n", 946 | "\n", 947 | "\n", 948 | "def answer_question(question: str, debug_cql: bool = False, debug_prompt: bool = False) -> str:\n", 949 | " \"\"\"Conducts a full RAG pipeline where the LLM retrieves relevant information\n", 950 | " and references it to answer the question in natural language.\n", 951 | " \"\"\"\n", 952 | " # Get necessary fields to fill out prompt\n", 953 | " query_results, cql = execute_query_from_question(\n", 954 | " question=question,\n", 955 | " debug_cql=debug_cql,\n", 956 | " debug_prompt=debug_prompt,\n", 957 | " return_cql=True,\n", 958 | " )\n", 959 | " prompt = ANSWER_PROMPT.format(\n", 960 | " question=question,\n", 961 | " 
results_repr=str(query_results),\n", 962 | " cql=cql,\n", 963 | " )\n", 964 | "\n", 965 | " if debug_prompt:\n", 966 | " print(f\"Prompting model with:\\n{prompt}\")\n", 967 | "\n", 968 | " # Return the generated answer from the LLM\n", 969 | " return client.chat.completions.create(\n", 970 | " messages=[{\n", 971 | " \"role\": \"user\",\n", 972 | " \"content\": prompt,\n", 973 | " }],\n", 974 | " model=\"gpt-3.5-turbo\",\n", 975 | " ).choices[0].message.content\n" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": 19, 981 | "id": "8533ecb1-9c8e-4db1-beb4-e736d2a1b500", 982 | "metadata": { 983 | "colab": { 984 | "base_uri": "https://localhost:8080/" 985 | }, 986 | "id": "8533ecb1-9c8e-4db1-beb4-e736d2a1b500", 987 | "outputId": "68d0beca-86cc-47b2-b96e-b6c6d1c5f1e8" 988 | }, 989 | "outputs": [ 990 | { 991 | "name": "stdout", 992 | "output_type": "stream", 993 | "text": [ 994 | "Prompting model with:\n", 995 | "Convert the question to CQL (Cassandra Query Language)\n", 996 | "that can retrieve an appropriate answer, or answer saying that the data model\n", 997 | "does not support answering such a question in a performant way:\n", 998 | "\n", 999 | "[Schema : values (type)]\n", 1000 | "albums_by_genre 'Albums partitioned by musical genre' : genre (text) , performer (text) , title (text) , year (int) | albums_by_performer 'Albums partitioned by name of performer/artist' : genre (text) , performer (text) , title (text) , year (int) | albums_by_title 'Albums partitioned by album title' : genre (text) , performer (text) , title (text) , year (int) | performers 'Performers/artists partitioned by performer name' : born (int) , country (text) , died (int) , founded (int) , name (text) , type (text) | tracks_by_album 'Tracks/songs partitioned by album title' : album_title (text) , album_year (int) , genre (text) , length (int) , number (int) , title (text) | tracks_by_title 'Tracks/songs partitioned by song title' : album_title (text) , album_year 
(int) , genre (text) , length (int) , number (int) , title (text) | tracks_by_user 'Tracks/songs users listened to partitioned by user ID and time of listen' : album_title (text) , album_year (int) , id (uuid) , length (int) , month (date) , number (int) , timestamp (timestamp) , title (text) | users 'Users partitioned by user ID' : id (uuid) , name (text)\n", 1001 | "\n", 1002 | "[Partition Keys]\n", 1003 | "albums_by_genre : genre | albums_by_performer : performer | albums_by_title : title | performers : name | tracks_by_album : album_title | tracks_by_title : title | tracks_by_user : id | users : id\n", 1004 | "\n", 1005 | "[Clustering Keys]\n", 1006 | "albums_by_genre : performer (asc) , title (asc) , year (desc) | albums_by_performer : title (asc) , year (desc) | albums_by_title : year (desc) | performers : | tracks_by_album : album_year (desc) , number (asc) | tracks_by_title : album_title (asc) , album_year (desc) , number (asc) | tracks_by_user : timestamp (desc) | users : \n", 1007 | "\n", 1008 | "[Q]\n", 1009 | "What songs are on A Night at the Opera?\n", 1010 | "\n", 1011 | "[CQL]\n", 1012 | "\n", 1013 | "Prompting model with:\n", 1014 | "Query:\n", 1015 | "```\n", 1016 | "SELECT title FROM tracks_by_album WHERE album_title = 'A Night at the Opera'\n", 1017 | "```\n", 1018 | "\n", 1019 | "Output:\n", 1020 | "```\n", 1021 | "[Row(title='Bohemian Rhapsody'), Row(title='Love of My Life'), Row(title='Youre My Best Friend')]\n", 1022 | "```\n", 1023 | "===\n", 1024 | "\n", 1025 | "Given the above results from querying the DB, answer the following user question:\n", 1026 | "\n", 1027 | "What songs are on A Night at the Opera?\n", 1028 | "\n", 1029 | "The songs on \"A Night at the Opera\" are \"Bohemian Rhapsody\", \"Love of My Life\", and \"You're My Best Friend\".\n" 1030 | ] 1031 | } 1032 | ], 1033 | "source": [ 1034 | "# Show full prompting trace\n", 1035 | "print(\n", 1036 | " answer_question(\"What songs are on A Night at the Opera?\", 
debug_prompt=True)\n", 1037 | ")" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "code", 1042 | "execution_count": 20, 1043 | "id": "32edcf7a-ea91-4ae9-9543-3b987a3cb695", 1044 | "metadata": { 1045 | "colab": { 1046 | "base_uri": "https://localhost:8080/" 1047 | }, 1048 | "id": "32edcf7a-ea91-4ae9-9543-3b987a3cb695", 1049 | "outputId": "713721a2-da55-45b5-8a4a-b285c588d6a1" 1050 | }, 1051 | "outputs": [ 1052 | { 1053 | "name": "stdout", 1054 | "output_type": "stream", 1055 | "text": [ 1056 | "Some of the most recent Pop albums in the last decade are Adele's \"25\" released in 2015 and Taylor Swift's \"1989\" released in 2014.\n" 1057 | ] 1058 | } 1059 | ], 1060 | "source": [ 1061 | "print(\n", 1062 | " answer_question(\"What are some of the most recent Pop albums in the last decade?\")\n", 1063 | ")" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": 21, 1069 | "id": "3083b6cb-7e4c-4738-b934-a4e49ac372cb", 1070 | "metadata": { 1071 | "colab": { 1072 | "base_uri": "https://localhost:8080/" 1073 | }, 1074 | "id": "3083b6cb-7e4c-4738-b934-a4e49ac372cb", 1075 | "outputId": "ae8b1eef-1871-4772-fec7-8ee279df7a8a" 1076 | }, 1077 | "outputs": [ 1078 | { 1079 | "name": "stdout", 1080 | "output_type": "stream", 1081 | "text": [ 1082 | "Taylor Swift has made 1 album.\n" 1083 | ] 1084 | } 1085 | ], 1086 | "source": [ 1087 | "print(\n", 1088 | " answer_question(\"How many albums has Taylor Swift made?\")\n", 1089 | ")" 1090 | ] 1091 | }, 1092 | { 1093 | "cell_type": "markdown", 1094 | "id": "29edd522-6cb9-454c-a5b3-b8ee0f8a9ea8", 1095 | "metadata": { 1096 | "id": "29edd522-6cb9-454c-a5b3-b8ee0f8a9ea8" 1097 | }, 1098 | "source": [ 1099 | "Awesome! Our model is answering questions based on just the data in our dummy DB, and is able to construct queries for retrieving that data in a fully automated way." 
1100 | ] 1101 | } 1102 | ], 1103 | "metadata": { 1104 | "colab": { 1105 | "collapsed_sections": [ 1106 | "890b2537-f585-4f31-bb20-ed71910eb586", 1107 | "2495b975-8342-4386-b92d-5ae05b783853" 1108 | ], 1109 | "provenance": [] 1110 | }, 1111 | "kernelspec": { 1112 | "display_name": "Python 3 (ipykernel)", 1113 | "language": "python", 1114 | "name": "python3" 1115 | }, 1116 | "language_info": { 1117 | "codemirror_mode": { 1118 | "name": "ipython", 1119 | "version": 3 1120 | }, 1121 | "file_extension": ".py", 1122 | "mimetype": "text/x-python", 1123 | "name": "python", 1124 | "nbconvert_exporter": "python", 1125 | "pygments_lexer": "ipython3", 1126 | "version": "3.12.8" 1127 | } 1128 | }, 1129 | "nbformat": 4, 1130 | "nbformat_minor": 5 1131 | } 1132 | --------------------------------------------------------------------------------