├── .env.example ├── .gitignore ├── LICENSE ├── LangChain Data Demo.ipynb ├── README.md ├── agent_steps.png ├── data └── .gitkeep └── requirements.txt /.env.example: -------------------------------------------------------------------------------- 1 | OPENAI_API_KEY=SETME 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | MANIFEST 2 | build 3 | dist 4 | _build 5 | docs/man/*.gz 6 | docs/source/api/generated 7 | docs/source/config.rst 8 | docs/gh-pages 9 | notebook/i18n/*/LC_MESSAGES/*.mo 10 | notebook/i18n/*/LC_MESSAGES/nbjs.json 11 | notebook/static/components 12 | notebook/static/style/*.min.css* 13 | notebook/static/*/js/built/ 14 | notebook/static/*/built/ 15 | notebook/static/built/ 16 | notebook/static/*/js/main.min.js* 17 | notebook/static/lab/*bundle.js 18 | node_modules 19 | *.py[co] 20 | __pycache__ 21 | *.egg-info 22 | *~ 23 | *.bak 24 | .ipynb_checkpoints 25 | .tox 26 | .DS_Store 27 | \#*# 28 | .#* 29 | .coverage 30 | src 31 | 32 | *.swp 33 | *.map 34 | .idea/ 35 | Read the Docs 36 | config.rst 37 | 38 | /.project 39 | /.pydevproject 40 | 41 | package-lock.json 42 | 43 | .vscode* 44 | 45 | .env 46 | data/*.txt 47 | data/*.db* -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023, Matt Hodges 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 
9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * Neither the name of LangChain-Data-Demo nor the names of its 15 | contributors may be used to endorse or promote products derived from 16 | this software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 19 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 20 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /LangChain Data Demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "attachments": {}, 5 | "cell_type": "markdown", 6 | "id": "31caaf76", 7 | "metadata": {}, 8 | "source": [ 9 | "# LangChain Data Demo\n", 10 | "\n", 11 | "_2023-04-01_\n", 12 | "\n", 13 | "**By Matt Hodges**\n", 14 | "\n", 15 | "This Jupyter Notebook demonstrates a variety of data tasks that can be accomplished with the assistance of the [LangChain](https://python.langchain.com/en/latest/index.html) framework. 
It shows how you can utilize [Chains](https://python.langchain.com/en/latest/modules/chains/getting_started.html), [Agents](https://python.langchain.com/en/latest/modules/agents/getting_started.html), and [Tools](https://python.langchain.com/en/latest/modules/agents/tools/getting_started.html) to work with voter data.\n", 16 | "\n", 17 | "LangChain is a flexible framework that can integrate with a [variety of LLMs](https://python.langchain.com/en/latest/reference/integrations.html). This Notebook integrates with OpenAI. To continue, obtain an [OpenAI API Key](https://platform.openai.com/). Copy the `.env.example` file to `.env` at the root of this repository, and set your key there. Since it can be hard to gauge utilization, I also recommend setting spending limits on your OpenAI account.\n", 18 | "\n", 19 | "This Notebook was written with LangChain version `0.0.128` using the `text-davinci-003` model from OpenAI.\n", 20 | "\n", 21 | "Also, be sure to install the requirements defined in `requirements.txt`:\n", 22 | "\n", 23 | "```\n", 24 | "pip install -r requirements.txt\n", 25 | "```" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "id": "b677a202", 31 | "metadata": {}, 32 | "source": [ 33 | "## Gathering Data\n", 34 | "\n", 35 | "This Notebook utilizes the [public voter data](https://dl.ncsbe.gov/index.html?prefix=data/) provided by the North Carolina State Board of Elections. We will work with a small subset of the NC voter file." 
36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 2, 41 | "id": "5820f828", 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "True" 48 | ] 49 | }, 50 | "execution_count": 2, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "import sqlite3\n", 57 | "import tempfile\n", 58 | "import urllib.request\n", 59 | "import zipfile\n", 60 | "\n", 61 | "import dotenv\n", 62 | "import pandas as pd\n", 63 | "from langchain import LLMChain, OpenAI, SQLDatabase, SQLDatabaseChain\n", 64 | "from langchain.agents import (AgentExecutor, Tool, ZeroShotAgent,\n", 65 | " initialize_agent, load_tools)\n", 66 | "\n", 67 | "dotenv.load_dotenv(override=False)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "37333925", 73 | "metadata": {}, 74 | "source": [ 75 | "Loading the data into the database is the most code we'll write in this project:" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 2, 81 | "id": "4118f0d2", 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "# Public data from NC Elections Board\n", 86 | "# https://dl.ncsbe.gov/index.html?prefix=data/\n", 87 | "nc_voter_url = \"https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter29.zip\"\n", 88 | "nc_vhis_url = \"https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvhis29.zip\"\n", 89 | "\n", 90 | "\n", 91 | "# Download the data and extract the Zip\n", 92 | "extracted = []\n", 93 | "for datafile in (nc_voter_url, nc_vhis_url):\n", 94 | "\n", 95 | " with tempfile.NamedTemporaryFile(suffix=\".zip\") as tmp_file:\n", 96 | " urllib.request.urlretrieve(datafile, tmp_file.name)\n", 97 | "\n", 98 | " with zipfile.ZipFile(tmp_file.name, \"r\") as zip_ref:\n", 99 | " zip_ref.extractall(\"./data/\")\n", 100 | " extracted += zip_ref.namelist()\n", 101 | "\n", 102 | "# Load the data into a local SQLite database\n", 103 | "# We are not concerned with setting Primary Keys or Foreign Keys\n", 104 | "for 
filename in extracted:\n", 105 | " table_name = filename.split(\".\")[0]\n", 106 | " path = f\"./data/{filename}\"\n", 107 | "\n", 108 | " df = pd.read_csv(path, delimiter=\"\\t\", encoding=\"latin1\")\n", 109 | " cols = list(df.columns)\n", 110 | " preliminary_col_types = [str(dt) for dt in list(df.dtypes)]\n", 111 | " col_types = []\n", 112 | " for col in preliminary_col_types:\n", 113 | " if col == \"int64\":\n", 114 | " col_types.append(\"INTEGER\")\n", 115 | " elif col == \"float64\":\n", 116 | " col_types.append(\"REAL\")\n", 117 | " else:\n", 118 | " col_types.append(\"TEXT\")\n", 119 | "\n", 120 | " column_def = ', '.join([f\"{col} {t}\" for col, t in zip(cols, col_types)])\n", 121 | "\n", 122 | " con = sqlite3.connect(\"./data/ncv.db\")\n", 123 | " cur = con.cursor()\n", 124 | " cur.execute(f\"CREATE TABLE IF NOT EXISTS {table_name} ({column_def});\")\n", 125 | "\n", 126 | " df.to_sql(table_name, con, if_exists=\"replace\", index=False)\n", 127 | " con.commit()\n", 128 | " con.close()" 129 | ] 130 | }, 131 | { 132 | "attachments": {}, 133 | "cell_type": "markdown", 134 | "id": "c3540d18", 135 | "metadata": {}, 136 | "source": [ 137 | "## Using the LLM to Write and Run SQL\n", 138 | "\n", 139 | "Now that our voter data is loaded into the database, we're ready to start working with it! We'll start by using the [SQLDatabaseChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html). We'll use it to craft and execute SQL queries. And we'll set it to `verbose` so that we can see and verify its work.\n", 140 | "\n", 141 | "**NOTE:** For data-sensitive projects, you can specify `return_direct=True` in the SQLDatabaseChain initialization to directly return the output of the SQL query without any additional formatting. This prevents the LLM from seeing any additional contents within the database (though it does default [init with 3 sample rows](https://github.com/hwchase17/langchain/blob/v0.0.128/langchain/sql_database.py#L22)). 
Note, however, that the LLM still has access to the database schema (i.e. dialect, table and key names) by default. If `return_direct` were not set, the results of the query would be sent to the LLM for further summarization." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 4, 147 | "id": "8f1c066a", 148 | "metadata": {}, 149 | "outputs": [], 150 | "source": [ 151 | "db = SQLDatabase.from_uri(\"sqlite:///./data/ncv.db\")\n", 152 | "llm = OpenAI(temperature=0)\n", 153 | "db_chain = SQLDatabaseChain(\n", 154 | " llm=llm,\n", 155 | " database=db,\n", 156 | " verbose=True, # Show its work\n", 157 | " return_direct=True, # Return the results without sending back to the LLM\n", 158 | ")" 159 | ] 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "id": "54682a22", 164 | "metadata": {}, 165 | "source": [ 166 | "And now we're ready to roll! Let's try a few!" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 4, 172 | "id": "fc15c0a0", 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "\n", 180 | "\n", 181 | "\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 182 | "How many voters are there? 
\n", 183 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT COUNT(*) FROM ncvoter29;\u001b[0m\n", 184 | "SQLResult: \u001b[33;1m\u001b[1;3m[(125937,)]\u001b[0m\n", 185 | "\u001b[1m> Finished chain.\u001b[0m\n" 186 | ] 187 | }, 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "{'query': 'How many voters are there?', 'result': '[(125937,)]'}" 192 | ] 193 | }, 194 | "execution_count": 4, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "db_chain(\"How many voters are there?\")" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 6, 206 | "id": "2f6ab346", 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "\n", 214 | "\n", 215 | "\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 216 | "What is the average age for each party? \n", 217 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT party_cd, AVG(age_at_year_end) FROM ncvoter29 GROUP BY party_cd;\u001b[0m\n", 218 | "SQLResult: \u001b[33;1m\u001b[1;3m[('DEM', 57.10139401882521), ('GRE', 20.0), ('LIB', 38.924418604651166), ('REP', 55.279202607493254), ('UNA', 47.284671532846716)]\u001b[0m\n", 219 | "\u001b[1m> Finished chain.\u001b[0m\n" 220 | ] 221 | }, 222 | { 223 | "data": { 224 | "text/plain": [ 225 | "{'query': 'What is the average age for each party?',\n", 226 | " 'result': \"[('DEM', 57.10139401882521), ('GRE', 20.0), ('LIB', 38.924418604651166), ('REP', 55.279202607493254), ('UNA', 47.284671532846716)]\"}" 227 | ] 228 | }, 229 | "execution_count": 6, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "db_chain(\"What is the average age for each party?\")" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 7, 241 | "id": "d4122699", 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "\n", 249 | "\n", 250 | 
"\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 251 | "What is the most common first name of DEM voters who voted in 2020? \n", 252 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT first_name, COUNT(*) AS count FROM ncvoter29 JOIN ncvhis29 ON ncvoter29.voter_reg_num = ncvhis29.voter_reg_num WHERE ncvhis29.election_lbl = '11/03/2020' AND ncvoter29.party_cd = 'DEM' GROUP BY first_name ORDER BY count DESC LIMIT 5;\u001b[0m\n", 253 | "SQLResult: \u001b[33;1m\u001b[1;3m[('JAMES', 268), ('MARY', 221), ('ROBERT', 215), ('MICHAEL', 174), ('WILLIAM', 167)]\u001b[0m\n", 254 | "\u001b[1m> Finished chain.\u001b[0m\n" 255 | ] 256 | }, 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "{'query': 'What is the most common first name of DEM voters who voted in 2020?',\n", 261 | " 'result': \"[('JAMES', 268), ('MARY', 221), ('ROBERT', 215), ('MICHAEL', 174), ('WILLIAM', 167)]\"}" 262 | ] 263 | }, 264 | "execution_count": 7, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "db_chain(\"What is the most common first name of DEM voters who voted in 2020?\")" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 8, 276 | "id": "cc77811d", 277 | "metadata": {}, 278 | "outputs": [ 279 | { 280 | "name": "stdout", 281 | "output_type": "stream", 282 | "text": [ 283 | "\n", 284 | "\n", 285 | "\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 286 | "The unique id is ncid. How many voters have no voting history? \n", 287 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT COUNT(DISTINCT ncvoter29.ncid) FROM ncvoter29 LEFT JOIN ncvhis29 ON ncvoter29.ncid = ncvhis29.ncid WHERE ncvhis29.ncid IS NULL;\u001b[0m\n", 288 | "SQLResult: \u001b[33;1m\u001b[1;3m[(18215,)]\u001b[0m\n", 289 | "\u001b[1m> Finished chain.\u001b[0m\n" 290 | ] 291 | }, 292 | { 293 | "data": { 294 | "text/plain": [ 295 | "{'query': 'The unique id is ncid. 
How many voters have no voting history?',\n", 296 | " 'result': '[(18215,)]'}" 297 | ] 298 | }, 299 | "execution_count": 8, 300 | "metadata": {}, 301 | "output_type": "execute_result" 302 | } 303 | ], 304 | "source": [ 305 | "db_chain(\"The unique id is ncid. How many voters have no voting history?\")" 306 | ] 307 | }, 308 | { 309 | "attachments": {}, 310 | "cell_type": "markdown", 311 | "id": "8e841de0", 312 | "metadata": {}, 313 | "source": [ 314 | "## Using the LLM to Visualize the Data\n", 315 | "\n", 316 | "This is really cool! As you can see, we didn't have to actually tell the LLM much about the database to start querying it with natural language. It was able to infer the schema largely on its own!\n", 317 | "\n", 318 | "But the power here doesn't stop just at writing and executing SQL. We can actually _chain_ more utility. Let's get the LLM to use Python to do some data viz. To do that, we'll leverage the [Python REPL](https://python.langchain.com/en/latest/modules/agents/tools/examples/python.html) tool, and define a [custom agent](https://python.langchain.com/en/latest/modules/agents/agents/custom_agent.html) that uses our SQLite database as a tool.\n", 319 | "\n", 320 | "⚠️ **NOTE:** this executes LLM-generated Python code! This can be bad if the LLM-generated Python code is harmful. You may wish to execute this in a Virtual Machine or other sandboxed environment." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 5, 326 | "id": "55cec215", 327 | "metadata": {}, 328 | "outputs": [], 329 | "source": [ 330 | "# Add python_repl to our list of tools\n", 331 | "tools = load_tools([\"python_repl\"])\n", 332 | "\n", 333 | "# Define our voter_data tool\n", 334 | "\n", 335 | "# Set a description to help the LLM know when and how to use it.\n", 336 | "description = (\n", 337 | " \"Useful for when you need to answer questions about voters. \"\n", 338 | " \"You must not input SQL. 
Use this more than the Python tool if the question \"\n", 339 | " \"is about voter data, like 'how many DEM voters are there?' or 'count the number of precincts'\"\n", 340 | ")\n", 341 | "\n", 342 | "voter_data = Tool(\n", 343 | " name=\"Data\", # We'll just call it 'Data'\n", 344 | " func=db_chain.run,\n", 345 | " description=description,\n", 346 | ")\n", 347 | "\n", 348 | "tools.append(voter_data)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "id": "549046a4", 354 | "metadata": {}, 355 | "source": [ 356 | "We create the custom agent by working with the `ZeroShotAgent`. Most of the modifications come from updating the prompt and specifying the tools." 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 11, 362 | "id": "633fa84e", 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "# Standard prefix\n", 367 | "prefix = \"Fulfill the following request as best you can. You have access to the following tools:\"\n", 368 | "\n", 369 | "# Remind the agent of the Data tool, and what types of input it expects\n", 370 | "suffix = (\n", 371 | " \"Begin! When looking for data, do not write a SQL query. 
\"\n", 372 | " \"Pass the relevant portion of the request directly to the Data tool in its entirety.\"\n", 373 | " \"\\n\\n\"\n", 374 | " \"Request: {input}\\n\"\n", 375 | " \"{agent_scratchpad}\"\n", 376 | ")\n", 377 | "\n", 378 | "# The agent's prompt is built with the list of tools, prefix, suffix, and input variables\n", 379 | "prompt = ZeroShotAgent.create_prompt(\n", 380 | " tools, prefix=prefix, suffix=suffix, input_variables=[\"input\", \"agent_scratchpad\"]\n", 381 | ")\n", 382 | "\n", 383 | "# Set up the llm_chain\n", 384 | "llm_chain = LLMChain(llm=llm, prompt=prompt)\n", 385 | "\n", 386 | "# Specify the tools the agent may use\n", 387 | "tool_names = [tool.name for tool in tools]\n", 388 | "agent = ZeroShotAgent(llm_chain=llm_chain, allowed_tools=tool_names)\n", 389 | "\n", 390 | "# Create the AgentExecutor\n", 391 | "agent_executor = AgentExecutor.from_agent_and_tools(\n", 392 | " agent=agent, tools=tools, verbose=True\n", 393 | ")" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "id": "556525b3", 399 | "metadata": {}, 400 | "source": [ 401 | "And we can look at the prompt:" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": 12, 407 | "id": "d09702d3", 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "name": "stdout", 412 | "output_type": "stream", 413 | "text": [ 414 | "Fulfill the following request as best you can. You have access to the following tools:\n", 415 | "\n", 416 | "Python REPL: A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.\n", 417 | "Data: Useful for when you need to answer questions about voters. You must not input SQL. Use this more than the Python tool if the question is about voter data, like 'how many DEM voters are there?' 
or 'count the number of precincts'\n", 418 | "\n", 419 | "Use the following format:\n", 420 | "\n", 421 | "Question: the input question you must answer\n", 422 | "Thought: you should always think about what to do\n", 423 | "Action: the action to take, should be one of [Python REPL, Data]\n", 424 | "Action Input: the input to the action\n", 425 | "Observation: the result of the action\n", 426 | "... (this Thought/Action/Action Input/Observation can repeat N times)\n", 427 | "Thought: I now know the final answer\n", 428 | "Final Answer: the final answer to the original input question\n", 429 | "\n", 430 | "Begin! When looking for data, do not write a SQL query. Pass the relevant portion of the request directly to the Data tool in its entirety.\n", 431 | "\n", 432 | "Request: {input}\n", 433 | "{agent_scratchpad}\n" 434 | ] 435 | } 436 | ], 437 | "source": [ 438 | "print(prompt.template)" 439 | ] 440 | }, 441 | { 442 | "attachments": {}, 443 | "cell_type": "markdown", 444 | "id": "f85138b4", 445 | "metadata": {}, 446 | "source": [ 447 | "Now let's give it a go! Below you can see the choices the agent makes based on its inputs and observations. Here it decides to query the database, pass the results to Python, and use `matplotlib` to draw a chart." 
448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 15, 453 | "id": "73401e11", 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "name": "stdout", 458 | "output_type": "stream", 459 | "text": [ 460 | "\n", 461 | "\n", 462 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", 463 | "\u001b[32;1m\u001b[1;3m\n", 464 | "Thought: I need to find the most common first names of DEM voters who voted in 2020\n", 465 | "Action: Data\n", 466 | "Action Input: Find the most common first names of DEM voters who voted in 2020\u001b[0m\n", 467 | "\n", 468 | "\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 469 | "Find the most common first names of DEM voters who voted in 2020 \n", 470 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT first_name, COUNT(*) AS count FROM ncvoter29 JOIN ncvhis29 ON ncvoter29.voter_reg_num = ncvhis29.voter_reg_num WHERE ncvhis29.election_lbl = '11/03/2020' AND ncvoter29.party_cd = 'DEM' GROUP BY first_name ORDER BY count DESC LIMIT 5;\u001b[0m\n", 471 | "SQLResult: \u001b[33;1m\u001b[1;3m[('JAMES', 268), ('MARY', 221), ('ROBERT', 215), ('MICHAEL', 174), ('WILLIAM', 167)]\u001b[0m\n", 472 | "\u001b[1m> Finished chain.\u001b[0m\n", 473 | "\n", 474 | "Observation: \u001b[33;1m\u001b[1;3m[('JAMES', 268), ('MARY', 221), ('ROBERT', 215), ('MICHAEL', 174), ('WILLIAM', 167)]\u001b[0m\n", 475 | "Thought:\u001b[32;1m\u001b[1;3m I need to visualize this data\n", 476 | "Action: Python REPL\n", 477 | "Action Input: import matplotlib.pyplot as plt; names = [('JAMES', 268), ('MARY', 221), ('ROBERT', 215), ('MICHAEL', 174), ('WILLIAM', 167)]; x, y = zip(*names); plt.bar(x, y)\u001b[0m\n", 478 | "Observation: \u001b[36;1m\u001b[1;3m\u001b[0m\n", 479 | "Thought:\u001b[32;1m\u001b[1;3m I now know the final answer\n", 480 | "Final Answer: A bar graph visualizing the most common first names of DEM voters who voted in 2020.\u001b[0m\n", 481 | "\n", 482 | "\u001b[1m> Finished chain.\u001b[0m\n" 483 | ] 484 | }, 485 | { 486 | 
"data": { 487 | "text/plain": [ 488 | "'A bar graph visualizing the most common first names of DEM voters who voted in 2020.'" 489 | ] 490 | }, 491 | "execution_count": 15, 492 | "metadata": {}, 493 | "output_type": "execute_result" 494 | }, 495 | { 496 | "data": { 497 | "image/png": "", 498 | "text/plain": [ 499 | "
" 500 | ] 501 | }, 502 | "metadata": {}, 503 | "output_type": "display_data" 504 | } 505 | ], 506 | "source": [ 507 | "request = \"Show a bar graph visualizing the answer to the following question:\" \\\n", 508 | " \"What are the most common first names of DEM voters who voted in 2020?\"\n", 509 | "\n", 510 | "agent_executor.run(request)" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "id": "52ee8cd7", 516 | "metadata": {}, 517 | "source": [ 518 | "Very cool! As we see, it took the problem statement, crafted the SQL query, realized it could use the results within Python, and then wrote and executed the code to draw the graph!\n", 519 | "\n", 520 | "Let's do one more:" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 16, 526 | "id": "333228d6", 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "\n", 534 | "\n", 535 | "\u001b[1m> Entering new AgentExecutor chain...\u001b[0m\n", 536 | "\u001b[32;1m\u001b[1;3m\n", 537 | "Thought: I need to get the data first\n", 538 | "Action: Data\n", 539 | "Action Input: What are the most common voting methods in 2020\u001b[0m\n", 540 | "\n", 541 | "\u001b[1m> Entering new SQLDatabaseChain chain...\u001b[0m\n", 542 | "What are the most common voting methods in 2020 \n", 543 | "SQLQuery:\u001b[32;1m\u001b[1;3m SELECT voting_method, COUNT(*) AS count FROM ncvhis29 WHERE election_desc = '11/03/2020 GENERAL' GROUP BY voting_method ORDER BY count DESC LIMIT 5;\u001b[0m\n", 544 | "SQLResult: \u001b[33;1m\u001b[1;3m[('ABSENTEE ONESTOP', 56553), ('IN-PERSON', 17872), ('ABSENTEE BY MAIL', 12568), ('ABSENTEE CURBSIDE', 2101), ('TRANSFER', 413)]\u001b[0m\n", 545 | "\u001b[1m> Finished chain.\u001b[0m\n", 546 | "\n", 547 | "Observation: \u001b[33;1m\u001b[1;3m[('ABSENTEE ONESTOP', 56553), ('IN-PERSON', 17872), ('ABSENTEE BY MAIL', 12568), ('ABSENTEE CURBSIDE', 2101), ('TRANSFER', 413)]\u001b[0m\n", 548 | "Thought:" 549 | ] 550 | 
}, 551 | { 552 | "data": { 553 | "image/png": "", 554 | "text/plain": [ 555 | "
" 556 | ] 557 | }, 558 | "metadata": {}, 559 | "output_type": "display_data" 560 | }, 561 | { 562 | "name": "stdout", 563 | "output_type": "stream", 564 | "text": [ 565 | "\u001b[32;1m\u001b[1;3m I need to plot this data\n", 566 | "Action: Python REPL\n", 567 | "Action Input: import matplotlib.pyplot as plt; labels = [x[0] for x in [('ABSENTEE ONESTOP', 56553), ('IN-PERSON', 17872), ('ABSENTEE BY MAIL', 12568), ('ABSENTEE CURBSIDE', 2101), ('TRANSFER', 413)]]; sizes = [x[1] for x in [('ABSENTEE ONESTOP', 56553), ('IN-PERSON', 17872), ('ABSENTEE BY MAIL', 12568), ('ABSENTEE CURBSIDE', 2101), ('TRANSFER', 413)]]; plt.pie(sizes, labels=labels); plt.show()\u001b[0m\n", 568 | "Observation: \u001b[36;1m\u001b[1;3m\u001b[0m\n", 569 | "Thought:\u001b[32;1m\u001b[1;3m I now know the final answer\n", 570 | "Final Answer: A pie chart showing the most common voting methods in 2020.\u001b[0m\n", 571 | "\n", 572 | "\u001b[1m> Finished chain.\u001b[0m\n" 573 | ] 574 | }, 575 | { 576 | "data": { 577 | "text/plain": [ 578 | "'A pie chart showing the most common voting methods in 2020.'" 579 | ] 580 | }, 581 | "execution_count": 16, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "agent_executor.run(\"Plot a pie chart of the most common voting methods in 2020\")" 588 | ] 589 | }, 590 | { 591 | "attachments": {}, 592 | "cell_type": "markdown", 593 | "id": "2a9c17a1", 594 | "metadata": {}, 595 | "source": [ 596 | "## Some parting thoughts\n", 597 | "\n", 598 | "That OpenAI's LLM is robust enough to craft valid SQL with minimal guidance is astounding. That it can then pipe the query results to a Python environment and build out data viz is almost impossible to believe until you see it.\n", 599 | "\n", 600 | "That said, the examples in this Notebook are the product of many hours of trial an error. I'm sure I'll be more efficient next time around. But some of the queries just did not work. 
Similarly, the LLM sometimes got stuck debugging its own Python code (also astounding that it can do that!) before I decided to move on to other ideas. **This is a tool that should not be left unsupervised.** It may produce bad code. It may hallucinate. The efficiency gains are apparent, but many things just don't work yet. Similarly, the LangChain library is very much still in its infancy, and is rough around the edges.\n", 601 | "\n", 602 | "**Most importantly:** We must continue to study and remain vigilant against the unknown biases built into the systems. Neural Networks and LLMs are already black boxes that are hard to assess. Singular platforms acting as the gateway to them only obscure that more. And then adding all that into auto-generated code only obfuscates the risks further. This Notebook uses real voter data because it's accessible and of personal interest to me. Any organization hoping to use these tools for civic impact must devote significant resources to safety, risk prevention and mitigation, and set hard ethical red lines." 
603 | ] 604 | } 605 | ], 606 | "metadata": { 607 | "kernelspec": { 608 | "display_name": "Python 3 (ipykernel)", 609 | "language": "python", 610 | "name": "python3" 611 | }, 612 | "language_info": { 613 | "codemirror_mode": { 614 | "name": "ipython", 615 | "version": 3 616 | }, 617 | "file_extension": ".py", 618 | "mimetype": "text/x-python", 619 | "name": "python", 620 | "nbconvert_exporter": "python", 621 | "pygments_lexer": "ipython3", 622 | "version": "3.10.0" 623 | } 624 | }, 625 | "nbformat": 4, 626 | "nbformat_minor": 5 627 | } 628 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # LangChain Data Demo 2 | 3 | This is a [Jupyter Notebook](https://github.com/hodgesmr/LangChain-Data-Demo/blob/main/LangChain%20Data%20Demo.ipynb) that demonstrates a variety of data engineering and analysis tasks one can tackle with [LangChain](https://python.langchain.com/en/latest/index.html). It walks through using the LLM (via OpenAI) to write and execute SQL queries, and then pass the results of those queries to Python for data visualization. It uses public voter data files. 4 | 5 | This Notebook was written with LangChain version `0.0.128` using the `text-davinci-003` model from OpenAI. 6 | 7 | ![The LangChain agent executing SQL and Python to generate data visualization against the voter file](https://raw.githubusercontent.com/hodgesmr/LangChain-Data-Demo/main/agent_steps.png) 8 | 9 | ⚠️ **This Notebook generates and executes Python code written by an LLM. This has the potential to run unpredictable, buggy, or even harmful executions. 
Take caution to run this Notebook only in a sandboxed or appropriately controlled environment.** ⚠️ 10 | 11 | ## Setup 12 | 13 | Before you can run the [Notebook](https://github.com/hodgesmr/LangChain-Data-Demo/blob/main/LangChain%20Data%20Demo.ipynb), you need to copy [.env.example](./.env.example) to [.env](./.env) (which is ignored by git) and fill in the `OPENAI_API_KEY` environment variable. 14 | 15 | Also, be sure to install the requirements: 16 | 17 | ```sh 18 | pip install -r requirements.txt 19 | ``` 20 | 21 | I wrote and tested this Notebook in Python 3.10. 22 | 23 | ## License 24 | 25 | All code is provided under the [BSD 3-Clause license](https://github.com/hodgesmr/LangChain-Data-Demo/blob/main/LICENSE). 26 | 27 | ## North Carolina Voter Data 28 | 29 | The data used by the code in this repository originated from the North Carolina State Board of Elections, with the following [license](https://s3.amazonaws.com/dl.ncsbe.gov/data/ReadMe_PUBLIC_DATA.txt): 30 | 31 | ``` 32 | /* ******************************************************************************* 33 | * name: ReadMe_PUBLIC_DATA.txt 34 | * purpose: Notification to the public, media, and interested parties. 35 | * The data and documents contained within this publicly accessible site 36 | * and all subforders herein provided by the NC State Board of Elections 37 | * are considered public information per NC General Statutes. 38 | * URL: https://dl.ncsbe.gov/list.html 39 | * updated: 09/16/2020 40 | ******************************************************************************* */ 41 | 42 | Citations: 43 | 44 | § 132-1. Public Records. 45 | https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/BySection/Chapter_132/GS_132-1.pdf 46 | 47 | § 163-82.10. Official record of voter registration. 
48 | https://www.ncleg.gov/EnactedLegislation/Statutes/PDF/BySection/Chapter_163/GS_163-82.10.pdf 49 | ``` 50 | 51 | ## A Matt Hodges project 52 | 53 | This project is maintained by [@MattHodges](https://mastodon.social/@MattHodges). 54 | 55 | _Please use it for good, not evil._ 56 | -------------------------------------------------------------------------------- /agent_steps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hodgesmr/LangChain-Data-Demo/d910e1fde45e1abe5f86800aa8a17274a6822e89/agent_steps.png -------------------------------------------------------------------------------- /data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hodgesmr/LangChain-Data-Demo/d910e1fde45e1abe5f86800aa8a17274a6822e89/data/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | langchain==0.0.128 2 | matplotlib==3.7.* 3 | notebook==6.5.* 4 | openai==0.27.2 5 | pandas==1.5.* 6 | python-dotenv==1.0.* 7 | --------------------------------------------------------------------------------