├── Makefile ├── README.md ├── advanced-sql.md ├── baked-data.md ├── codespaces-jupyter.gif ├── conf.py ├── datasette-lite.md ├── datasette-publish.md ├── datasette.md ├── further-reading.md ├── index.md ├── requirements.txt ├── setup.md ├── sqlite-intro.md └── sqlite-utils.md /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = sqlite-utils 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | 22 | livehtml: 23 | sphinx-autobuild -b html "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data analysis with SQLite and Python 2 | 3 | A three hour tutorial presented at PyCon 2023. 4 | 5 | Now available as **[an official Datasette tutorial](https://datasette.io/tutorials/data-analysis)**. 6 | 7 | 2hr45m YouTube video of this tutorial: https://www.youtube.com/watch?v=5TdIxxBPUSI 8 | 9 | See also [Data analysis with SQLite and Python for PyCon 2023](https://simonwillison.net/2023/Apr/20/pycon-2023/) on my blog. 10 | -------------------------------------------------------------------------------- /advanced-sql.md: -------------------------------------------------------------------------------- 1 | # Advanced SQL 2 | 3 | ## Aggregations 4 | 5 | The simplest form of aggregation is the one Datasette does to implement its own faceting feature. 6 | 7 | ```sql 8 | select 9 | party, 10 | count(*) 11 | from 12 | executive_terms 13 | where 14 | type = 'prez' 15 | group by 16 | party 17 | ``` 18 | [Try that query here](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++party%2C%0D%0A++count%28*%29%0D%0Afrom%0D%0A++executive_terms%0D%0Awhere%0D%0A++type+%3D+%27prez%27%0D%0Agroup+by%0D%0A++party). 19 | 20 | The `group by` creates groups of rows, then the aggregation functions calculate a value across that entire group.
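Here's a minimal sketch of running that same aggregation from Python's `sqlite3` module - it assumes you have the `legislators.db` file that we download later in this tutorial:

```python
import sqlite3

db = sqlite3.connect("legislators.db")
sql = """
select party, count(*)
from executive_terms
where type = 'prez'
group by party
"""
# One row per group: the party plus the count of rows in that group
for party, num_terms in db.execute(sql):
    print(party, num_terms)
```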
21 | 22 | The most common aggregation functions are: 23 | 24 | - `count(*)` - count the number of rows in each group 25 | - `max(column)` - maximum value for a column 26 | - `min(column)` - minimum value for a column 27 | - `sum(column)` - sum up the values in the column 28 | 29 | Here's an example of `sum()` and `count()` from [What's in the RedPajama-Data-1T LLM training set](https://simonwillison.net/2023/Apr/17/redpajama-data/): 30 | ```sql 31 | select 32 | top_folders, 33 | sum(size_gb) as total_gb, 34 | count(*) as num_files 35 | from raw 36 | group by top_folders 37 | order by sum(size_gb) desc 38 | ``` 39 | [Run that in Datasette Lite](https://lite.datasette.io/?install=datasette-copyable&json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc). 40 | 41 | Change the `total_gb` line to this to convert it to an integer (note that `cast` truncates rather than rounding): 42 | ```sql 43 | cast (sum(size_gb) as integer) as total_gb, 44 | ``` 45 | ## Subqueries 46 | 47 | SQLite has excellent support for subqueries. You can use them in `where X in` clauses: 48 | 49 | ```sql 50 | select html_url from releases where repo in ( 51 | select id from repos where full_name in ( 52 | select repo from plugin_repos 53 | ) 54 | ) 55 | order by created_at desc 56 | ``` 57 | [Run that on datasette.io](https://datasette.io/content?sql=select+html_url+from+releases+where+repo+in+%28%0D%0A++select+id+from+repos+where+full_name+in+%28%0D%0A++++select+repo+from+plugin_repos%0D%0A++%29%0D%0A%29%0D%0Aorder+by+created_at+desc). Sometimes I find these to be more readable than joins! 58 | 59 | You can also use them directly in `select` clauses: 60 | 61 | ```sql 62 | select 63 | full_name, 64 | ( 65 | select 66 | html_url 67 | from 68 | releases 69 | where 70 | releases.repo = repos.id 71 | order by 72 | created_at desc 73 | limit 74 | 1 75 | ) as latest_release 76 | from 77 | repos 78 | ``` 79 | [Run that here](https://datasette.io/content?sql=select+full_name%2C+%28select+html_url+from+releases+where+releases.repo+%3D+repos.id+order+by+created_at+desc+limit+1%29+as+latest_release+from+repos). 80 | 81 | ## CTEs 82 | 83 | CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
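Before the fuller example below, here's a minimal sketch of the shape - the `numbers` alias only exists for the duration of this one query:

```python
import sqlite3

db = sqlite3.connect(":memory:")
sql = """
with numbers as (
    select 1 as n union all select 2 union all select 3
)
select sum(n) from numbers
"""
print(db.execute(sql).fetchone())
# (6,)
```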
84 | 85 | ```sql 86 | with presidents as ( 87 | select 88 | executives.name 89 | from 90 | executive_terms 91 | join executives 92 | on executive_terms.executive_id = executives.id 93 | where 94 | executive_terms.type = 'prez' 95 | ), 96 | vice_presidents as ( 97 | select 98 | executives.name 99 | from 100 | executive_terms 101 | join executives 102 | on executive_terms.executive_id = executives.id 103 | where 104 | executive_terms.type = 'viceprez' 105 | ) 106 | select 107 | distinct name 108 | from 109 | presidents 110 | where name in vice_presidents 111 | ``` 112 | [Try this CTE query here](https://congress-legislators.datasettes.com/legislators?sql=with+presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27prez%27%0D%0A%29%2C%0D%0Avice_presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27viceprez%27%0D%0A%29%0D%0Aselect%0D%0A++distinct+name%0D%0Afrom%0D%0A++presidents%0D%0Awhere%0D%0A++name+in+vice_presidents). 113 | 114 | ## JSON 115 | 116 | SQLite has excellent JSON functionality built in. Store JSON in a `text` column and you can query it using `json_extract()` - you can also build JSON values in `select` queries. 117 | 118 | [Returning related rows in a single SQL query using JSON](https://til.simonwillison.net/sqlite/related-rows-single-query) shows some advanced tricks you can do with this. 119 | 120 | ```sql 121 | select 122 | legislators.id, 123 | legislators.name, 124 | json_group_array(json_object( 125 | 'type', legislator_terms.type, 126 | 'state', legislator_terms.state, 127 | 'start', legislator_terms.start, 128 | 'end', legislator_terms.end, 129 | 'party', legislator_terms.party 130 | )) as terms, 131 | count(*) as num_terms 132 | from 133 | legislators join legislator_terms on legislator_terms.legislator_id = legislators.id 134 | group by legislators.id 135 | order by 136 | id 137 | limit 138 | 10 139 | ``` 140 | [Run that query](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++legislators.id%2C%0D%0A++legislators.name%2C%0D%0A++json_group_array(json_object(%0D%0A++++%27type%27%2C+legislator_terms.type%2C%0D%0A++++%27state%27%2C+legislator_terms.state%2C%0D%0A++++%27start%27%2C+legislator_terms.start%2C%0D%0A++++%27end%27%2C+legislator_terms.end%2C%0D%0A++++%27party%27%2C+legislator_terms.party%0D%0A+++))+as+terms%2C%0D%0A+++count(*)+as+num_terms%0D%0Afrom%0D%0A++legislators+join+legislator_terms+on+legislator_terms.legislator_id+%3D+legislators.id%0D%0A++group+by+legislators.id%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++10). 
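Here's a small sketch of `json_extract()` in action - the `docs` table and its contents are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table docs (id integer primary key, data text)")
db.execute(
    "insert into docs (data) values (?)",
    ['{"name": "Datasette", "tags": ["sqlite", "python"]}'],
)
# Path expressions pull values back out of the stored JSON
print(db.execute("""
    select
        json_extract(data, '$.name'),
        json_extract(data, '$.tags[0]')
    from docs
""").fetchall())
# [('Datasette', 'sqlite')]
```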
141 | 142 | Paul Ford [said about SQLite's JSON support](https://simonwillison.net/2018/Jan/29/paul-ford/): 143 | 144 | > The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.” 145 | 146 | ## Window functions 147 | 148 | I wanted to run a query that would return the following: 149 | 150 | - The repository name 151 | - The date of the most recent release from that repository (the releases table is a many-to-one against repos) 152 | - The total number of releases 153 | - **The three most recent releases** (as a JSON array of objects) 154 | 155 | 156 | ```sql 157 | with cte as ( 158 | select 159 | repos.full_name, 160 | releases.created_at, 161 | releases.id as rel_id, 162 | releases.name as rel_name, 163 | releases.created_at as rel_created_at, 164 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank 165 | from repos 166 | left join releases on releases.repo = repos.id 167 | ) 168 | select 169 | full_name, 170 | max(created_at) as max_created_at, 171 | count(rel_id) as releases_count, 172 | json_group_array( 173 | json_object( 174 | 'id', rel_id, 175 | 'name', rel_name, 176 | 'created_at', rel_created_at 177 | ) 178 | ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases 179 | from cte 180 | group by full_name 181 | order by releases_count desc 182 | ``` 183 | [Run that query here](https://datasette.io/content?sql=with+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max%28created_at%29+as+max_created_at%2C%0D%0A++count%28rel_id%29+as+releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+filter+%28where+rel_id+is+not+null+and+rel_rank+%3C%3D+3%29+as+recent_releases%0D%0Afrom+cte%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc). 
184 | 185 | [Running this smaller query](https://datasette.io/content?sql=with+cte+as+(%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank()+over+(partition+by+repos.id+order+by+releases.created_at+desc)+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A)%0D%0Aselect+*+from+cte) helps show what's going on with that `rel_rank` column: 186 | 187 | ```sql 188 | with cte as ( 189 | select 190 | repos.full_name, 191 | releases.created_at, 192 | releases.id as rel_id, 193 | releases.name as rel_name, 194 | releases.created_at as rel_created_at, 195 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank 196 | from repos 197 | left join releases on releases.repo = repos.id 198 | ) 199 | select * from cte 200 | ``` 201 | -------------------------------------------------------------------------------- /baked-data.md: -------------------------------------------------------------------------------- 1 | # Baked Data 2 | 3 | [The Baked Data architectural pattern](https://simonwillison.net/2021/Jul/28/baked-data/) describes this approach, which is key to taking full advantage of SQLite and Datasette. 4 | 5 | I like to build my databases in GitHub Actions. 6 | 7 | ## Niche Museums and TILs 8 | 9 | - <https://www.niche-museums.com/> is published from the [simonw/museums](https://github.com/simonw/museums) repository - one big YAML file for the content. 10 | - <https://til.simonwillison.net/> is published from [simonw/til](https://github.com/simonw/til) - separate Markdown files for each item. 11 | 12 | Both of these sites have Atom feeds that are defined using a Datasette [canned query](https://docs.datasette.io/en/stable/sql_queries.html#canned-queries), in conjunction with the [datasette-atom](https://datasette.io/plugins/datasette-atom) plugin. 13 | 14 | - <https://www.niche-museums.com/browse/feed.atom> 15 | - <https://til.simonwillison.net/tils/feed.atom> 16 | 17 | ## Generating a newsletter with an Observable notebook 18 | 19 | I wrote about this in [Semi-automating a Substack newsletter with an Observable notebook](https://simonwillison.net/2023/Apr/4/substack-observable/): 20 | 21 | - <https://datasette.simonwillison.net/> is a Datasette/SQLite copy of my Django blog, created using [db-to-sqlite](https://datasette.io/tools/db-to-sqlite) by my [simonwillisonblog-backup](https://github.com/simonw/simonwillisonblog-backup) GitHub repository. 22 | - <https://observablehq.com/@simonw/blog-to-newsletter> is my Observable notebook that assembles a newsletter from that data. 23 | - <https://simonw.substack.com/> is the Substack newsletter that I copy that content into. 24 | -------------------------------------------------------------------------------- /codespaces-jupyter.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simonw/sqlite-tutorial-pycon-2023/2037cca22c6e3a3d87e35796d73e912eef9d12b0/codespaces-jupyter.gif -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from subprocess import PIPE, Popen 5 | 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here.
18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | # 28 | # needs_sphinx = '1.0' 29 | 30 | # Add any Sphinx extension module names here, as strings. They can be 31 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 32 | # ones. 33 | extensions = ["myst_parser"] 34 | 35 | # Add any paths that contain templates here, relative to this directory. 36 | templates_path = ["_templates"] 37 | 38 | # The suffix(es) of source filenames. 39 | # You can specify multiple suffix as a list of string: 40 | # 41 | # source_suffix = ['.rst', '.md'] 42 | source_suffix = ".rst" 43 | 44 | # The master toctree document. 45 | master_doc = "index" 46 | 47 | # General information about the project. 48 | project = "Data analysis with SQLite and Python, PyCon 2023" 49 | copyright = "2023, Simon Willison" 50 | author = "Simon Willison" 51 | 52 | # The version info for the project you're documenting, acts as replacement for 53 | # |version| and |release|, also used in various other places throughout the 54 | # built documents. 55 | # 56 | # The short X.Y version. 57 | pipe = Popen("git describe --tags --always", stdout=PIPE, shell=True) 58 | git_version = pipe.stdout.read().decode("utf8") 59 | 60 | if git_version: 61 | version = git_version.rsplit("-", 1)[0] 62 | release = git_version 63 | else: 64 | version = "" 65 | release = "" 66 | 67 | # The language for content autogenerated by Sphinx. Refer to documentation 68 | # for a list of supported languages. 69 | # 70 | # This is also used if you do content translation via gettext catalogs. 71 | # Usually you set "language" from the command line for these cases. 72 | language = None 73 | 74 | # List of patterns, relative to source directory, that match files and 75 | # directories to ignore when looking for source files. 76 | # This patterns also effect to html_static_path and html_extra_path 77 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] 78 | 79 | # The name of the Pygments (syntax highlighting) style to use. 80 | pygments_style = "sphinx" 81 | 82 | # If true, `todo` and `todoList` produce output, else they produce nothing. 83 | todo_include_todos = False 84 | 85 | 86 | # -- Options for HTML output ---------------------------------------------- 87 | 88 | # The theme to use for HTML and HTML Help pages. See the documentation for 89 | # a list of builtin themes. 90 | # 91 | html_theme = "sphinx_rtd_theme" 92 | 93 | # Theme options are theme-specific and customize the look and feel of a theme 94 | # further. For a list of options available for each theme, see the 95 | # documentation. 96 | # 97 | # html_theme_options = {} 98 | 99 | # Add any paths that contain custom static files (such as style sheets) here, 100 | # relative to this directory. They are copied after the builtin static files, 101 | # so a file named "default.css" will overwrite the builtin "default.css". 102 | html_static_path = ["_static"] 103 | 104 | # Custom sidebar templates, must be a dictionary that maps document names 105 | # to template names. 
106 | # 107 | # This is required for the alabaster theme 108 | # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars 109 | html_sidebars = { 110 | "**": [ 111 | "relations.html", # needs 'show_related': True theme option to display 112 | "searchbox.html", 113 | ] 114 | } 115 | 116 | 117 | # -- Options for HTMLHelp output ------------------------------------------ 118 | 119 | # Output file base name for HTML help builder. 120 | htmlhelp_basename = "sqlite-tutorial-pycon-2023-doc" 121 | 122 | 123 | # -- Options for LaTeX output --------------------------------------------- 124 | 125 | latex_elements = { 126 | # The paper size ('letterpaper' or 'a4paper'). 127 | # 128 | # 'papersize': 'letterpaper', 129 | # The font size ('10pt', '11pt' or '12pt'). 130 | # 131 | # 'pointsize': '10pt', 132 | # Additional stuff for the LaTeX preamble. 133 | # 134 | # 'preamble': '', 135 | # Latex figure (float) alignment 136 | # 137 | # 'figure_align': 'htbp', 138 | } 139 | 140 | # Grouping the document tree into LaTeX files. List of tuples 141 | # (source start file, target name, title, 142 | # author, documentclass [howto, manual, or own class]). 143 | latex_documents = [ 144 | ( 145 | master_doc, 146 | "sqlite-tutorial-pycon-2023.tex", 147 | "SQLite Tutorial PyCon 2023", 148 | "Simon Willison", 149 | "manual", 150 | ) 151 | ] 152 | 153 | 154 | # -- Options for manual page output --------------------------------------- 155 | 156 | # One entry per manual page. List of tuples 157 | # (source start file, name, description, authors, manual section). 158 | man_pages = [ 159 | ( 160 | master_doc, 161 | "sqlite-tutorial-pycon-2023", 162 | "SQLite Tutorial PyCon 2023", 163 | [author], 164 | 1, 165 | ) 166 | ] 167 | 168 | 169 | # -- Options for Texinfo output ------------------------------------------- 170 | 171 | # Grouping the document tree into Texinfo files. List of tuples 172 | # (source start file, target name, title, author, 173 | # dir menu entry, description, category) 174 | texinfo_documents = [ 175 | ( 176 | master_doc, 177 | "sqlite-tutorial-pycon-2023", 178 | "SQLite Tutorial PyCon 2023", 179 | author, 180 | "sqlite-tutorial-pycon-2023", 181 | "A tutorial at PyCon 2023", 182 | "Miscellaneous", 183 | ) 184 | ] 185 | -------------------------------------------------------------------------------- /datasette-lite.md: -------------------------------------------------------------------------------- 1 | # Datasette Lite 2 | 3 | It's Datasette... running entirely in your web browser with WebAssembly and Pyodide! 4 | 5 | <https://lite.datasette.io/> 6 | 7 | ## Loading SQLite, CSV and JSON data 8 | 9 | - SQLite: load a database with the `?url=` parameter - see [Measuring traffic during the Half Moon Bay Pumpkin Festival](https://simonwillison.net/2022/Oct/19/measuring-traffic/) 10 | - CSV: load a CSV file with the `?csv=` parameter 11 | - JSON: load a JSON file with the `?json=` parameter 12 | 13 | ## Installing plugins 14 | 15 | Add `?install=name-of-plugin` to the URL to `pip install` that plugin into your browser's environment! 16 | 17 | This only works with a subset of plugins.
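For example, this URL opens Datasette Lite with the `datasette-copyable` plugin (the one used in the Advanced SQL section's example link) pre-installed:

```
https://lite.datasette.io/?install=datasette-copyable
```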
18 | 19 | 20 | 21 | ## Further reading 22 | 23 | - [Datasette Lite: a server-side Python web application running in a browser](https://simonwillison.net/2022/May/4/datasette-lite/) 24 | - [Plugin support for Datasette Lite](https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/) 25 | - [Joining CSV files in your browser using Datasette Lite](https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/) -------------------------------------------------------------------------------- /datasette-publish.md: -------------------------------------------------------------------------------- 1 | # Publishing a database to Vercel 2 | 3 | First, install both the Vercel CLI and the `datasette-publish-vercel` plugin. 4 | 5 | <https://vercel.com/docs/cli> has documentation for installing the Vercel CLI. 6 | 7 | On macOS: 8 | 9 | brew install vercel-cli 10 | 11 | Or use one of these: 12 | 13 | npm i -g vercel 14 | 15 | Or: 16 | 17 | pnpm i -g vercel 18 | 19 | Now run this command to log in: 20 | 21 | vercel login 22 | 23 | Install the plugin: 24 | 25 | datasette install datasette-publish-vercel 26 | 27 | And deploy the database: 28 | 29 | datasette publish vercel /tmp/peps.db --project python-peps 30 | 31 | ## Other publishing options 32 | 33 | Datasette [can publish](https://docs.datasette.io/en/stable/publish.html) to the following providers: 34 | 35 | - Heroku (`datasette publish heroku`) 36 | - Google Cloud Run (`datasette publish cloudrun`) 37 | - Vercel (with [datasette-publish-vercel](https://datasette.io/plugins/datasette-publish-vercel)) 38 | - Fly (with [datasette-publish-fly](https://datasette.io/plugins/datasette-publish-fly)) 39 | 40 | Further deployment options are described [in the documentation](https://docs.datasette.io/en/stable/deploying.html). 41 | -------------------------------------------------------------------------------- /datasette.md: -------------------------------------------------------------------------------- 1 | # Exploring data with Datasette 2 | 3 | [Datasette](https://datasette.io/) is "an open source multi-tool for exploring and publishing data". 4 | 5 | ## Installing Datasette locally 6 | 7 | ``` 8 | pip install datasette 9 | ``` 10 | Or if you prefer `pipx`: 11 | ``` 12 | pipx install datasette 13 | ``` 14 | Or Homebrew (on macOS): 15 | ``` 16 | brew install datasette 17 | ``` 18 | [More installation options](https://docs.datasette.io/en/stable/installation.html). 19 | 20 | In Codespaces you should also install the `datasette-codespaces` plugin: 21 | 22 | datasette install datasette-codespaces 23 | 24 | ## Try a database: legislators.db 25 | 26 | ``` 27 | wget https://congress-legislators.datasettes.com/legislators.db 28 | ``` 29 | 30 | This is a database of US legislators, presidents and vice presidents. 31 | 32 | You can explore it online at <https://congress-legislators.datasettes.com/> 33 | 34 | Open it in Datasette like this: 35 | 36 | datasette legislators.db 37 | 38 | We'll follow this tutorial to explore Datasette's features: **[Exploring a database with Datasette](https://datasette.io/tutorials/explore)** 39 | 40 | ## Install some plugins 41 | 42 | Datasette has over a hundred plugins: <https://datasette.io/plugins> 43 | 44 | You can `pip install` them, but it's better to use `datasette install` as that ensures they will go in the correct virtual environment, especially useful if you used `pipx` or Homebrew to install Datasette itself. 45 | 46 | datasette install datasette-cluster-map 47 | 48 | Now restart Datasette and visit the "offices" table to see the result.
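Restarting looks something like this - run the same command as before:

    datasette legislators.db

Then visit `http://localhost:8001/legislators/offices` (8001 is Datasette's default port) - the latitude and longitude columns in that table should now be rendered as a cluster map.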
49 | 50 | You can review what plugins are installed with: 51 | 52 | datasette plugins 53 | 54 | Or by visiting the `/-/plugins` page in Datasette. 55 | 56 | Plugins can be uninstalled with: 57 | 58 | datasette uninstall datasette-cluster-map 59 | 60 | ## Learning SQL with Datasette 61 | 62 | The "✎ View and edit SQL" link is a quick way to start learning basic SQL queries. 63 | 64 | We'll follow this tutorial next: **[Learn SQL with Datasette](https://datasette.io/tutorials/learn-sql)** 65 | 66 | 67 | -------------------------------------------------------------------------------- /further-reading.md: -------------------------------------------------------------------------------- 1 | # More demos and further reading 2 | 3 | ## Fun demos 4 | 5 | I post a lot of Datasette projects on my blog. Some of my recent favourites: 6 | 7 | - [Exploring the training data behind Stable Diffusion](https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/) describes how I built [laion-aesthetic.datasette.io](https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images) to explore 12m training images used for Stable Diffusion. 8 | - [Analyzing ScotRail audio announcements with Datasette—from prototype to production](https://simonwillison.net/2022/Aug/21/scotrail/) explains the [scotrail.datasette.io](https://scotrail.datasette.io/) project. 9 | - [I built a ChatGPT plugin to answer questions about data hosted in Datasette](https://simonwillison.net/2023/Mar/24/datasette-chatgpt-plugin/) is one of my earlier explorations at the intersection of LLM AI and Datasette. 10 | 11 | ## SpatiaLite 12 | 13 | - [Building a location to time zone API with SpatiaLite](https://datasette.io/tutorials/spatialite) shows how to use SpatiaLite and Datasette to create an API for looking up time zones based on latitude/longitude points. 14 | - [Drawing shapes on a map to query a SpatiaLite database](https://simonwillison.net/2021/Jan/24/drawing-shapes-spatialite/) introduces the `datasette-leaflet-freedraw` plugin and introduces the [calands.datasettes.com](https://calands.datasettes.com/) demo. 15 | - [Joining CSV and JSON data with an in-memory SQLite database](https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/) -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | # Data analysis with SQLite and Python, PyCon 2023 2 | 3 | On YouTube: [Data analysis with SQLite and Python - 2hr45m](https://www.youtube.com/watch?v=5TdIxxBPUSI) 4 | 5 | ```{toctree} 6 | --- 7 | maxdepth: 3 8 | --- 9 | setup 10 | sqlite-intro 11 | datasette 12 | sqlite-utils 13 | datasette-publish 14 | datasette-lite 15 | advanced-sql 16 | baked-data 17 | further-reading 18 | ``` 19 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx_rtd_theme 2 | sphinx-autobuild 3 | myst-parser 4 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # What you'll need 2 | 3 | ## python3 and pip 4 | 5 | For the first part of this tutorial, you'll need a Python 3 interpreter with the `sqlite3` standard library module available. 6 | 7 | You can run this on your own computer, or use a browser-based environment. 
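Either way, you can confirm that the `sqlite3` module is available - and see which SQLite version your Python bundles - with a quick snippet like this:

```python
import sqlite3

# Prints the version of the SQLite library compiled into Python
print(sqlite3.sqlite_version)
```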
8 | 9 | [Try Jupyter](https://jupyter.org/try-jupyter/) will work for a purely browser-based (WebAssembly) environment. 10 | 11 | For the second part, you'll also need the ability to `pip install` Python packages. 12 | 13 | Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use [GitHub Codespaces](https://github.com/github/codespaces-jupyter) or [Google Colab](https://colab.research.google.com/) or [Jupyter Hub](https://jupyter.org/try) or your online notebook solution of choice. 14 | 15 | ## Optional: GitHub Codespaces 16 | 17 | I'll be working through the tutorial using GitHub Codespaces, starting from <https://github.com/github/codespaces-jupyter> 18 | 19 | ![Animated demo of Codespaces Jupyter](codespaces-jupyter.gif) -------------------------------------------------------------------------------- /sqlite-intro.md: -------------------------------------------------------------------------------- 1 | # Introduction to SQLite 2 | 3 | ## Why SQLite? 4 | 5 | - It's the database you already have - `sqlite3` has been built into Python since 2006 6 | - It's screamingly fast, and surprisingly powerful 7 | - Amazing compatibility: bindings for every language, files work on every platform, fantastic track record for backwards compatibility, so it's safe to trust your data to a SQLite file 8 | - Databases are just files on disk. You can create and discard them without any ceremony. 9 | - It handles text (including JSON), integers, floating point numbers, and binary blobs. Which means it can store _anything_. 10 | - It can handle up to 281TB of data(!) 11 | - It has some interesting characteristics, for example [Many Small Queries Are Efficient In SQLite](https://www.sqlite.org/np1queryprob.html) 12 | 13 | ## First steps with Python 14 | 15 | Let's download a database to play with - we'll use the database that powers the [datasette.io](https://datasette.io/) website: 16 | 17 | ``` 18 | wget https://datasette.io/content.db 19 | ``` 20 | To access it from Python: 21 | ```python 22 | import sqlite3 23 | 24 | db = sqlite3.connect("content.db") 25 | 26 | print(db.execute("select sqlite_version()").fetchall()) 27 | # [('3.39.0',)] 28 | 29 | # Show rows from the plugin_repos table 30 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 31 | print(row) 32 | 33 | # Each row is a tuple. We can change that like this:
34 | db.row_factory = sqlite3.Row 35 | 36 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 37 | print(row) 38 | 39 | # This outputs <sqlite3.Row object at 0x...> for each row. 40 | # We can use dict() to turn those into dictionaries instead 41 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 42 | print(dict(row)) 43 | 44 | ``` 45 | 46 | ## Creating a table 47 | 48 | Let's create a table: 49 | ```python 50 | db.execute(""" 51 | create table peps ( 52 | id integer primary key, 53 | title text, 54 | author text, 55 | status text, 56 | type text, 57 | created text, 58 | body text 59 | ); 60 | """) 61 | ``` 62 | 63 | ## Inserting some data 64 | 65 | Here's a function I wrote that can parse a PEP: 66 | 67 | ```python 68 | def parse_pep(s): 69 | intro, body = s.split("\n\n", 1) 70 | pep = {} 71 | current_key = None 72 | current_value = None 73 | for line in intro.split("\n"): 74 | # If the line starts with whitespace, it's a continuation of the previous value 75 | if line.startswith(" ") or line.startswith("\t"): 76 | if current_key is not None: 77 | current_value += " " + line.strip() 78 | pep[current_key] = current_value.strip() 79 | else: 80 | # Split the line into key and value 81 | parts = line.split(": ", 1) 82 | if len(parts) == 2: 83 | key, value = parts 84 | # Update the current key and value 85 | current_key = key 86 | current_value = value 87 | # Add the key-value pair to the pep dictionary 88 | pep[current_key] = current_value.strip() 89 | pep["Body"] = body.strip() 90 | return pep 91 | ``` 92 | Let's fetch and parse the Zen of Python: 93 | ```python 94 | import urllib.request 95 | 96 | zen = urllib.request.urlopen( 97 | "https://raw.githubusercontent.com/python/peps/main/peps/pep-0020.rst" 98 | ).read().decode("utf-8") 99 | 100 | pep = parse_pep(zen) 101 | ``` 102 | And insert that into our database: 103 | ```python 104 | db.execute(""" 105 | insert into peps ( 106 | id, title, author, status, type, created, body 107 | ) values ( 108 | ?, ?, ?, ?, ?, ?, ? 109 | ) 110 | """, ( 111 | pep["PEP"], 112 | pep["Title"], 113 | pep["Author"], 114 | pep["Status"], 115 | pep["Type"], 116 | pep["Created"], 117 | pep["Body"], 118 | )) 119 | ``` 120 | Since this is a dictionary already, we can use alternative syntax like this: 121 | ```python 122 | db.execute("delete from peps where id = 20") 123 | db.execute(""" 124 | insert into peps ( 125 | id, title, author, status, type, created, body 126 | ) values ( 127 | :PEP, :Title, :Author, :Status, :Type, :Created, :Body 128 | ) 129 | """, pep) 130 | ``` 131 | To confirm that it was correctly inserted: 132 | ```python 133 | print(db.execute("select * from peps").fetchall()) 134 | ``` 135 | 136 | ## UPDATE and DELETE 137 | 138 | To update a record: 139 | 140 | ```python 141 | with db: 142 | db.execute(""" 143 | update peps set author = ? 144 | where id = ? 145 | """, ["Tim Peters", 20]) 146 | ``` 147 | This will run in a transaction. 148 | 149 | To delete a record: 150 | 151 | ```python 152 | with db: 153 | db.execute(""" 154 | delete from peps 155 | where id = ? 156 | """, [20]) 157 | ``` 158 | Or to delete everything: 159 | ```sql 160 | delete from peps 161 | ``` 162 | ## SQLite column types 163 | 164 | SQLite `create table` is easier than in many other databases, because there are fewer types to worry about. There are four types: 165 | 166 | - `integer` 167 | - `real` 168 | - `text` 169 | - `blob` 170 | 171 | Unlike other databases, length limits are neither required nor enforced - so don't worry about `varchar(255)`, just use `text`.
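As a quick sketch of what that means in practice, a `text` column will happily store a string of any length:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table notes (id integer primary key, body text)")
# No varchar(255) needed - a million characters in a text column is fine
db.execute("insert into notes (body) values (?)", ["x" * 1_000_000])
print(db.execute("select length(body) from notes").fetchone())
# (1000000,)
```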
172 | 173 | Tables automatically get an ID column called `rowid` - an incrementing integer. This will be the primary key if you don't specify one. 174 | 175 | If you specify `integer primary key` it will be auto-incrementing and will actually map to that underlying `rowid`. 176 | 177 | You can set `id text primary key` for a text primary key - this will not increment, you will have to set it to a unique value for each row yourself. You could do this with UUIDs generated using `uuid.uuid4()` for example. 178 | 179 | SQLite is loosely typed by default: you can insert any type into any column, even if it conflicts with the column type! 180 | 181 | A lot of people find this very uncomfortable. 182 | 183 | As of SQLite 3.37.0 (2021-11-27) you can set [strict mode](https://www.sqlite.org/stricttables.html) on a table to opt out of this loose typing: 184 | 185 | ```sql 186 | create table peps ( 187 | id integer primary key, 188 | title text, 189 | author text, 190 | body text 191 | ) strict 192 | ``` 193 | 194 | ## Transactions 195 | 196 | Here's an example of the impact transactions have on file-based databases: 197 | 198 | ```pycon 199 | >>> import sqlite3 200 | >>> db = sqlite3.connect("/tmp/data.db") 201 | >>> db.execute("create table foo (id integer primary key, name text)") 202 | <sqlite3.Cursor object at 0x...> 203 | ``` 204 | In another window: 205 | ``` 206 | % sqlite3 data.db .dump 207 | PRAGMA foreign_keys=OFF; 208 | BEGIN TRANSACTION; 209 | CREATE TABLE foo (id integer primary key, name text); 210 | COMMIT; 211 | ``` 212 | ```pycon 213 | >>> db.execute('insert into foo (name) values (?)', ['text']) 214 | <sqlite3.Cursor object at 0x...> 215 | ``` 216 | In the other window: 217 | ``` 218 | % sqlite3 data.db .dump 219 | PRAGMA foreign_keys=OFF; 220 | BEGIN TRANSACTION; 221 | CREATE TABLE foo (id integer primary key, name text); 222 | COMMIT; 223 | ``` 224 | ```pycon 225 | >>> db.commit() 226 | ``` 227 | And now: 228 | ``` 229 | % sqlite3 data.db .dump 230 | PRAGMA foreign_keys=OFF; 231 | BEGIN TRANSACTION; 232 | CREATE TABLE foo (id integer primary key, name text); 233 | INSERT INTO foo VALUES(1,'text'); 234 | COMMIT; 235 | ``` 236 | A nicer pattern is to do this: 237 | ```pycon 238 | >>> with db: 239 | ...     db.execute('insert into foo (name) values (?)', ['text']) 240 | ``` 241 | The `with db:` wraps everything inside that block in a transaction. 242 | -------------------------------------------------------------------------------- /sqlite-utils.md: -------------------------------------------------------------------------------- 1 | # sqlite-utils 2 | 3 | [sqlite-utils](https://sqlite-utils.datasette.io/) provides "CLI tool and Python utility functions for manipulating SQLite databases". 4 | 5 | You can install it the same way as Datasette: 6 | 7 | pip install sqlite-utils 8 | 9 | Or with `pipx`: 10 | 11 | pipx install sqlite-utils 12 | 13 | Or with Homebrew: 14 | 15 | brew install sqlite-utils 16 | 17 | It works as both a CLI tool and a Python library. 18 | 19 | ## Using the command-line tools to clean data 20 | 21 | We'll follow this tutorial next: **[Cleaning data with sqlite-utils and Datasette](https://datasette.io/tutorials/clean-data)** 22 | 23 | ## Using sqlite-utils as a Python library, to import all the PEPs 24 | 25 | Let's take our PEPs example from earlier and implement it again, but better, using `sqlite-utils`. 26 | 27 | I'll do this in a notebook.
28 | 29 | ``` 30 | !git clone https://github.com/python/peps /tmp/peps 31 | ``` 32 | We now have ALL of the PEPs in `/tmp/peps` 33 | 34 | ```python 35 | import pathlib 36 | 37 | files = list(pathlib.Path("/tmp/peps/peps").glob("pep-*.rst")) 38 | ``` 39 | And parse them with our function from earlier: 40 | ```python 41 | def parse_pep(s): 42 | intro, body = s.split("\n\n", 1) 43 | pep = {} 44 | current_key = None 45 | current_value = None 46 | for line in intro.split("\n"): 47 | # If the line starts with whitespace, it's a continuation of the previous value 48 | if line.startswith(" ") or line.startswith("\t"): 49 | if current_key is not None: 50 | current_value += " " + line.strip() 51 | pep[current_key] = current_value.strip() 52 | else: 53 | # Split the line into key and value 54 | parts = line.split(": ", 1) 55 | if len(parts) == 2: 56 | key, value = parts 57 | # Update the current key and value 58 | current_key = key 59 | current_value = value 60 | # Add the key-value pair to the pep dictionary 61 | pep[current_key] = current_value.strip() 62 | pep["Body"] = body.strip() 63 | return pep 64 | ``` 65 | ```python 66 | peps = [] 67 | for file in files: 68 | peps.append(parse_pep(file.read_text())) 69 | ``` 70 | We now have a list of dictionaries. Let's load them into SQLite: 71 | ``` 72 | %pip install sqlite-utils 73 | ``` 74 | ```python 75 | import sqlite_utils 76 | db = sqlite_utils.Database("/tmp/peps.db") 77 | db["peps"].insert_all(peps, pk="PEP") 78 | ``` 79 | I got this error: 80 | ``` 81 | OperationalError: table peps has no column named PEP-Delegate 82 | ``` 83 | To fix that, add `alter=True` (which adds any missing columns automatically) and `replace=True` (which replaces rows with duplicate primary keys): 84 | ``` 85 | db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True) 86 | print(db["peps"].count) 87 | # Outputs 429 88 | ``` 89 | ## Enabling full-text search 90 | 91 | SQLite has surprisingly good full-text search built in. 92 | 93 | `sqlite-utils` can help you enable it: 94 | 95 | ```python 96 | db["peps"].enable_fts(["Title", "Body"]) 97 | ``` 98 | Datasette will detect this and add a search box to the top of the table page. 99 | 100 | To run searches in relevance order you'll need to execute a custom SQL query: 101 | 102 | ```sql 103 | select 104 | PEP, 105 | peps.Title, 106 | Version, 107 | Author, 108 | Status, 109 | Type, 110 | Created, 111 | peps.Body, 112 | peps_fts.rank 113 | from 114 | peps 115 | join 116 | peps_fts on peps.rowid = peps_fts.rowid 117 | where 118 | peps_fts match :search 119 | order by 120 | peps_fts.rank 121 | limit 122 | 20 123 | ```
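`sqlite-utils` can also run that kind of search for you from Python. Here's a sketch, assuming the `/tmp/peps.db` file built above with `enable_fts()` already applied:

```python
import sqlite_utils

db = sqlite_utils.Database("/tmp/peps.db")
# table.search() queries the peps_fts index, best matches first
for row in db["peps"].search("typing"):
    print(row["PEP"], row["Title"])
```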