├── Makefile
├── README.md
├── advanced-sql.md
├── baked-data.md
├── codespaces-jupyter.gif
├── conf.py
├── datasette-lite.md
├── datasette-publish.md
├── datasette.md
├── further-reading.md
├── index.md
├── requirements.txt
├── setup.md
├── sqlite-intro.md
└── sqlite-utils.md
/Makefile:
--------------------------------------------------------------------------------
1 | # Minimal makefile for Sphinx documentation
2 | #
3 |
4 | # You can set these variables from the command line.
5 | SPHINXOPTS =
6 | SPHINXBUILD = sphinx-build
7 | SPHINXPROJ = sqlite-utils
8 | SOURCEDIR = .
9 | BUILDDIR = _build
10 |
11 | # Put it first so that "make" without argument is like "make help".
12 | help:
13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14 |
15 | .PHONY: help Makefile
16 |
17 | # Catch-all target: route all unknown targets to Sphinx using the new
18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19 | %: Makefile
20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 |
22 | livehtml:
23 | 	sphinx-autobuild -b html "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
24 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data analysis with SQLite and Python
2 |
3 | A three hour tutorial presented at PyCon 2023.
4 |
5 | Now available as **[an official Datasette tutorial](https://datasette.io/tutorials/data-analysis)**.
6 |
7 | 2hr45m YouTube video of this tutorial: https://www.youtube.com/watch?v=5TdIxxBPUSI
8 |
9 | See also [Data analysis with SQLite and Python for PyCon 2023](https://simonwillison.net/2023/Apr/20/pycon-2023/) on my blog.
10 |
--------------------------------------------------------------------------------
/advanced-sql.md:
--------------------------------------------------------------------------------
1 | # Advanced SQL
2 |
3 | ## Aggregations
4 |
5 | The simplest form of aggregation is the one Datasette does to implement its own faceting feature.
6 |
7 | ```sql
8 | select
9 | party,
10 | count(*)
11 | from
12 | executive_terms
13 | where
14 | type = 'prez'
15 | group by
16 | party
17 | ```
18 | [Try that query here](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++party%2C%0D%0A++count%28*%29%0D%0Afrom%0D%0A++executive_terms%0D%0Awhere%0D%0A++type+%3D+%27prez%27%0D%0Agroup+by%0D%0A++party).
19 |
20 | The `group by` creates groups of rows, then the aggregation functions calculate a value across that entire group.
21 |
22 | The most common aggregation functions are:
23 |
24 | - `count(*)` - count the number of rows in each group
25 | - `max(column)` - maximum value for a column
26 | - `min(column)` - minimum value for a column
27 | - `sum(column)` - sum up the values in the column
28 |
29 | Here's an example of `sum()` and `count()` from [What's in the RedPajama-Data-1T LLM training set](https://simonwillison.net/2023/Apr/17/redpajama-data/):
30 | ```sql
31 | select
32 | top_folders,
33 | sum(size_gb) as total_gb,
34 | count(*) as num_files
35 | from raw
36 | group by top_folders
37 | order by sum(size_gb) desc
38 | ```
39 | [Run that in Datasette Lite](https://lite.datasette.io/?install=datasette-copyable&json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc).
40 |
41 | Change the `total_gb` line to this to truncate it to an integer:
42 | ```sql
43 | cast (sum(size_gb) as integer) as total_gb,
44 | ```
45 | ## Subqueries
46 |
47 | SQLite has excellent support for subqueries. You can use them in `where X in` clauses:
48 |
49 | ```sql
50 | select html_url from releases where repo in (
51 | select id from repos where full_name in (
52 | select repo from plugin_repos
53 | )
54 | )
55 | order by created_at desc
56 | ```
57 | [Run that on datasette.io](https://datasette.io/content?sql=select+html_url+from+releases+where+repo+in+%28%0D%0A++select+id+from+repos+where+full_name+in+%28%0D%0A++++select+repo+from+plugin_repos%0D%0A++%29%0D%0A%29%0D%0Aorder+by+created_at+desc). Sometimes I find these to be more readable than joins!
58 |
59 | You can also use them directly in `select` clauses:
60 |
61 | ```sql
62 | select
63 | full_name,
64 | (
65 | select
66 | html_url
67 | from
68 | releases
69 | where
70 | releases.repo = repos.id
71 | order by
72 | created_at desc
73 | limit
74 | 1
75 | ) as latest_release
76 | from
77 | repos
78 | ```
79 | [Run that here](https://datasette.io/content?sql=select+full_name%2C+%28select+html_url+from+releases+where+releases.repo+%3D+repos.id+order+by+created_at+desc+limit+1%29+as+latest_release+from+repos).
80 |
81 | ## CTEs
82 |
83 | CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
84 |
85 | ```sql
86 | with presidents as (
87 | select
88 | executives.name
89 | from
90 | executive_terms
91 | join executives
92 | on executive_terms.executive_id = executives.id
93 | where
94 | executive_terms.type = 'prez'
95 | ),
96 | vice_presidents as (
97 | select
98 | executives.name
99 | from
100 | executive_terms
101 | join executives
102 | on executive_terms.executive_id = executives.id
103 | where
104 | executive_terms.type = 'viceprez'
105 | )
106 | select
107 | distinct name
108 | from
109 | presidents
110 | where name in vice_presidents
111 | ```
112 | [Try this CTE query here](https://congress-legislators.datasettes.com/legislators?sql=with+presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27prez%27%0D%0A%29%2C%0D%0Avice_presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27viceprez%27%0D%0A%29%0D%0Aselect%0D%0A++distinct+name%0D%0Afrom%0D%0A++presidents%0D%0Awhere%0D%0A++name+in+vice_presidents).
113 |
114 | ## JSON
115 |
116 | SQLite has excellent JSON functionality built in. Store JSON in a `text` column and you can query it using `json_extract()` - you can also build JSON values in `select` queries.
117 |
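Here's an illustrative `json_extract()` query against a hypothetical `profiles` table with a `tags` text column holding JSON arrays - a sketch, not one of the demo databases:

```sql
select
  id,
  json_extract(tags, '$[0]') as first_tag,
  json_array_length(tags) as num_tags
from
  profiles
where
  json_extract(tags, '$[0]') = 'datasette'
```
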
118 | [Returning related rows in a single SQL query using JSON](https://til.simonwillison.net/sqlite/related-rows-single-query) shows some advanced tricks you can do with this.
119 |
120 | ```sql
121 | select
122 | legislators.id,
123 | legislators.name,
124 | json_group_array(json_object(
125 | 'type', legislator_terms.type,
126 | 'state', legislator_terms.state,
127 | 'start', legislator_terms.start,
128 | 'end', legislator_terms.end,
129 | 'party', legislator_terms.party
130 | )) as terms,
131 | count(*) as num_terms
132 | from
133 | legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
134 | group by legislators.id
135 | order by
136 | id
137 | limit
138 | 10
139 | ```
140 | [Run that query](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++legislators.id%2C%0D%0A++legislators.name%2C%0D%0A++json_group_array(json_object(%0D%0A++++%27type%27%2C+legislator_terms.type%2C%0D%0A++++%27state%27%2C+legislator_terms.state%2C%0D%0A++++%27start%27%2C+legislator_terms.start%2C%0D%0A++++%27end%27%2C+legislator_terms.end%2C%0D%0A++++%27party%27%2C+legislator_terms.party%0D%0A+++))+as+terms%2C%0D%0A+++count(*)+as+num_terms%0D%0Afrom%0D%0A++legislators+join+legislator_terms+on+legislator_terms.legislator_id+%3D+legislators.id%0D%0A++group+by+legislators.id%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++10).
141 |
142 | Paul Ford [said about SQLite's JSON support](https://simonwillison.net/2018/Jan/29/paul-ford/):
143 |
144 | > The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”
145 |
146 | ## Window functions
147 |
148 | I wanted to run a query that would return the following:
149 |
150 | - The repository name
151 | - The date of the most recent release from that repository (the releases table is a many-to-one against repos)
152 | - The total number of releases
153 | - **The three most recent releases** (as a JSON array of objects)
154 |
155 |
156 | ```sql
157 | with cte as (
158 | select
159 | repos.full_name,
160 | releases.created_at,
161 | releases.id as rel_id,
162 | releases.name as rel_name,
163 | releases.created_at as rel_created_at,
164 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
165 | from repos
166 | left join releases on releases.repo = repos.id
167 | )
168 | select
169 | full_name,
170 | max(created_at) as max_created_at,
171 | count(rel_id) as releases_count,
172 | json_group_array(
173 | json_object(
174 | 'id', rel_id,
175 | 'name', rel_name,
176 | 'created_at', rel_created_at
177 | )
178 | ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
179 | from cte
180 | group by full_name
181 | order by releases_count desc
182 | ```
183 | [Run that query here](https://datasette.io/content?sql=with+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max%28created_at%29+as+max_created_at%2C%0D%0A++count%28rel_id%29+as+releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+filter+%28where+rel_id+is+not+null+and+rel_rank+%3C%3D+3%29+as+recent_releases%0D%0Afrom+cte%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc).
184 |
185 | [Running this smaller query](https://datasette.io/content?sql=with+cte+as+(%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank()+over+(partition+by+repos.id+order+by+releases.created_at+desc)+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A)%0D%0Aselect+*+from+cte) helps show what's going on with that `rel_rank` column:
186 |
187 | ```sql
188 | with cte as (
189 | select
190 | repos.full_name,
191 | releases.created_at,
192 | releases.id as rel_id,
193 | releases.name as rel_name,
194 | releases.created_at as rel_created_at,
195 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
196 | from repos
197 | left join releases on releases.repo = repos.id
198 | )
199 | select * from cte
200 | ```
201 |
--------------------------------------------------------------------------------
/baked-data.md:
--------------------------------------------------------------------------------
1 | # Baked Data
2 |
3 | [The Baked Data architectural pattern](https://simonwillison.net/2021/Jul/28/baked-data/) describes this approach, which is key to taking full advantage of SQLite and Datasette.
4 |
5 | I like to build my databases in GitHub Actions.
6 |
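Here's a minimal sketch of what such a workflow can look like - the file names and build script are hypothetical, not taken from one of my real repositories:

```yaml
name: Build and deploy database

on:
  push:
  schedule:
    - cron: "0 0 * * *"  # also rebuild once a day

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      # Build the SQLite database from content stored in the repository
      - run: |
          pip install sqlite-utils
          python build_database.py
      # Then deploy it, for example with one of the datasette publish
      # commands covered in the publishing chapter
```
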
7 | ## Niche Museums and TILs
8 |
9 | - [Niche Museums](https://www.niche-museums.com/) is published from a GitHub repository - one big YAML file for the content.
10 | - [TILs](https://til.simonwillison.net/) are published the same way - separate Markdown files for each item.
11 |
12 | Both of these sites have Atom feeds that are defined using a Datasette [canned query](https://docs.datasette.io/en/stable/sql_queries.html#canned-queries), in conjunction with the [datasette-atom](https://datasette.io/plugins/datasette-atom) plugin.
13 |
17 | ## Generating a newsletter with an Observable notebook
18 |
19 | I wrote about this in [Semi-automating a Substack newsletter with an Observable notebook](https://simonwillison.net/2023/Apr/4/substack-observable/):
20 |
21 | - [datasette.simonwillison.net](https://datasette.simonwillison.net/) is a Datasette/SQLite copy of my Django blog, created using [db-to-sqlite](https://datasette.io/tools/db-to-sqlite) by my GitHub repository.
22 | - [blog-to-newsletter](https://observablehq.com/@simonw/blog-to-newsletter) is my Observable notebook that assembles a newsletter from that data.
23 | - [simonw.substack.com](https://simonw.substack.com/) is the Substack newsletter that I copy that content into.
24 |
--------------------------------------------------------------------------------
/codespaces-jupyter.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonw/sqlite-tutorial-pycon-2023/2037cca22c6e3a3d87e35796d73e912eef9d12b0/codespaces-jupyter.gif
--------------------------------------------------------------------------------
/conf.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 |
4 | from subprocess import PIPE, Popen
5 |
6 | # This file is execfile()d with the current directory set to its
7 | # containing dir.
8 | #
9 | # Note that not all possible configuration values are present in this
10 | # autogenerated file.
11 | #
12 | # All configuration values have a default; values that are commented out
13 | # serve to show the default.
14 |
15 | # If extensions (or modules to document with autodoc) are in another directory,
16 | # add these directories to sys.path here. If the directory is relative to the
17 | # documentation root, use os.path.abspath to make it absolute, like shown here.
18 | #
19 | # import os
20 | # import sys
21 | # sys.path.insert(0, os.path.abspath('.'))
22 |
23 |
24 | # -- General configuration ------------------------------------------------
25 |
26 | # If your documentation needs a minimal Sphinx version, state it here.
27 | #
28 | # needs_sphinx = '1.0'
29 |
30 | # Add any Sphinx extension module names here, as strings. They can be
31 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
32 | # ones.
33 | extensions = ["myst_parser"]
34 |
35 | # Add any paths that contain templates here, relative to this directory.
36 | templates_path = ["_templates"]
37 |
38 | # The suffix(es) of source filenames.
39 | # You can specify multiple suffix as a list of string:
40 | #
41 | # source_suffix = ['.rst', '.md']
42 | source_suffix = ".rst"
43 |
44 | # The master toctree document.
45 | master_doc = "index"
46 |
47 | # General information about the project.
48 | project = "Data analysis with SQLite and Python, PyCon 2023"
49 | copyright = "2023, Simon Willison"
50 | author = "Simon Willison"
51 |
52 | # The version info for the project you're documenting, acts as replacement for
53 | # |version| and |release|, also used in various other places throughout the
54 | # built documents.
55 | #
56 | # The short X.Y version.
57 | pipe = Popen("git describe --tags --always", stdout=PIPE, shell=True)
58 | git_version = pipe.stdout.read().decode("utf8")
59 |
60 | if git_version:
61 | version = git_version.rsplit("-", 1)[0]
62 | release = git_version
63 | else:
64 | version = ""
65 | release = ""
66 |
67 | # The language for content autogenerated by Sphinx. Refer to documentation
68 | # for a list of supported languages.
69 | #
70 | # This is also used if you do content translation via gettext catalogs.
71 | # Usually you set "language" from the command line for these cases.
72 | language = None
73 |
74 | # List of patterns, relative to source directory, that match files and
75 | # directories to ignore when looking for source files.
76 | # This patterns also effect to html_static_path and html_extra_path
77 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
78 |
79 | # The name of the Pygments (syntax highlighting) style to use.
80 | pygments_style = "sphinx"
81 |
82 | # If true, `todo` and `todoList` produce output, else they produce nothing.
83 | todo_include_todos = False
84 |
85 |
86 | # -- Options for HTML output ----------------------------------------------
87 |
88 | # The theme to use for HTML and HTML Help pages. See the documentation for
89 | # a list of builtin themes.
90 | #
91 | html_theme = "sphinx_rtd_theme"
92 |
93 | # Theme options are theme-specific and customize the look and feel of a theme
94 | # further. For a list of options available for each theme, see the
95 | # documentation.
96 | #
97 | # html_theme_options = {}
98 |
99 | # Add any paths that contain custom static files (such as style sheets) here,
100 | # relative to this directory. They are copied after the builtin static files,
101 | # so a file named "default.css" will overwrite the builtin "default.css".
102 | html_static_path = ["_static"]
103 |
104 | # Custom sidebar templates, must be a dictionary that maps document names
105 | # to template names.
106 | #
107 | # This is required for the alabaster theme
108 | # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
109 | html_sidebars = {
110 | "**": [
111 | "relations.html", # needs 'show_related': True theme option to display
112 | "searchbox.html",
113 | ]
114 | }
115 |
116 |
117 | # -- Options for HTMLHelp output ------------------------------------------
118 |
119 | # Output file base name for HTML help builder.
120 | htmlhelp_basename = "sqlite-tutorial-pycon-2023-doc"
121 |
122 |
123 | # -- Options for LaTeX output ---------------------------------------------
124 |
125 | latex_elements = {
126 | # The paper size ('letterpaper' or 'a4paper').
127 | #
128 | # 'papersize': 'letterpaper',
129 | # The font size ('10pt', '11pt' or '12pt').
130 | #
131 | # 'pointsize': '10pt',
132 | # Additional stuff for the LaTeX preamble.
133 | #
134 | # 'preamble': '',
135 | # Latex figure (float) alignment
136 | #
137 | # 'figure_align': 'htbp',
138 | }
139 |
140 | # Grouping the document tree into LaTeX files. List of tuples
141 | # (source start file, target name, title,
142 | # author, documentclass [howto, manual, or own class]).
143 | latex_documents = [
144 | (
145 | master_doc,
146 | "sqlite-tutorial-pycon-2023.tex",
147 | "SQLite Tutorial PyCon 2023",
148 | "Simon Willison",
149 | "manual",
150 | )
151 | ]
152 |
153 |
154 | # -- Options for manual page output ---------------------------------------
155 |
156 | # One entry per manual page. List of tuples
157 | # (source start file, name, description, authors, manual section).
158 | man_pages = [
159 | (
160 | master_doc,
161 | "sqlite-tutorial-pycon-2023",
162 | "SQLite Tutorial PyCon 2023",
163 | [author],
164 | 1,
165 | )
166 | ]
167 |
168 |
169 | # -- Options for Texinfo output -------------------------------------------
170 |
171 | # Grouping the document tree into Texinfo files. List of tuples
172 | # (source start file, target name, title, author,
173 | # dir menu entry, description, category)
174 | texinfo_documents = [
175 | (
176 | master_doc,
177 | "sqlite-tutorial-pycon-2023",
178 | "SQLite Tutorial PyCon 2023",
179 | author,
180 | "sqlite-tutorial-pycon-2023",
181 | "A tutorial at PyCon 2023",
182 | "Miscellaneous",
183 | )
184 | ]
185 |
--------------------------------------------------------------------------------
/datasette-lite.md:
--------------------------------------------------------------------------------
1 | # Datasette Lite
2 |
3 | It's Datasette... running entirely in your web browser with WebAssembly and Pyodide!
4 |
5 | https://lite.datasette.io/
6 |
7 | ## Loading SQLite, CSV and JSON data
8 |
9 | - SQLite: pass `?url=` pointing at a SQLite database file - see [Measuring traffic during the Half Moon Bay Pumpkin Festival](https://simonwillison.net/2022/Oct/19/measuring-traffic/)
10 | - CSV: pass `?csv=` pointing at a CSV file
11 | - JSON: pass `?json=` pointing at a JSON file
12 |
13 | ## Installing plugins
14 |
15 | Add `?install=name-of-plugin` to the URL to `pip install` that plugin into your browser's environment!
16 |
17 | This only works with a subset of plugins.
18 |
19 | - Example: https://lite.datasette.io/?install=datasette-copyable
20 |
21 | ## Further reading
22 |
23 | - [Datasette Lite: a server-side Python web application running in a browser](https://simonwillison.net/2022/May/4/datasette-lite/)
24 | - [Plugin support for Datasette Lite](https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/)
25 | - [Joining CSV files in your browser using Datasette Lite](https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/)
--------------------------------------------------------------------------------
/datasette-publish.md:
--------------------------------------------------------------------------------
1 | # Publishing a database to Vercel
2 |
3 | First, install both Vercel and the `datasette-publish-vercel` plugin.
4 |
5 | The [Vercel CLI documentation](https://vercel.com/docs/cli) covers installing the Vercel CLI.
6 |
7 | On macOS:
8 |
9 | brew install vercel-cli
10 |
11 | Or use one of these:
12 |
13 | npm i -g vercel
14 |
15 | Or:
16 |
17 | pnpm i -g vercel
18 |
19 | Now run this command to login:
20 |
21 | vercel login
22 |
23 | Install the plugin:
24 |
25 | datasette install datasette-publish-vercel
26 |
27 | And deploy the database:
28 |
29 | datasette publish vercel /tmp/peps.db --project python-peps
30 |
31 | ## Other publishing options
32 |
33 | Datasette [can publish](https://docs.datasette.io/en/stable/publish.html) to the following providers:
34 |
35 | - Heroku (`datasette publish heroku`)
36 | - Google Cloud Run (`datasette publish cloudrun`)
37 | - Vercel (with [datasette-publish-vercel](https://datasette.io/plugins/datasette-publish-vercel))
38 | - Fly (with [datasette-publish-fly](https://datasette.io/plugins/datasette-publish-fly))
39 |
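For example, deploying the same PEPs database to Google Cloud Run would look like this - a sketch that assumes the `gcloud` CLI is installed and authenticated:

    datasette publish cloudrun /tmp/peps.db --service python-peps
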
40 | Further deployment options are described [in the documentation](https://docs.datasette.io/en/stable/deploying.html).
41 |
--------------------------------------------------------------------------------
/datasette.md:
--------------------------------------------------------------------------------
1 | # Exploring data with Datasette
2 |
3 | [Datasette](https://datasette.io/) is "an open source multi-tool for exploring and publishing data".
4 |
5 | ## Installing Datasette locally
6 |
7 | ```
8 | pip install datasette
9 | ```
10 | Or if you prefer `pipx`:
11 | ```
12 | pipx install datasette
13 | ```
14 | Or Homebrew (on macOS):
15 | ```
16 | brew install datasette
17 | ```
18 | [More installations options](https://docs.datasette.io/en/stable/installation.html).
19 |
20 | In Codespaces you should also install the `datasette-codespaces` plugin:
21 |
22 | datasette install datasette-codespaces
23 |
24 | ## Try a database: legislators.db
25 |
26 | ```
27 | wget https://congress-legislators.datasettes.com/legislators.db
28 | ```
29 |
30 | This is a database of US legislators, presidents and vice presidents.
31 |
32 | You can explore it online at https://congress-legislators.datasettes.com/legislators
33 |
34 | Open it in Datasette like this:
35 |
36 | datasette legislators.db
37 |
38 | We'll follow this tutorial to explore Datasette's features: **[Exploring a database with Datasette](https://datasette.io/tutorials/explore)**
39 |
40 | ## Install some plugins
41 |
42 | Datasette has over a hundred plugins: https://datasette.io/plugins
43 |
44 | You can `pip install` them, but it's better to use `datasette install`, as that ensures they will go into the correct virtual environment - especially useful if you used `pipx` or Homebrew to install Datasette itself.
45 |
46 | datasette install datasette-cluster-map
47 |
48 | Now restart Datasette and visit the "offices" table to see the result.
49 |
50 | You can review what plugins are installed with:
51 |
52 | datasette plugins
53 |
54 | Or by visiting the `/-/plugins` page in Datasette.
55 |
56 | Plugins can be uninstalled with:
57 |
58 | datasette uninstall datasette-cluster-map
59 |
60 | ## Learning SQL with Datasette
61 |
62 | The "✎ View and edit SQL" link is a quick way to start learning basic SQL queries.
63 |
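For example, here's a query to try against `legislators.db`, counting terms served per state:

```
select state, count(*) as num_terms
from legislator_terms
group by state
order by num_terms desc
```
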
64 | We'll follow this tutorial next: **[Learn SQL with Datasette](https://datasette.io/tutorials/learn-sql)**
65 |
66 |
67 |
--------------------------------------------------------------------------------
/further-reading.md:
--------------------------------------------------------------------------------
1 | # More demos and further reading
2 |
3 | ## Fun demos
4 |
5 | I post a lot of Datasette projects on my blog. Some of my recent favourites:
6 |
7 | - [Exploring the training data behind Stable Diffusion](https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/) describes how I built [laion-aesthetic.datasette.io](https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images) to explore 12m training images used for Stable Diffusion.
8 | - [Analyzing ScotRail audio announcements with Datasette—from prototype to production](https://simonwillison.net/2022/Aug/21/scotrail/) explains the [scotrail.datasette.io](https://scotrail.datasette.io/) project.
9 | - [I built a ChatGPT plugin to answer questions about data hosted in Datasette](https://simonwillison.net/2023/Mar/24/datasette-chatgpt-plugin/) is one of my earlier explorations at the intersection of LLM AI and Datasette.
10 |
11 | ## SpatiaLite
12 |
13 | - [Building a location to time zone API with SpatiaLite](https://datasette.io/tutorials/spatialite) shows how to use SpatiaLite and Datasette to create an API for looking up time zones based on latitude/longitude points.
14 | - [Drawing shapes on a map to query a SpatiaLite database](https://simonwillison.net/2021/Jan/24/drawing-shapes-spatialite/) introduces the `datasette-leaflet-freedraw` plugin and introduces the [calands.datasettes.com](https://calands.datasettes.com/) demo.
15 | - [Joining CSV and JSON data with an in-memory SQLite database](https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/)
--------------------------------------------------------------------------------
/index.md:
--------------------------------------------------------------------------------
1 | # Data analysis with SQLite and Python, PyCon 2023
2 |
3 | On YouTube: [Data analysis with SQLite and Python - 2hr45m](https://www.youtube.com/watch?v=5TdIxxBPUSI)
4 |
5 | ```{toctree}
6 | ---
7 | maxdepth: 3
8 | ---
9 | setup
10 | sqlite-intro
11 | datasette
12 | sqlite-utils
13 | datasette-publish
14 | datasette-lite
15 | advanced-sql
16 | baked-data
17 | further-reading
18 | ```
19 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | sphinx_rtd_theme
2 | sphinx-autobuild
3 | myst-parser
4 |
--------------------------------------------------------------------------------
/setup.md:
--------------------------------------------------------------------------------
1 | # What you'll need
2 |
3 | ## python3 and pip
4 |
5 | For the first part of this tutorial, you'll need a Python 3 interpreter with the `sqlite3` standard library module available.
6 |
7 | You can run this on your own computer, or use a browser-based environment.
8 |
9 | A purely browser-based (WebAssembly) environment - JupyterLite, for example - will also work.
10 |
11 | For the second part, you'll also need the ability to `pip install` Python packages.
12 |
13 | Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use [GitHub Codespaces](https://github.com/github/codespaces-jupyter) or [Google Colab](https://colab.research.google.com/) or [Jupyter Hub](https://jupyter.org/try) or your online notebook solution of choice.
14 |
15 | ## Optional: GitHub Codespaces
16 |
17 | I'll be working through the tutorial using GitHub Codespaces, using [github/codespaces-jupyter](https://github.com/github/codespaces-jupyter).
18 |
19 | 
--------------------------------------------------------------------------------
/sqlite-intro.md:
--------------------------------------------------------------------------------
1 | # Introduction to SQLite
2 |
3 | ## Why SQLite?
4 |
5 | - It's the database you already have - `sqlite3` has been built into Python since 2006
6 | - It's screamingly fast, and surprisingly powerful
7 | - Amazing compatibility: bindings for every language, files work on every platform, fantastic track record for backwards compatibility, so it's safe to trust your data to a SQLite file
8 | - Databases are just files on disk. You can create and discard them without any ceremony.
9 | - It handles text (including JSON), integers, floating point numbers, and binary blobs. Which means it can store _anything_.
10 | - It can handle databases up to around 281TB in size(!)
11 | - It has some interesting characteristics, for example [Many Small Queries Are Efficient In SQLite](https://www.sqlite.org/np1queryprob.html)
12 |
13 | ## First steps with Python
14 |
15 | Let's download a database to play with - we'll use the database that powers the [datasette.io](https://datasette.io/) website:
16 |
17 | ```
18 | wget https://datasette.io/content.db
19 | ```
20 | To access it from Python:
21 | ```python
22 | import sqlite3
23 |
24 | db = sqlite3.connect("content.db")
25 |
26 | print(db.execute("select sqlite_version()").fetchall())
27 | # [('3.39.0',)]
28 |
29 | # Show rows from the plugin_repos table
30 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
31 | print(row)
32 |
33 | # Each row is a tuple. We can change that like this:
34 | db.row_factory = sqlite3.Row
35 |
36 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
37 | print(row)
38 |
39 | # This outputs <sqlite3.Row object at 0x...> for each row
40 | # We can use dict() to turn those into dictionaries instead
41 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
42 | print(dict(row))
43 |
44 | ```
45 |
46 | ## Creating a table
47 |
48 | Let's create a table:
49 | ```python
50 | db.execute("""
51 | create table peps (
52 | id integer primary key,
53 | title text,
54 | author text,
55 | status text,
56 | type text,
57 | created text,
58 | body text
59 | );
60 | """)
61 | ```
62 |
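To confirm the table was created, you can look it up in the `sqlite_master` system table:

```python
print(db.execute(
    "select name from sqlite_master where type = 'table' and name = 'peps'"
).fetchall())
# [('peps',)]
```
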
63 | ## Inserting some data
64 |
65 | Here's a function I wrote that can parse a PEP:
66 |
67 | ```python
68 | def parse_pep(s):
69 | intro, body = s.split("\n\n", 1)
70 | pep = {}
71 | current_key = None
72 | current_value = None
73 | for line in intro.split("\n"):
74 | # If the line starts with whitespace, it's a continuation of the previous value
75 | if line.startswith(" ") or line.startswith("\t"):
76 | if current_key is not None:
77 | current_value += " " + line.strip()
78 | pep[current_key] = current_value.strip()
79 | else:
80 | # Split the line into key and value
81 | parts = line.split(": ", 1)
82 | if len(parts) == 2:
83 | key, value = parts
84 | # Update the current key and value
85 | current_key = key
86 | current_value = value
87 | # Add the key-value pair to the pep dictionary
88 | pep[current_key] = current_value.strip()
89 | pep["Body"] = body.strip()
90 | return pep
91 | ```
92 | Let's fetch and parse the Zen of Python:
93 | ```python
94 | import urllib.request
95 |
96 | zen = urllib.request.urlopen(
97 | "https://raw.githubusercontent.com/python/peps/main/peps/pep-0020.rst"
98 | ).read().decode("utf-8")
99 |
100 | pep = parse_pep(zen)
101 | ```
102 | And insert that into our database:
103 | ```python
104 | db.execute("""
105 | insert into peps (
106 | id, title, author, status, type, created, body
107 | ) values (
108 | ?, ?, ?, ?, ?, ?, ?
109 | )
110 | """, (
111 | pep["PEP"],
112 | pep["Title"],
113 | pep["Author"],
114 | pep["Status"],
115 | pep["Type"],
116 | pep["Created"],
117 | pep["Body"],
118 | ))
119 | ```
120 | Since this is a dictionary already, we can use alternative syntax like this:
121 | ```python
122 | db.execute("delete from peps where id = 20")
123 | db.execute("""
124 | insert into peps (
125 | id, title, author, status, type, created, body
126 | ) values (
127 | :PEP, :Title, :Author, :Status, :Type, :Created, :Body
128 | )
129 | """, pep)
130 | ```
131 | To confirm that it was correctly inserted:
132 | ```python
133 | print(db.execute("select * from peps").fetchall())
134 | ```
135 |
136 | ## UPDATE and DELETE
137 |
138 | To update a record:
139 |
140 | ```python
141 | with db:
142 | db.execute("""
143 | update peps set author = ?
144 | where id = ?
145 | """, ["Tim Peters", 20])
146 | ```
147 | This will run in a transaction.
148 |
149 | To delete a record:
150 |
151 | ```python
152 | with db:
153 | db.execute("""
154 | delete from peps
155 | where id = ?
156 | """, [20])
157 | ```
158 | Or to delete everything:
159 | ```sql
160 | delete from peps
161 | ```
162 | ## SQLite column types
163 |
164 | SQLite's `create table` is easier than in many other databases, because there are fewer types to worry about. There are just four:
165 |
166 | - `integer`
167 | - `real`
168 | - `text`
169 | - `blob`
170 |
171 | Unlike other databases, length limits are neither required nor enforced - so don't worry about `varchar(255)`, just use `text`.
172 |
173 | Tables automatically get an ID column called `rowid` - an incrementing integer. This will be the primary key if you don't specify one.
174 |
175 | If you specify `integer primary key` it will be auto-incrementing and will actually map to that underlying `rowid`.
176 |
177 | You can set `id text primary key` for a text primary key - this will not increment, you will have to set it to a unique value for each row yourself. You could do this with UUIDs generated using `uuid.uuid4()` for example.
178 |
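Here's a minimal sketch of that pattern - the `documents` table is purely illustrative:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("create table documents (id text primary key, title text)")

# We have to generate a unique ID for each row ourselves
db.execute(
    "insert into documents (id, title) values (?, ?)",
    (str(uuid.uuid4()), "First document"),
)
print(db.execute("select * from documents").fetchall())
```
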
179 | SQLite is loosely typed by default: you can insert any type into any column, even if it conflicts with the column type!
180 |
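A quick demonstration of that:

```pycon
>>> import sqlite3
>>> db = sqlite3.connect(":memory:")
>>> db.execute("create table t (value integer)")
<sqlite3.Cursor object at 0x...>
>>> db.execute("insert into t values ('dog')")  # a string, in an integer column!
<sqlite3.Cursor object at 0x...>
>>> db.execute("select value from t").fetchall()
[('dog',)]
```
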
181 | A lot of people find this very uncomfortable.
182 |
183 | As of SQLite 3.37.0 (2021-11-27) you can set [strict mode](https://www.sqlite.org/stricttables.html) on a table to opt out of this loose typing:
184 |
185 | ```sql
186 | create table peps (
187 | id integer primary key,
188 | title text,
189 | author text,
190 | body text
191 | ) strict
192 | ```
193 |
194 | ## Transactions
195 |
196 | Here's an example of the impact transactions have on file-based databases:
197 |
198 | ```pycon
199 | >>> import sqlite3
200 | >>> db = sqlite3.connect("/tmp/data.db")
201 | >>> db.execute("create table foo (id integer primary key, name text)")
202 | <sqlite3.Cursor object at 0x...>
203 | ```
204 | In another window:
205 | ```
206 | % sqlite3 data.db .dump
207 | PRAGMA foreign_keys=OFF;
208 | BEGIN TRANSACTION;
209 | CREATE TABLE foo (id integer primary key, name text);
210 | COMMIT;
211 | ```
212 | ```pycon
213 | >>> db.execute('insert into foo (name) values (?)', ['text'])
214 | <sqlite3.Cursor object at 0x...>
215 | ```
216 | In the other window:
217 | ```
218 | % sqlite3 data.db .dump
219 | PRAGMA foreign_keys=OFF;
220 | BEGIN TRANSACTION;
221 | CREATE TABLE foo (id integer primary key, name text);
222 | COMMIT;
223 | ```
224 | ```pycon
225 | >>> db.commit()
226 | ```
227 | And now:
228 | ```
229 | % sqlite3 data.db .dump
230 | PRAGMA foreign_keys=OFF;
231 | BEGIN TRANSACTION;
232 | CREATE TABLE foo (id integer primary key, name text);
233 | INSERT INTO foo VALUES(1,'text');
234 | COMMIT;
235 | ```
236 | A nicer pattern is to do this:
237 | ```pycon
238 | >>> with db:
239 | ... db.execute('insert into foo (name) values (?)', ['text'])
240 | ```
241 | The `with db:` wraps everything inside that block in a transaction - it commits when the block exits cleanly, and rolls back if an exception is raised.
242 |
--------------------------------------------------------------------------------
/sqlite-utils.md:
--------------------------------------------------------------------------------
1 | # sqlite-utils
2 |
3 | [sqlite-utils](https://sqlite-utils.datasette.io/) provides "CLI tool and Python utility functions for manipulating SQLite databases".
4 |
5 | You can install it the same way as Datasette:
6 |
7 | pip install sqlite-utils
8 |
9 | Or with `pipx`:
10 |
11 | pipx install sqlite-utils
12 |
13 | Or with Homebrew:
14 |
15 | brew install sqlite-utils
16 |
17 | It works as both a CLI tool and a Python library.
18 |
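For example, the CLI can load a CSV file into a new database table in one step - the file and table names here are hypothetical:

    sqlite-utils insert data.db mytable data.csv --csv
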
19 | ## Using the command-line tools to clean data
20 |
21 | We'll follow this tutorial next: **[Cleaning data with sqlite-utils and Datasette](https://datasette.io/tutorials/clean-data)**
22 |
23 | ## Using sqlite-utils as a Python library, to import all the PEPs
24 |
25 | Let's take our PEPs example from earlier and implement it again, but better, using `sqlite-utils`.
26 |
27 | I'll do this in a notebook.
28 |
29 | ```
30 | !git clone https://github.com/python/peps /tmp/peps
31 | ```
32 | We now have ALL of the PEPs in `/tmp/peps`.
33 |
34 | ```python
35 | import pathlib
36 |
37 | files = list(pathlib.Path("/tmp/peps/peps").glob("pep-*.rst"))
38 | ```
39 | And parse them with our function from earlier:
40 | ```python
41 | def parse_pep(s):
42 | intro, body = s.split("\n\n", 1)
43 | pep = {}
44 | current_key = None
45 | current_value = None
46 | for line in intro.split("\n"):
47 | # If the line starts with whitespace, it's a continuation of the previous value
48 | if line.startswith(" ") or line.startswith("\t"):
49 | if current_key is not None:
50 | current_value += " " + line.strip()
51 | pep[current_key] = current_value.strip()
52 | else:
53 | # Split the line into key and value
54 | parts = line.split(": ", 1)
55 | if len(parts) == 2:
56 | key, value = parts
57 | # Update the current key and value
58 | current_key = key
59 | current_value = value
60 | # Add the key-value pair to the pep dictionary
61 | pep[current_key] = current_value.strip()
62 | pep["Body"] = body.strip()
63 | return pep
64 | ```
65 | ```python
66 | peps = []
67 | for file in files:
68 | peps.append(parse_pep(file.read_text()))
69 | ```
70 | We now have a list of dictionaries. Let's load them into SQLite:
71 | ```
72 | %pip install sqlite-utils
73 | ```
74 | ```python
75 | import sqlite_utils
76 | db = sqlite_utils.Database("/tmp/peps.db")
77 | db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
78 | ```
79 | I got this error:
80 | ```
81 | OperationalError: table peps has no column named PEP-Delegate
82 | ```
83 | To fix that, add `alter=True` - which automatically adds any missing columns - along with `replace=True` to replace rows with matching primary keys:
84 | ```
85 | db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
86 | print(db["peps"].count)
87 | # Outputs 429
88 | ```
89 | ## Enabling full-text search
90 |
91 | SQLite has surprisingly good full-text search built in.
92 |
93 | `sqlite-utils` can help you enable it:
94 |
95 | ```python
96 | db["peps"].enable_fts(["Title", "Body"])
97 | ```
98 | Datasette will detect this and add a search box to the top of the table page.
99 |
100 | To run searches in relevance order you'll need to execute a custom SQL query:
101 |
102 | ```sql
103 | select
104 | PEP,
105 | peps.Title,
106 | Version,
107 | Author,
108 | Status,
109 | Type,
110 | Created,
111 | peps.Body,
112 | peps_fts.rank
113 | from
114 | peps
115 | join
116 | peps_fts on peps.rowid = peps_fts.rowid
117 | where
118 | peps_fts match :search
119 | order by
120 | peps_fts.rank
121 | limit
122 | 20
123 | ```
124 |
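The `sqlite-utils` Python API also has a `table.search()` method that handles the relevance ordering for you - a minimal sketch using the database we just built:

```python
import sqlite_utils

db = sqlite_utils.Database("/tmp/peps.db")
# Returns matching rows as dictionaries, ordered by relevance
for row in db["peps"].search("dictionary"):
    print(row["PEP"], row["Title"])
```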
--------------------------------------------------------------------------------