├── Makefile ├── README.md ├── advanced-sql.md ├── baked-data.md ├── codespaces-jupyter.gif ├── conf.py ├── datasette-lite.md ├── datasette-publish.md ├── datasette.md ├── further-reading.md ├── index.md ├── requirements.txt ├── setup.md ├── sqlite-intro.md └── sqlite-utils.md /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = sqlite-utils 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 21 | 22 | livehtml: 23 | sphinx-autobuild -b html "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data analysis with SQLite and Python 2 | 3 | A three hour tutorial presented at PyCon 2023. 4 | 5 | Now available as **[an official Datasette tutorial](https://datasette.io/tutorials/data-analysis)**. 6 | 7 | 2hr45m YouTube video of this tutorial: https://www.youtube.com/watch?v=5TdIxxBPUSI 8 | 9 | See also [Data analysis with SQLite and Python for PyCon 2023](https://simonwillison.net/2023/Apr/20/pycon-2023/) on my blog. 10 | -------------------------------------------------------------------------------- /advanced-sql.md: -------------------------------------------------------------------------------- 1 | # Advanced SQL 2 | 3 | ## Aggregations 4 | 5 | The simplest form of aggregation is the one Datasette does to implement its own faceting feature. 6 | 7 | ```sql 8 | select 9 | party, 10 | count(*) 11 | from 12 | executive_terms 13 | where 14 | type = 'prez' 15 | group by 16 | party 17 | ``` 18 | [Try that query here](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++party%2C%0D%0A++count%28*%29%0D%0Afrom%0D%0A++executive_terms%0D%0Awhere%0D%0A++type+%3D+%27prez%27%0D%0Agroup+by%0D%0A++party). 19 | 20 | The `group by` creates groups of rows, then the aggregation functions calculate a value across that entire group.
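Here's a minimal sketch of running that same aggregation from Python's `sqlite3` module - it assumes you have the `legislators.db` file that we download later in this tutorial:

```python
import sqlite3

db = sqlite3.connect("legislators.db")
sql = """
select party, count(*)
from executive_terms
where type = 'prez'
group by party
"""
# One row per group: the party plus the count of rows in that group
for party, num_terms in db.execute(sql):
    print(party, num_terms)
```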
21 | 22 | The most common aggregation functions are: 23 | 24 | - `count(*)` - count the number of rows in each group 25 | - `max(column)` - maximum value for a column 26 | - `min(column)` - minimum value for a column 27 | - `sum(column)` - sum up the values in the column 28 | 29 | Here's an example of `sum()` and `count()` from [What's in the RedPajama-Data-1T LLM training set](https://simonwillison.net/2023/Apr/17/redpajama-data/): 30 | ```sql 31 | select 32 | top_folders, 33 | sum(size_gb) as total_gb, 34 | count(*) as num_files 35 | from raw 36 | group by top_folders 37 | order by sum(size_gb) desc 38 | ``` 39 | [Run that in Datasette Lite](https://lite.datasette.io/?install=datasette-copyable&json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464#/data?sql=select%0A++top_folders%2C%0A++cast+%28sum%28size_gb%29+as+integer%29+as+total_gb%2C%0A++count%28*%29+as+num_files%0Afrom+raw%0Agroup+by+top_folders%0Aorder+by+sum%28size_gb%29+desc). 40 | 41 | Change the `total_gb` line to this to convert it to an integer (note that `cast` truncates rather than rounding): 42 | ```sql 43 | cast (sum(size_gb) as integer) as total_gb, 44 | ``` 45 | ## Subqueries 46 | 47 | SQLite has excellent support for subqueries. You can use them in `where X in` clauses: 48 | 49 | ```sql 50 | select html_url from releases where repo in ( 51 | select id from repos where full_name in ( 52 | select repo from plugin_repos 53 | ) 54 | ) 55 | order by created_at desc 56 | ``` 57 | [Run that on datasette.io](https://datasette.io/content?sql=select+html_url+from+releases+where+repo+in+%28%0D%0A++select+id+from+repos+where+full_name+in+%28%0D%0A++++select+repo+from+plugin_repos%0D%0A++%29%0D%0A%29%0D%0Aorder+by+created_at+desc). Sometimes I find these to be more readable than joins! 58 | 59 | You can also use them directly in `select` clauses: 60 | 61 | ```sql 62 | select 63 | full_name, 64 | ( 65 | select 66 | html_url 67 | from 68 | releases 69 | where 70 | releases.repo = repos.id 71 | order by 72 | created_at desc 73 | limit 74 | 1 75 | ) as latest_release 76 | from 77 | repos 78 | ``` 79 | [Run that here](https://datasette.io/content?sql=select+full_name%2C+%28select+html_url+from+releases+where+releases.repo+%3D+repos.id+order+by+created_at+desc+limit+1%29+as+latest_release+from+repos). 80 | 81 | ## CTEs 82 | 83 | CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
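Before the fuller example below, here's a minimal sketch of the shape - the `numbers` alias only exists for the duration of this one query:

```python
import sqlite3

db = sqlite3.connect(":memory:")
sql = """
with numbers as (
    select 1 as n union all select 2 union all select 3
)
select sum(n) from numbers
"""
print(db.execute(sql).fetchone())
# (6,)
```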
84 | 85 | ```sql 86 | with presidents as ( 87 | select 88 | executives.name 89 | from 90 | executive_terms 91 | join executives 92 | on executive_terms.executive_id = executives.id 93 | where 94 | executive_terms.type = 'prez' 95 | ), 96 | vice_presidents as ( 97 | select 98 | executives.name 99 | from 100 | executive_terms 101 | join executives 102 | on executive_terms.executive_id = executives.id 103 | where 104 | executive_terms.type = 'viceprez' 105 | ) 106 | select 107 | distinct name 108 | from 109 | presidents 110 | where name in vice_presidents 111 | ``` 112 | [Try this CTE query here](https://congress-legislators.datasettes.com/legislators?sql=with+presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27prez%27%0D%0A%29%2C%0D%0Avice_presidents+as+%28%0D%0A++select%0D%0A++++executives.name%0D%0A++from%0D%0A++++executive_terms%0D%0A++++join+executives+on+executive_terms.executive_id+%3D+executives.id%0D%0A++where%0D%0A++++executive_terms.type+%3D+%27viceprez%27%0D%0A%29%0D%0Aselect%0D%0A++distinct+name%0D%0Afrom%0D%0A++presidents%0D%0Awhere%0D%0A++name+in+vice_presidents). 113 | 114 | ## JSON 115 | 116 | SQLite has excellent JSON functionality built in. Store JSON in a `text` column and you can query it using `json_extract()` - you can also build JSON values in `select` queries. 117 | 118 | [Returning related rows in a single SQL query using JSON](https://til.simonwillison.net/sqlite/related-rows-single-query) shows some advanced tricks you can do with this. 119 | 120 | ```sql 121 | select 122 | legislators.id, 123 | legislators.name, 124 | json_group_array(json_object( 125 | 'type', legislator_terms.type, 126 | 'state', legislator_terms.state, 127 | 'start', legislator_terms.start, 128 | 'end', legislator_terms.end, 129 | 'party', legislator_terms.party 130 | )) as terms, 131 | count(*) as num_terms 132 | from 133 | legislators join legislator_terms on legislator_terms.legislator_id = legislators.id 134 | group by legislators.id 135 | order by 136 | id 137 | limit 138 | 10 139 | ``` 140 | [Run that query](https://congress-legislators.datasettes.com/legislators?sql=select%0D%0A++legislators.id%2C%0D%0A++legislators.name%2C%0D%0A++json_group_array(json_object(%0D%0A++++%27type%27%2C+legislator_terms.type%2C%0D%0A++++%27state%27%2C+legislator_terms.state%2C%0D%0A++++%27start%27%2C+legislator_terms.start%2C%0D%0A++++%27end%27%2C+legislator_terms.end%2C%0D%0A++++%27party%27%2C+legislator_terms.party%0D%0A+++))+as+terms%2C%0D%0A+++count(*)+as+num_terms%0D%0Afrom%0D%0A++legislators+join+legislator_terms+on+legislator_terms.legislator_id+%3D+legislators.id%0D%0A++group+by+legislators.id%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++10). 
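Here's a small sketch of `json_extract()` in action - the `docs` table and its contents are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table docs (id integer primary key, data text)")
db.execute(
    "insert into docs (data) values (?)",
    ['{"name": "Datasette", "tags": ["sqlite", "python"]}'],
)
# Path expressions pull values back out of the stored JSON
print(db.execute("""
    select
        json_extract(data, '$.name'),
        json_extract(data, '$.tags[0]')
    from docs
""").fetchall())
# [('Datasette', 'sqlite')]
```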
141 | 142 | Paul Ford [said about SQLite's JSON support](https://simonwillison.net/2018/Jan/29/paul-ford/): 143 | 144 | > The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.” 145 | 146 | ## Window functions 147 | 148 | I wanted to run a query that would return the following: 149 | 150 | - The repository name 151 | - The date of the most recent release from that repository (the releases table is a many-to-one against repos) 152 | - The total number of releases 153 | - **The three most recent releases** (as a JSON array of objects) 154 | 155 | 156 | ```sql 157 | with cte as ( 158 | select 159 | repos.full_name, 160 | releases.created_at, 161 | releases.id as rel_id, 162 | releases.name as rel_name, 163 | releases.created_at as rel_created_at, 164 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank 165 | from repos 166 | left join releases on releases.repo = repos.id 167 | ) 168 | select 169 | full_name, 170 | max(created_at) as max_created_at, 171 | count(rel_id) as releases_count, 172 | json_group_array( 173 | json_object( 174 | 'id', rel_id, 175 | 'name', rel_name, 176 | 'created_at', rel_created_at 177 | ) 178 | ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases 179 | from cte 180 | group by full_name 181 | order by releases_count desc 182 | ``` 183 | [Run that query here](https://datasette.io/content?sql=with+cte+as+%28%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank%28%29+over+%28partition+by+repos.id+order+by+releases.created_at+desc%29+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A%29%0D%0Aselect%0D%0A++full_name%2C%0D%0A++max%28created_at%29+as+max_created_at%2C%0D%0A++count%28rel_id%29+as+releases_count%2C%0D%0A++json_group_array%28%0D%0A++++json_object%28%0D%0A++++++%27id%27%2C+rel_id%2C%0D%0A++++++%27name%27%2C+rel_name%2C%0D%0A++++++%27created_at%27%2C+rel_created_at%0D%0A++++%29%0D%0A++%29+filter+%28where+rel_id+is+not+null+and+rel_rank+%3C%3D+3%29+as+recent_releases%0D%0Afrom+cte%0D%0Agroup+by+full_name%0D%0Aorder+by+releases_count+desc). 
184 | 185 | [Running this smaller query](https://datasette.io/content?sql=with+cte+as+(%0D%0A++select%0D%0A++++repos.full_name%2C%0D%0A++++releases.created_at%2C%0D%0A++++releases.id+as+rel_id%2C%0D%0A++++releases.name+as+rel_name%2C%0D%0A++++releases.created_at+as+rel_created_at%2C%0D%0A++++rank()+over+(partition+by+repos.id+order+by+releases.created_at+desc)+as+rel_rank%0D%0A++from+repos%0D%0A++++left+join+releases+on+releases.repo+%3D+repos.id%0D%0A)%0D%0Aselect+*+from+cte) helps show what's going on with that `rel_rank` column: 186 | 187 | ```sql 188 | with cte as ( 189 | select 190 | repos.full_name, 191 | releases.created_at, 192 | releases.id as rel_id, 193 | releases.name as rel_name, 194 | releases.created_at as rel_created_at, 195 | rank() over (partition by repos.id order by releases.created_at desc) as rel_rank 196 | from repos 197 | left join releases on releases.repo = repos.id 198 | ) 199 | select * from cte 200 | ``` 201 | -------------------------------------------------------------------------------- /baked-data.md: -------------------------------------------------------------------------------- 1 | # Baked Data 2 | 3 | [The Baked Data architectural pattern](https://simonwillison.net/2021/Jul/28/baked-data/) describes this approach, which is key to taking full advantage of SQLite and Datasette. 4 | 5 | I like to build my databases in GitHub Actions. 6 | 7 | ## Niche Museums and TILs 8 | 9 | - <https://www.niche-museums.com/> is published from the [simonw/museums](https://github.com/simonw/museums) repository - one big YAML file for the content. 10 | - <https://til.simonwillison.net/> is published from [simonw/til](https://github.com/simonw/til) - separate Markdown files for each item. 11 | 12 | Both of these sites have Atom feeds that are defined using a Datasette [canned query](https://docs.datasette.io/en/stable/sql_queries.html#canned-queries), in conjunction with the [datasette-atom](https://datasette.io/plugins/datasette-atom) plugin. 13 | 14 | - <https://www.niche-museums.com/browse/feed.atom> 15 | - <https://til.simonwillison.net/tils/feed.atom> 16 | 17 | ## Generating a newsletter with an Observable notebook 18 | 19 | I wrote about this in [Semi-automating a Substack newsletter with an Observable notebook](https://simonwillison.net/2023/Apr/4/substack-observable/): 20 | 21 | - <https://datasette.simonwillison.net/> is a Datasette/SQLite copy of my Django blog, created using [db-to-sqlite](https://datasette.io/tools/db-to-sqlite) by my [simonwillisonblog-backup](https://github.com/simonw/simonwillisonblog-backup) GitHub repository. 22 | - <https://observablehq.com/@simonw/blog-to-newsletter> is my Observable notebook that assembles a newsletter from that data. 23 | - <https://simonw.substack.com/> is the Substack newsletter that I copy that content into. 24 | -------------------------------------------------------------------------------- /codespaces-jupyter.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/simonw/sqlite-tutorial-pycon-2023/2037cca22c6e3a3d87e35796d73e912eef9d12b0/codespaces-jupyter.gif -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | 4 | from subprocess import PIPE, Popen 5 | 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here.
18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | 24 | # -- General configuration ------------------------------------------------ 25 | 26 | # If your documentation needs a minimal Sphinx version, state it here. 27 | # 28 | # needs_sphinx = '1.0' 29 | 30 | # Add any Sphinx extension module names here, as strings. They can be 31 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 32 | # ones. 33 | extensions = ["myst_parser"] 34 | 35 | # Add any paths that contain templates here, relative to this directory. 36 | templates_path = ["_templates"] 37 | 38 | # The suffix(es) of source filenames. 39 | # You can specify multiple suffix as a list of string: 40 | # 41 | # source_suffix = ['.rst', '.md'] 42 | source_suffix = ".rst" 43 | 44 | # The master toctree document. 45 | master_doc = "index" 46 | 47 | # General information about the project. 48 | project = "Data analysis with SQLite and Python, PyCon 2023" 49 | copyright = "2023, Simon Willison" 50 | author = "Simon Willison" 51 | 52 | # The version info for the project you're documenting, acts as replacement for 53 | # |version| and |release|, also used in various other places throughout the 54 | # built documents. 55 | # 56 | # The short X.Y version. 57 | pipe = Popen("git describe --tags --always", stdout=PIPE, shell=True) 58 | git_version = pipe.stdout.read().decode("utf8") 59 | 60 | if git_version: 61 | version = git_version.rsplit("-", 1)[0] 62 | release = git_version 63 | else: 64 | version = "" 65 | release = "" 66 | 67 | # The language for content autogenerated by Sphinx. Refer to documentation 68 | # for a list of supported languages. 69 | # 70 | # This is also used if you do content translation via gettext catalogs. 71 | # Usually you set "language" from the command line for these cases. 72 | language = None 73 | 74 | # List of patterns, relative to source directory, that match files and 75 | # directories to ignore when looking for source files. 76 | # This patterns also effect to html_static_path and html_extra_path 77 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] 78 | 79 | # The name of the Pygments (syntax highlighting) style to use. 80 | pygments_style = "sphinx" 81 | 82 | # If true, `todo` and `todoList` produce output, else they produce nothing. 83 | todo_include_todos = False 84 | 85 | 86 | # -- Options for HTML output ---------------------------------------------- 87 | 88 | # The theme to use for HTML and HTML Help pages. See the documentation for 89 | # a list of builtin themes. 90 | # 91 | html_theme = "sphinx_rtd_theme" 92 | 93 | # Theme options are theme-specific and customize the look and feel of a theme 94 | # further. For a list of options available for each theme, see the 95 | # documentation. 96 | # 97 | # html_theme_options = {} 98 | 99 | # Add any paths that contain custom static files (such as style sheets) here, 100 | # relative to this directory. They are copied after the builtin static files, 101 | # so a file named "default.css" will overwrite the builtin "default.css". 102 | html_static_path = ["_static"] 103 | 104 | # Custom sidebar templates, must be a dictionary that maps document names 105 | # to template names. 
106 | # 107 | # This is required for the alabaster theme 108 | # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars 109 | html_sidebars = { 110 | "**": [ 111 | "relations.html", # needs 'show_related': True theme option to display 112 | "searchbox.html", 113 | ] 114 | } 115 | 116 | 117 | # -- Options for HTMLHelp output ------------------------------------------ 118 | 119 | # Output file base name for HTML help builder. 120 | htmlhelp_basename = "sqlite-tutorial-pycon-2023-doc" 121 | 122 | 123 | # -- Options for LaTeX output --------------------------------------------- 124 | 125 | latex_elements = { 126 | # The paper size ('letterpaper' or 'a4paper'). 127 | # 128 | # 'papersize': 'letterpaper', 129 | # The font size ('10pt', '11pt' or '12pt'). 130 | # 131 | # 'pointsize': '10pt', 132 | # Additional stuff for the LaTeX preamble. 133 | # 134 | # 'preamble': '', 135 | # Latex figure (float) alignment 136 | # 137 | # 'figure_align': 'htbp', 138 | } 139 | 140 | # Grouping the document tree into LaTeX files. List of tuples 141 | # (source start file, target name, title, 142 | # author, documentclass [howto, manual, or own class]). 143 | latex_documents = [ 144 | ( 145 | master_doc, 146 | "sqlite-tutorial-pycon-2023.tex", 147 | "SQLite Tutorial PyCon 2023", 148 | "Simon Willison", 149 | "manual", 150 | ) 151 | ] 152 | 153 | 154 | # -- Options for manual page output --------------------------------------- 155 | 156 | # One entry per manual page. List of tuples 157 | # (source start file, name, description, authors, manual section). 158 | man_pages = [ 159 | ( 160 | master_doc, 161 | "sqlite-tutorial-pycon-2023", 162 | "SQLite Tutorial PyCon 2023", 163 | [author], 164 | 1, 165 | ) 166 | ] 167 | 168 | 169 | # -- Options for Texinfo output ------------------------------------------- 170 | 171 | # Grouping the document tree into Texinfo files. List of tuples 172 | # (source start file, target name, title, author, 173 | # dir menu entry, description, category) 174 | texinfo_documents = [ 175 | ( 176 | master_doc, 177 | "sqlite-tutorial-pycon-2023", 178 | "SQLite Tutorial PyCon 2023", 179 | author, 180 | "sqlite-tutorial-pycon-2023", 181 | "A tutorial at PyCon 2023", 182 | "Miscellaneous", 183 | ) 184 | ] 185 | -------------------------------------------------------------------------------- /datasette-lite.md: -------------------------------------------------------------------------------- 1 | # Datasette Lite 2 | 3 | It's Datasette... running entirely in your web browser with WebAssembly and Pyodide! 4 | 5 | <https://lite.datasette.io/> 6 | 7 | ## Loading SQLite, CSV and JSON data 8 | 9 | - SQLite: load a database with the `?url=` parameter - see [Measuring traffic during the Half Moon Bay Pumpkin Festival](https://simonwillison.net/2022/Oct/19/measuring-traffic/) 10 | - CSV: load a CSV file with the `?csv=` parameter 11 | - JSON: load a JSON file with the `?json=` parameter 12 | 13 | ## Installing plugins 14 | 15 | Add `?install=name-of-plugin` to the URL to `pip install` that plugin into your browser's environment! 16 | 17 | This only works with a subset of plugins.
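For example, this URL opens Datasette Lite with the `datasette-copyable` plugin (the one used in the Advanced SQL section's example link) pre-installed:

```
https://lite.datasette.io/?install=datasette-copyable
```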
18 | 19 | 20 | 21 | ## Further reading 22 | 23 | - [Datasette Lite: a server-side Python web application running in a browser](https://simonwillison.net/2022/May/4/datasette-lite/) 24 | - [Plugin support for Datasette Lite](https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/) 25 | - [Joining CSV files in your browser using Datasette Lite](https://simonwillison.net/2022/Jun/20/datasette-lite-csvs/) -------------------------------------------------------------------------------- /datasette-publish.md: -------------------------------------------------------------------------------- 1 | # Publishing a database to Vercel 2 | 3 | First, install both the Vercel CLI and the `datasette-publish-vercel` plugin. 4 | 5 | <https://vercel.com/docs/cli> has documentation for installing the Vercel CLI. 6 | 7 | On macOS: 8 | 9 | brew install vercel-cli 10 | 11 | Or use one of these: 12 | 13 | npm i -g vercel 14 | 15 | Or: 16 | 17 | pnpm i -g vercel 18 | 19 | Now run this command to log in: 20 | 21 | vercel login 22 | 23 | Install the plugin: 24 | 25 | datasette install datasette-publish-vercel 26 | 27 | And deploy the database: 28 | 29 | datasette publish vercel /tmp/peps.db --project python-peps 30 | 31 | ## Other publishing options 32 | 33 | Datasette [can publish](https://docs.datasette.io/en/stable/publish.html) to the following providers: 34 | 35 | - Heroku (`datasette publish heroku`) 36 | - Google Cloud Run (`datasette publish cloudrun`) 37 | - Vercel (with [datasette-publish-vercel](https://datasette.io/plugins/datasette-publish-vercel)) 38 | - Fly (with [datasette-publish-fly](https://datasette.io/plugins/datasette-publish-fly)) 39 | 40 | Further deployment options are described [in the documentation](https://docs.datasette.io/en/stable/deploying.html). 41 | -------------------------------------------------------------------------------- /datasette.md: -------------------------------------------------------------------------------- 1 | # Exploring data with Datasette 2 | 3 | [Datasette](https://datasette.io/) is "an open source multi-tool for exploring and publishing data". 4 | 5 | ## Installing Datasette locally 6 | 7 | ``` 8 | pip install datasette 9 | ``` 10 | Or if you prefer `pipx`: 11 | ``` 12 | pipx install datasette 13 | ``` 14 | Or Homebrew (on macOS): 15 | ``` 16 | brew install datasette 17 | ``` 18 | [More installation options](https://docs.datasette.io/en/stable/installation.html). 19 | 20 | In Codespaces you should also install the `datasette-codespaces` plugin: 21 | 22 | datasette install datasette-codespaces 23 | 24 | ## Try a database: legislators.db 25 | 26 | ``` 27 | wget https://congress-legislators.datasettes.com/legislators.db 28 | ``` 29 | 30 | This is a database of US legislators, presidents and vice presidents. 31 | 32 | You can explore it online at <https://congress-legislators.datasettes.com/> 33 | 34 | Open it in Datasette like this: 35 | 36 | datasette legislators.db 37 | 38 | We'll follow this tutorial to explore Datasette's features: **[Exploring a database with Datasette](https://datasette.io/tutorials/explore)** 39 | 40 | ## Install some plugins 41 | 42 | Datasette has over a hundred plugins: <https://datasette.io/plugins> 43 | 44 | You can `pip install` them, but it's better to use `datasette install` as that ensures they will go in the correct virtual environment, especially useful if you used `pipx` or Homebrew to install Datasette itself. 45 | 46 | datasette install datasette-cluster-map 47 | 48 | Now restart Datasette and visit the "offices" table to see the result.
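Restarting looks something like this - run the same command as before:

    datasette legislators.db

Then visit `http://localhost:8001/legislators/offices` (8001 is Datasette's default port) - the latitude and longitude columns in that table should now be rendered as a cluster map.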
49 | 50 | You can review what plugins are installed with: 51 | 52 | datasette plugins 53 | 54 | Or by visiting the `/-/plugins` page in Datasette. 55 | 56 | Plugins can be uninstalled with: 57 | 58 | datasette uninstall datasette-cluster-map 59 | 60 | ## Learning SQL with Datasette 61 | 62 | The "✎ View and edit SQL" link is a quick way to start learning basic SQL queries. 63 | 64 | We'll follow this tutorial next: **[Learn SQL with Datasette](https://datasette.io/tutorials/learn-sql)** 65 | 66 | 67 | -------------------------------------------------------------------------------- /further-reading.md: -------------------------------------------------------------------------------- 1 | # More demos and further reading 2 | 3 | ## Fun demos 4 | 5 | I post a lot of Datasette projects on my blog. Some of my recent favourites: 6 | 7 | - [Exploring the training data behind Stable Diffusion](https://simonwillison.net/2022/Sep/5/laion-aesthetics-weeknotes/) describes how I built [laion-aesthetic.datasette.io](https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images) to explore 12m training images used for Stable Diffusion. 8 | - [Analyzing ScotRail audio announcements with Datasette—from prototype to production](https://simonwillison.net/2022/Aug/21/scotrail/) explains the [scotrail.datasette.io](https://scotrail.datasette.io/) project. 9 | - [I built a ChatGPT plugin to answer questions about data hosted in Datasette](https://simonwillison.net/2023/Mar/24/datasette-chatgpt-plugin/) is one of my earlier explorations at the intersection of LLM AI and Datasette. 10 | 11 | ## SpatiaLite 12 | 13 | - [Building a location to time zone API with SpatiaLite](https://datasette.io/tutorials/spatialite) shows how to use SpatiaLite and Datasette to create an API for looking up time zones based on latitude/longitude points. 14 | - [Drawing shapes on a map to query a SpatiaLite database](https://simonwillison.net/2021/Jan/24/drawing-shapes-spatialite/) introduces the `datasette-leaflet-freedraw` plugin and introduces the [calands.datasettes.com](https://calands.datasettes.com/) demo. 15 | - [Joining CSV and JSON data with an in-memory SQLite database](https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/) -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | # Data analysis with SQLite and Python, PyCon 2023 2 | 3 | On YouTube: [Data analysis with SQLite and Python - 2hr45m](https://www.youtube.com/watch?v=5TdIxxBPUSI) 4 | 5 | ```{toctree} 6 | --- 7 | maxdepth: 3 8 | --- 9 | setup 10 | sqlite-intro 11 | datasette 12 | sqlite-utils 13 | datasette-publish 14 | datasette-lite 15 | advanced-sql 16 | baked-data 17 | further-reading 18 | ``` 19 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | sphinx_rtd_theme 2 | sphinx-autobuild 3 | myst-parser 4 | -------------------------------------------------------------------------------- /setup.md: -------------------------------------------------------------------------------- 1 | # What you'll need 2 | 3 | ## python3 and pip 4 | 5 | For the first part of this tutorial, you'll need a Python 3 interpreter with the `sqlite3` standard library module available. 6 | 7 | You can run this on your own computer, or use a browser-based environment. 
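Either way, you can confirm that the `sqlite3` module is available - and see which SQLite version your Python bundles - with a quick snippet like this:

```python
import sqlite3

# Prints the version of the SQLite library compiled into Python
print(sqlite3.sqlite_version)
```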
8 | 9 | [Try Jupyter](https://jupyter.org/try-jupyter/) will work for a purely browser-based (WebAssembly) environment. 10 | 11 | For the second part, you'll also need the ability to `pip install` Python packages. 12 | 13 | Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use [GitHub Codespaces](https://github.com/github/codespaces-jupyter) or [Google Colab](https://colab.research.google.com/) or [Jupyter Hub](https://jupyter.org/try) or your online notebook solution of choice. 14 | 15 | ## Optional: GitHub Codespaces 16 | 17 | I'll be working through the tutorial using GitHub Codespaces, starting from <https://github.com/github/codespaces-jupyter> 18 | 19 | ![Animated demo of Codespaces Jupyter](codespaces-jupyter.gif) -------------------------------------------------------------------------------- /sqlite-intro.md: -------------------------------------------------------------------------------- 1 | # Introduction to SQLite 2 | 3 | ## Why SQLite? 4 | 5 | - It's the database you already have - `sqlite3` has been built into Python since 2006 6 | - It's screamingly fast, and surprisingly powerful 7 | - Amazing compatibility: bindings for every language, files work on every platform, fantastic track record for backwards compatibility, so it's safe to trust your data to a SQLite file 8 | - Databases are just files on disk. You can create and discard them without any ceremony. 9 | - It handles text (including JSON), integers, floating point numbers, and binary blobs. Which means it can store _anything_. 10 | - It can handle up to 281TB of data(!) 11 | - It has some interesting characteristics, for example [Many Small Queries Are Efficient In SQLite](https://www.sqlite.org/np1queryprob.html) 12 | 13 | ## First steps with Python 14 | 15 | Let's download a database to play with - we'll use the database that powers the [datasette.io](https://datasette.io/) website: 16 | 17 | ``` 18 | wget https://datasette.io/content.db 19 | ``` 20 | To access it from Python: 21 | ```python 22 | import sqlite3 23 | 24 | db = sqlite3.connect("content.db") 25 | 26 | print(db.execute("select sqlite_version()").fetchall()) 27 | # [('3.39.0',)] 28 | 29 | # Show rows from the plugin_repos table 30 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 31 | print(row) 32 | 33 | # Each row is a tuple. We can change that like this:
34 | db.row_factory = sqlite3.Row 35 | 36 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 37 | print(row) 38 | 39 | # This outputs <sqlite3.Row object at 0x...> for each row. 40 | # We can use dict() to turn those into dictionaries instead 41 | for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"): 42 | print(dict(row)) 43 | 44 | ``` 45 | 46 | ## Creating a table 47 | 48 | Let's create a table: 49 | ```python 50 | db.execute(""" 51 | create table peps ( 52 | id integer primary key, 53 | title text, 54 | author text, 55 | status text, 56 | type text, 57 | created text, 58 | body text 59 | ); 60 | """) 61 | ``` 62 | 63 | ## Inserting some data 64 | 65 | Here's a function I wrote that can parse a PEP: 66 | 67 | ```python 68 | def parse_pep(s): 69 | intro, body = s.split("\n\n", 1) 70 | pep = {} 71 | current_key = None 72 | current_value = None 73 | for line in intro.split("\n"): 74 | # If the line starts with whitespace, it's a continuation of the previous value 75 | if line.startswith(" ") or line.startswith("\t"): 76 | if current_key is not None: 77 | current_value += " " + line.strip() 78 | pep[current_key] = current_value.strip() 79 | else: 80 | # Split the line into key and value 81 | parts = line.split(": ", 1) 82 | if len(parts) == 2: 83 | key, value = parts 84 | # Update the current key and value 85 | current_key = key 86 | current_value = value 87 | # Add the key-value pair to the pep dictionary 88 | pep[current_key] = current_value.strip() 89 | pep["Body"] = body.strip() 90 | return pep 91 | ``` 92 | Let's fetch and parse the Zen of Python: 93 | ```python 94 | import urllib.request 95 | 96 | zen = urllib.request.urlopen( 97 | "https://raw.githubusercontent.com/python/peps/main/peps/pep-0020.rst" 98 | ).read().decode("utf-8") 99 | 100 | pep = parse_pep(zen) 101 | ``` 102 | And insert that into our database: 103 | ```python 104 | db.execute(""" 105 | insert into peps ( 106 | id, title, author, status, type, created, body 107 | ) values ( 108 | ?, ?, ?, ?, ?, ?, ? 109 | ) 110 | """, ( 111 | pep["PEP"], 112 | pep["Title"], 113 | pep["Author"], 114 | pep["Status"], 115 | pep["Type"], 116 | pep["Created"], 117 | pep["Body"], 118 | )) 119 | ``` 120 | Since this is a dictionary already, we can use alternative syntax like this: 121 | ```python 122 | db.execute("delete from peps where id = 20") 123 | db.execute(""" 124 | insert into peps ( 125 | id, title, author, status, type, created, body 126 | ) values ( 127 | :PEP, :Title, :Author, :Status, :Type, :Created, :Body 128 | ) 129 | """, pep) 130 | ``` 131 | To confirm that it was correctly inserted: 132 | ```python 133 | print(db.execute("select * from peps").fetchall()) 134 | ``` 135 | 136 | ## UPDATE and DELETE 137 | 138 | To update a record: 139 | 140 | ```python 141 | with db: 142 | db.execute(""" 143 | update peps set author = ? 144 | where id = ? 145 | """, ["Tim Peters", 20]) 146 | ``` 147 | This will run in a transaction. 148 | 149 | To delete a record: 150 | 151 | ```python 152 | with db: 153 | db.execute(""" 154 | delete from peps 155 | where id = ? 156 | """, [20]) 157 | ``` 158 | Or to delete everything: 159 | ```sql 160 | delete from peps 161 | ``` 162 | ## SQLite column types 163 | 164 | SQLite `create table` is easier than in many other databases, because there are fewer types to worry about. There are four types: 165 | 166 | - `integer` 167 | - `real` 168 | - `text` 169 | - `blob` 170 | 171 | Unlike other databases, length limits are neither required nor enforced - so don't worry about `varchar(255)`, just use `text`.
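As a quick sketch of what that means in practice, a `text` column will happily store a string of any length:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table notes (id integer primary key, body text)")
# No varchar(255) needed - a million characters in a text column is fine
db.execute("insert into notes (body) values (?)", ["x" * 1_000_000])
print(db.execute("select length(body) from notes").fetchone())
# (1000000,)
```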
172 | 173 | Tables automatically get an ID column called `rowid` - an incrementing integer. This will be the primary key if you don't specify one. 174 | 175 | If you specify `integer primary key` it will be auto-incrementing and will actually map to that underlying `rowid`. 176 | 177 | You can set `id text primary key` for a text primary key - this will not increment, you will have to set it to a unique value for each row yourself. You could do this with UUIDs generated using `uuid.uuid4()` for example. 178 | 179 | SQLite is loosely typed by default: you can insert any type into any column, even if it conflicts with the column type! 180 | 181 | A lot of people find this very uncomfortable. 182 | 183 | As of SQLite 3.37.0 (2021-11-27) you can set [strict mode](https://www.sqlite.org/stricttables.html) on a table to opt out of this loose typing: 184 | 185 | ```sql 186 | create table peps ( 187 | id integer primary key, 188 | title text, 189 | author text, 190 | body text 191 | ) strict 192 | ``` 193 | 194 | ## Transactions 195 | 196 | Here's an example of the impact transactions have on file-based databases: 197 | 198 | ```pycon 199 | >>> import sqlite3 200 | >>> db = sqlite3.connect("/tmp/data.db") 201 | >>> db.execute("create table foo (id integer primary key, name text)") 202 | <sqlite3.Cursor object at 0x...> 203 | ``` 204 | In another window: 205 | ``` 206 | % sqlite3 data.db .dump 207 | PRAGMA foreign_keys=OFF; 208 | BEGIN TRANSACTION; 209 | CREATE TABLE foo (id integer primary key, name text); 210 | COMMIT; 211 | ``` 212 | ```pycon 213 | >>> db.execute('insert into foo (name) values (?)', ['text']) 214 | <sqlite3.Cursor object at 0x...> 215 | ``` 216 | In the other window: 217 | ``` 218 | % sqlite3 data.db .dump 219 | PRAGMA foreign_keys=OFF; 220 | BEGIN TRANSACTION; 221 | CREATE TABLE foo (id integer primary key, name text); 222 | COMMIT; 223 | ``` 224 | ```pycon 225 | >>> db.commit() 226 | ``` 227 | And now: 228 | ``` 229 | % sqlite3 data.db .dump 230 | PRAGMA foreign_keys=OFF; 231 | BEGIN TRANSACTION; 232 | CREATE TABLE foo (id integer primary key, name text); 233 | INSERT INTO foo VALUES(1,'text'); 234 | COMMIT; 235 | ``` 236 | A nicer pattern is to do this: 237 | ```pycon 238 | >>> with db: 239 | ...     db.execute('insert into foo (name) values (?)', ['text']) 240 | ``` 241 | The `with db:` wraps everything inside that block in a transaction. 242 | -------------------------------------------------------------------------------- /sqlite-utils.md: -------------------------------------------------------------------------------- 1 | # sqlite-utils 2 | 3 | [sqlite-utils](https://sqlite-utils.datasette.io/) provides "CLI tool and Python utility functions for manipulating SQLite databases". 4 | 5 | You can install it the same way as Datasette: 6 | 7 | pip install sqlite-utils 8 | 9 | Or with `pipx`: 10 | 11 | pipx install sqlite-utils 12 | 13 | Or with Homebrew: 14 | 15 | brew install sqlite-utils 16 | 17 | It works as both a CLI tool and a Python library. 18 | 19 | ## Using the command-line tools to clean data 20 | 21 | We'll follow this tutorial next: **[Cleaning data with sqlite-utils and Datasette](https://datasette.io/tutorials/clean-data)** 22 | 23 | ## Using sqlite-utils as a Python library, to import all the PEPs 24 | 25 | Let's take our PEPs example from earlier and implement it again, but better, using `sqlite-utils`. 26 | 27 | I'll do this in a notebook.
28 | 29 | ``` 30 | !git clone https://github.com/python/peps /tmp/peps 31 | ``` 32 | We now have ALL of the PEPs in `/tmp/peps` 33 | 34 | ```python 35 | import pathlib 36 | 37 | files = list(pathlib.Path("/tmp/peps/peps").glob("pep-*.rst")) 38 | ``` 39 | And parse them with our function from earlier: 40 | ```python 41 | def parse_pep(s): 42 | intro, body = s.split("\n\n", 1) 43 | pep = {} 44 | current_key = None 45 | current_value = None 46 | for line in intro.split("\n"): 47 | # If the line starts with whitespace, it's a continuation of the previous value 48 | if line.startswith(" ") or line.startswith("\t"): 49 | if current_key is not None: 50 | current_value += " " + line.strip() 51 | pep[current_key] = current_value.strip() 52 | else: 53 | # Split the line into key and value 54 | parts = line.split(": ", 1) 55 | if len(parts) == 2: 56 | key, value = parts 57 | # Update the current key and value 58 | current_key = key 59 | current_value = value 60 | # Add the key-value pair to the pep dictionary 61 | pep[current_key] = current_value.strip() 62 | pep["Body"] = body.strip() 63 | return pep 64 | ``` 65 | ```python 66 | peps = [] 67 | for file in files: 68 | peps.append(parse_pep(file.read_text())) 69 | ``` 70 | We now have a list of dictionaries. Let's load them into SQLite: 71 | ``` 72 | %pip install sqlite-utils 73 | ``` 74 | ```python 75 | import sqlite_utils 76 | db = sqlite_utils.Database("/tmp/peps.db") 77 | db["peps"].insert_all(peps, pk="PEP") 78 | ``` 79 | I got this error: 80 | ``` 81 | OperationalError: table peps has no column named PEP-Delegate 82 | ``` 83 | To fix that, add `alter=True` (which adds any missing columns automatically) and `replace=True` (which replaces rows with duplicate primary keys): 84 | ``` 85 | db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True) 86 | print(db["peps"].count) 87 | # Outputs 429 88 | ``` 89 | ## Enabling full-text search 90 | 91 | SQLite has surprisingly good full-text search built in. 92 | 93 | `sqlite-utils` can help you enable it: 94 | 95 | ```python 96 | db["peps"].enable_fts(["Title", "Body"]) 97 | ``` 98 | Datasette will detect this and add a search box to the top of the table page. 99 | 100 | To run searches in relevance order you'll need to execute a custom SQL query: 101 | 102 | ```sql 103 | select 104 | PEP, 105 | peps.Title, 106 | Version, 107 | Author, 108 | Status, 109 | Type, 110 | Created, 111 | peps.Body, 112 | peps_fts.rank 113 | from 114 | peps 115 | join 116 | peps_fts on peps.rowid = peps_fts.rowid 117 | where 118 | peps_fts match :search 119 | order by 120 | peps_fts.rank 121 | limit 122 | 20 123 | ```
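`sqlite-utils` can also run that kind of search for you from Python. Here's a sketch, assuming the `/tmp/peps.db` file built above with `enable_fts()` already applied:

```python
import sqlite_utils

db = sqlite_utils.Database("/tmp/peps.db")
# table.search() queries the peps_fts index, best matches first
for row in db["peps"].search("typing"):
    print(row["PEP"], row["Title"])
```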