├── requirements.txt
├── .github
│   └── workflows
│       ├── sandpaper-version.txt
│       ├── pr-close-signal.yaml
│       ├── pr-post-remove-branch.yaml
│       ├── pr-preflight.yaml
│       ├── sandpaper-main.yaml
│       ├── update-workflows.yaml
│       ├── pr-receive.yaml
│       ├── update-cache.yaml
│       ├── pr-comment.yaml
│       └── README.md
├── .update-copyright.conf
├── episodes
│   ├── data
│   │   ├── site.csv
│   │   ├── person.csv
│   │   ├── visited.csv
│   │   └── survey.csv
│   ├── files
│   │   ├── survey.db
│   │   └── sql-novice-survey-data.zip
│   ├── fig
│   │   ├── sql-filter.odg
│   │   ├── sql-aggregation.odg
│   │   └── sql-join-structure.svg
│   ├── 08-hygiene.md
│   ├── 04-calc.md
│   ├── 02-sort-dup.md
│   ├── 05-null.md
│   ├── 11-prog-R.md
│   ├── 09-create.md
│   ├── 03-filter.md
│   ├── 10-prog.md
│   ├── 01-select.md
│   ├── 07-join.md
│   └── 06-agg.md
├── site
│   └── README.md
├── profiles
│   └── learner-profiles.md
├── learners
│   ├── discuss.md
│   ├── setup.md
│   └── reference.md
├── CITATION
├── CODE_OF_CONDUCT.md
├── .editorconfig
├── AUTHORS
├── bin
│   └── create-db.sql
├── .gitignore
├── .zenodo.json
├── .mailmap
├── index.md
├── README.md
├── config.yaml
├── LICENSE.md
├── instructors
│   └── instructor-notes.md
└── CONTRIBUTING.md
/requirements.txt: -------------------------------------------------------------------------------- 1 | PyYAML 2 | update-copyright 3 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-version.txt: -------------------------------------------------------------------------------- 1 | 0.16.12 2 | -------------------------------------------------------------------------------- /.update-copyright.conf: -------------------------------------------------------------------------------- 1 | [project] 2 | vcs: Git 3 | 4 | [files] 5 | authors: yes 6 | files: no 7 | -------------------------------------------------------------------------------- /episodes/data/site.csv: -------------------------------------------------------------------------------- 1 | DR-1,-49.85,-128.57 2 | DR-3,-47.15,-126.72 3 | MSK-4,-48.87,-123.4 4 | 
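Aside: the CSV files under `episodes/data/` are imported into the `Site`, `Person`, `Visited`, and `Survey` tables by `bin/create-db.sql` (shown later in this dump). As a minimal sketch — not part of the lesson itself, and assuming the three `site.csv` columns map to the `name`, `lat`, and `long` fields declared in that script — the same import can be reproduced with Python's built-in `sqlite3` module:

```python
import csv
import io
import sqlite3

# Rows exactly as they appear in episodes/data/site.csv.
SITE_CSV = """\
DR-1,-49.85,-128.57
DR-3,-47.15,-126.72
MSK-4,-48.87,-123.4
"""

conn = sqlite3.connect(":memory:")
# Schema taken from bin/create-db.sql.
conn.execute("CREATE TABLE Site (name text, lat real, long real)")
rows = list(csv.reader(io.StringIO(SITE_CSV)))
conn.executemany("INSERT INTO Site VALUES (?, ?, ?)", rows)

# The lat/long columns are declared REAL, so SQLite's type affinity
# coerces the CSV strings to floats on insert.
for name, lat, long_ in conn.execute("SELECT * FROM Site ORDER BY name"):
    print(name, lat, long_)
```

This mirrors what `.mode csv` / `.import` do in the `sqlite3` shell, minus the NULL clean-up that `create-db.sql` performs afterwards.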
-------------------------------------------------------------------------------- /site/README.md: -------------------------------------------------------------------------------- 1 | This directory contains rendered lesson materials. Please do not edit files 2 | here. 3 | -------------------------------------------------------------------------------- /episodes/files/survey.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/files/survey.db -------------------------------------------------------------------------------- /episodes/fig/sql-filter.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/fig/sql-filter.odg -------------------------------------------------------------------------------- /profiles/learner-profiles.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: FIXME 3 | --- 4 | 5 | This is a placeholder file. Please add content here. 
6 | -------------------------------------------------------------------------------- /episodes/fig/sql-aggregation.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/fig/sql-aggregation.odg -------------------------------------------------------------------------------- /episodes/data/person.csv: -------------------------------------------------------------------------------- 1 | dyer,William,Dyer 2 | pb,Frank,Pabodie 3 | lake,Anderson,Lake 4 | roe,Valentina,Roerich 5 | danforth,Frank,Danforth 6 | -------------------------------------------------------------------------------- /episodes/files/sql-novice-survey-data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/files/sql-novice-survey-data.zip -------------------------------------------------------------------------------- /episodes/data/visited.csv: -------------------------------------------------------------------------------- 1 | 619,DR-1,1927-02-08 2 | 622,DR-1,1927-02-10 3 | 734,DR-3,1930-01-07 4 | 735,DR-3,1930-01-12 5 | 751,DR-3,1930-02-26 6 | 752,DR-3, 7 | 837,MSK-4,1932-01-14 8 | 844,DR-1,1932-03-22 9 | -------------------------------------------------------------------------------- /learners/discuss.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Discussion 3 | --- 4 | 5 | Relational databases are the most widely used by far, 6 | but other kinds also exist, 7 | such as the document-oriented database [MongoDB](https://www.mongodb.com/). 
8 | 9 | 10 | -------------------------------------------------------------------------------- /CITATION: -------------------------------------------------------------------------------- 1 | Please cite as: 2 | 3 | Abigail Cabunoc and Sheldon McKay (eds): "Software Carpentry: Using 4 | Databases and SQL." Version 2017.08, August 2017, 5 | https://github.com/swcarpentry/sql-novice-survey, 6 | https://doi.org/10.5281/zenodo.838776 7 | -------------------------------------------------------------------------------- /episodes/data/survey.csv: -------------------------------------------------------------------------------- 1 | 619,dyer,rad,9.82 2 | 619,dyer,sal,0.13 3 | 622,dyer,rad,7.8 4 | 622,dyer,sal,0.09 5 | 734,pb,rad,8.41 6 | 734,lake,sal,0.05 7 | 734,pb,temp,-21.5 8 | 735,pb,rad,7.22 9 | 735,,sal,0.06 10 | 735,,temp,-26.0 11 | 751,pb,rad,4.35 12 | 751,pb,temp,-18.5 13 | 751,lake,sal,0.1 14 | 752,lake,rad,2.19 15 | 752,lake,sal,0.09 16 | 752,lake,temp,-16.0 17 | 752,roe,sal,41.6 18 | 837,lake,rad,1.46 19 | 837,lake,sal,0.21 20 | 837,roe,sal,22.5 21 | 844,roe,rad,11.25 22 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Contributor Code of Conduct" 3 | --- 4 | 5 | As contributors and maintainers of this project, 6 | we pledge to follow the [The Carpentries Code of Conduct][coc]. 7 | 8 | Instances of abusive, harassing, or otherwise unacceptable behavior 9 | may be reported by following our [reporting guidelines][coc-reporting]. 
10 | 11 | 12 | [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html 13 | [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html 14 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | insert_final_newline = true 6 | trim_trailing_whitespace = true 7 | 8 | [*.md] 9 | indent_size = 2 10 | indent_style = space 11 | max_line_length = 100 # Please keep this in sync with bin/lesson_check.py! 12 | trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (<br/>) 13 | 14 | [*.r] 15 | max_line_length = 80 16 | 17 | [*.py] 18 | indent_size = 4 19 | indent_style = space 20 | max_line_length = 79 21 | 22 | [*.sh] 23 | end_of_line = lf 24 | 25 | [Makefile] 26 | indent_style = tab 27 | -------------------------------------------------------------------------------- /.github/workflows/pr-close-signal.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Send Close Pull Request Signal" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [closed] 7 | 8 | jobs: 9 | send-close-signal: 10 | name: "Send closing signal" 11 | runs-on: ubuntu-22.04 12 | if: ${{ github.event.action == 'closed' }} 13 | steps: 14 | - name: "Create PRtifact" 15 | run: | 16 | mkdir -p ./pr 17 | printf ${{ github.event.number }} > ./pr/NUM 18 | - name: Upload Diff 19 | uses: actions/upload-artifact@v4 20 | with: 21 | name: pr 22 | path: ./pr 23 | -------------------------------------------------------------------------------- /AUTHORS: -------------------------------------------------------------------------------- 1 | Paula Andrea 2 | Pauline Barmby 3 | Karl Broman 4 | Amy Brown 5 | Abigail Cabunoc Mayes 6 | Daniel Chen 7 | Liam Clark 8 | Peter Cock 9 | Matthew Collins 10 | Harrison Dekker 11 | Deborah Digges 12 | Stevan Earl 13 | Dirk Eddelbuettel 14 | Ivan Gonzalez 15 | John Gosset 16 | Thomas Guignard 17 | Jonathan Guyer 18 | Kate Hertweck 19 | Nick James 20 | Luke W. Johnston 21 | W. Trevor King 22 | Andrew Kubiak 23 | Avishek Kumar 24 | Peter Li 25 | Tobin Magle 26 | Sue McClatchy 27 | Sheldon McKay 28 | James Mickley 29 | John R.
Moreau 30 | Joshua Nahum 31 | Maneesha Sane 32 | Raniere Silva 33 | Luc Small 34 | Donald Speer 35 | Mark Stacy 36 | Jeff Stafford 37 | Scott Talafuse 38 | Morgan Taschuk 39 | Chris Tomlinson 40 | Ioan Vancea 41 | Ben Waugh 42 | Ethan White 43 | Greg Wilson 44 | Donny Winston 45 | -------------------------------------------------------------------------------- /bin/create-db.sql: -------------------------------------------------------------------------------- 1 | -- Create database to be used for learners. 2 | -- The data for the database are available as CSV files. 3 | -- For more information, see https://www.sqlite.org/cli.html#csv 4 | 5 | -- Generate tables. 6 | create table Person (id text, personal text, family text); 7 | create table Site (name text, lat real, long real); 8 | create table Visited (id text, site text, dated text); 9 | create table Survey (taken integer, person text, quant text, reading real); 10 | 11 | -- Import data. 12 | .mode csv 13 | .import data/person.csv Person 14 | .import data/site.csv Site 15 | .import data/survey.csv Survey 16 | .import data/visited.csv Visited 17 | 18 | -- Convert empty strings to NULLs. 
19 | UPDATE Visited SET dated = null WHERE dated = ''; 20 | UPDATE Survey SET person = null WHERE person = ''; 21 | -------------------------------------------------------------------------------- /.github/workflows/pr-post-remove-branch.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Remove Temporary PR Branch" 2 | 3 | on: 4 | workflow_run: 5 | workflows: ["Bot: Send Close Pull Request Signal"] 6 | types: 7 | - completed 8 | 9 | jobs: 10 | delete: 11 | name: "Delete branch from Pull Request" 12 | runs-on: ubuntu-22.04 13 | if: > 14 | github.event.workflow_run.event == 'pull_request' && 15 | github.event.workflow_run.conclusion == 'success' 16 | permissions: 17 | contents: write 18 | steps: 19 | - name: 'Download artifact' 20 | uses: carpentries/actions/download-workflow-artifact@main 21 | with: 22 | run: ${{ github.event.workflow_run.id }} 23 | name: pr 24 | - name: "Get PR Number" 25 | id: get-pr 26 | run: | 27 | unzip pr.zip 28 | echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT 29 | - name: 'Remove branch' 30 | uses: carpentries/actions/remove-branch@main 31 | with: 32 | pr: ${{ steps.get-pr.outputs.NUM }} 33 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # sandpaper files 2 | episodes/*html 3 | site/* 4 | !site/README.md 5 | 6 | # History files 7 | .Rhistory 8 | .Rapp.history 9 | # Session Data files 10 | .RData 11 | # User-specific files 12 | .Ruserdata 13 | # Example code in package build process 14 | *-Ex.R 15 | # Output files from R CMD build 16 | /*.tar.gz 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | # RStudio files 20 | .Rproj.user/ 21 | # produced vignettes 22 | vignettes/*.html 23 | vignettes/*.pdf 24 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 25 | .httr-oauth 26 | # knitr and R markdown default cache directories 27 | *_cache/ 28 | 
/cache/ 29 | # Temporary files created by R markdown 30 | *.utf8.md 31 | *.knit.md 32 | # R Environment Variables 33 | .Renviron 34 | # pkgdown site 35 | docs/ 36 | # translation temp files 37 | po/*~ 38 | # renv detritus 39 | renv/sandbox/ 40 | *.pyc 41 | *~ 42 | .DS_Store 43 | .ipynb_checkpoints 44 | .sass-cache 45 | .jekyll-cache/ 46 | .jekyll-metadata 47 | __pycache__ 48 | _site 49 | .Rproj.user 50 | .bundle/ 51 | .vendor/ 52 | vendor/ 53 | .docker-vendor/ 54 | Gemfile.lock 55 | .*history 56 | -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "contributors": [ 3 | { 4 | "type": "Editor", 5 | "name": "Henry Senyondo", 6 | "orcid": "0000-0001-7105-5808" 7 | } 8 | ], 9 | "creators": [ 10 | { 11 | "name": "Henry Senyondo", 12 | "orcid": "0000-0001-7105-5808" 13 | }, 14 | { 15 | "name": "James Scott-Brown" 16 | }, 17 | { 18 | "name": "Dan Michael Heggø" 19 | }, 20 | { 21 | "name": "Kyrre Traavik Låberg" 22 | }, 23 | { 24 | "name": "Colin Sauze" 25 | }, 26 | { 27 | "name": "Mengzhen Sun" 28 | }, 29 | { 30 | "name": "Simon Willison" 31 | }, 32 | { 33 | "name": "Jenna Jordan", 34 | "orcid": "0000-0001-9246-5355" 35 | }, 36 | { 37 | "name": "Mohammed Tanash", 38 | "orcid": "0000-0002-2877-5735" 39 | }, 40 | { 41 | "name": "Peter Aronoff" 42 | }, 43 | { 44 | "name": "Samuel Lelièvre", 45 | "orcid": "0000-0002-7275-0965" 46 | }, 47 | { 48 | "name": "Wilfred Tyler Gee" 49 | }, 50 | { 51 | "name": "Andrew Jerrison" 52 | } 53 | ], 54 | "license": { 55 | "id": "CC-BY-4.0" 56 | } 57 | } -------------------------------------------------------------------------------- /.mailmap: -------------------------------------------------------------------------------- 1 | Abigail Cabunoc Mayes 2 | Abigail Cabunoc Mayes 3 | Chris Tomlinson 4 | Deborah Digges 5 | Donald Speer 6 | François Michonneau 7 | Greg Wilson 8 | James Allen 9 | John R. 
Moreau 10 | Liam Clark 11 | Luke W. Johnston 12 | Maneesha Sane 13 | Mike Jackson 14 | Nick James 15 | Pauline Barmby 16 | Raniere Silva 17 | Raniere Silva 18 | Rémi Emonet 19 | Rémi Emonet 20 | Sheldon McKay 21 | Sue McClatchy 22 | Timothée Poisot 23 | Tobin Magle 24 | -------------------------------------------------------------------------------- /.github/workflows/pr-preflight.yaml: -------------------------------------------------------------------------------- 1 | name: "Pull Request Preflight Check" 2 | 3 | on: 4 | pull_request_target: 5 | branches: 6 | ["main"] 7 | types: 8 | ["opened", "synchronize", "reopened"] 9 | 10 | jobs: 11 | test-pr: 12 | name: "Test if pull request is valid" 13 | if: ${{ github.event.action != 'closed' }} 14 | runs-on: ubuntu-22.04 15 | outputs: 16 | is_valid: ${{ steps.check-pr.outputs.VALID }} 17 | permissions: 18 | pull-requests: write 19 | steps: 20 | - name: "Get Invalid Hashes File" 21 | id: hash 22 | run: | 23 | echo "json<> $GITHUB_OUTPUT 26 | - name: "Check PR" 27 | id: check-pr 28 | uses: carpentries/actions/check-valid-pr@main 29 | with: 30 | pr: ${{ github.event.number }} 31 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 32 | fail_on_error: true 33 | - name: "Comment result of validation" 34 | id: comment-diff 35 | if: ${{ always() }} 36 | uses: carpentries/actions/comment-diff@main 37 | with: 38 | pr: ${{ github.event.number }} 39 | body: ${{ steps.check-pr.outputs.MSG }} 40 | -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | site: sandpaper::sandpaper_site 3 | --- 4 | 5 | In the late 1920s and early 1930s, 6 | William Dyer, 7 | Frank Pabodie, 8 | and Valentina Roerich led expeditions to the 9 | [Pole of Inaccessibility](https://en.wikipedia.org/wiki/Pole_of_inaccessibility) 10 | in the South Pacific, 11 | and then onward to Antarctica. 
12 | Two years ago, 13 | the records of their expeditions were found in a storage locker at Miskatonic University. 14 | We have scanned and OCR'd the data they contain, 15 | and we now want to store that information 16 | in a way that will make search and analysis easy. 17 | 18 | Three common options for storage are 19 | text files, 20 | spreadsheets, 21 | and databases. 22 | Text files are easiest to create, 23 | and work well with version control, 24 | but then we would have to build search and analysis tools ourselves. 25 | Spreadsheets are good for doing simple analyses, 26 | but they don't handle large or complex data sets well. 27 | Databases, however, include powerful tools for search and analysis, 28 | and can handle large, complex data sets. 29 | These lessons will show how to use a database to explore the expeditions' data. 30 | 31 | :::::::::::::::::::::::::::::::::::::::::: prereq 32 | 33 | ## Prerequisites 34 | 35 | - This lesson requires the Unix shell, plus [SQLite3](https://www.sqlite.org/) or [DB Browser for SQLite](https://sqlitebrowser.org/). 36 | - Please download the database we will use: [survey.db](episodes/files/survey.db) 37 | 38 | 39 | :::::::::::::::::::::::::::::::::::::::::::::::::: 40 | 41 | 42 | -------------------------------------------------------------------------------- /learners/setup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Setup 3 | --- 4 | 5 | # Software 6 | 7 | For this course you will need the UNIX shell (described in the [UNIX Shell lesson](https://swcarpentry.github.io/shell-novice/#install-software)), plus [SQLite3](https://www.sqlite.org/) or 8 | [DB Browser for SQLite](https://sqlitebrowser.org/). 9 | 10 | If you are running **macOS**, you should already have SQLite installed. You can run `sqlite3 --version` 11 | in a terminal to confirm that it is available. You can also download DB Browser for SQLite from 12 | [their website](https://sqlitebrowser.org/dl/).
13 | 14 | If you are running **Linux**, you may already have SQLite3 installed; use the command 15 | `which sqlite3` to see the path of the program. Otherwise, you should be able to get it 16 | from your package manager (on Debian/Ubuntu, you can use the command `apt install sqlite3`). 17 | 18 | If you are running **Windows**, download the installers and run them as administrator. 19 | Make sure you select the right installer version for your system. 20 | If the installer asks whether to add the path to the environment variables, check yes; otherwise you will have to add the path of the executable to the `PATH` environment variable manually. 21 | This path tells the system where to find the executable program. 22 | 23 | If installing SQLite3 using Anaconda, refer to the [anaconda sqlite docs](https://anaconda.org/anaconda/sqlite). 24 | 25 | After installing and setting the paths, close the terminal and open a new one 26 | so that the updated paths and configurations are loaded. 27 | 28 | # Files 29 | 30 | Please download the database we'll be using: [survey.db](files/survey.db) 31 | 32 | 33 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-main.yaml: -------------------------------------------------------------------------------- 1 | name: "01 Build and Deploy Site" 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | schedule: 9 | - cron: '0 0 * * 2' 10 | workflow_dispatch: 11 | inputs: 12 | name: 13 | description: 'Who triggered this build?'
14 | required: true 15 | default: 'Maintainer (via GitHub)' 16 | reset: 17 | description: 'Reset cached markdown files' 18 | required: false 19 | default: false 20 | type: boolean 21 | jobs: 22 | full-build: 23 | name: "Build Full Site" 24 | 25 | # 2024-10-01: ubuntu-latest is now 24.04 and R is not installed by default in the runner image 26 | # pin to 22.04 for now 27 | runs-on: ubuntu-22.04 28 | permissions: 29 | checks: write 30 | contents: write 31 | pages: write 32 | env: 33 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 34 | RENV_PATHS_ROOT: ~/.local/share/renv/ 35 | steps: 36 | 37 | - name: "Checkout Lesson" 38 | uses: actions/checkout@v4 39 | 40 | - name: "Set up R" 41 | uses: r-lib/actions/setup-r@v2 42 | with: 43 | use-public-rspm: true 44 | install-r: false 45 | 46 | - name: "Set up Pandoc" 47 | uses: r-lib/actions/setup-pandoc@v2 48 | 49 | - name: "Setup Lesson Engine" 50 | uses: carpentries/actions/setup-sandpaper@main 51 | with: 52 | cache-version: ${{ secrets.CACHE_VERSION }} 53 | 54 | - name: "Setup Package Cache" 55 | uses: carpentries/actions/setup-lesson-deps@main 56 | with: 57 | cache-version: ${{ secrets.CACHE_VERSION }} 58 | 59 | - name: "Deploy Site" 60 | run: | 61 | reset <- "${{ github.event.inputs.reset }}" == "true" 62 | sandpaper::package_cache_trigger(TRUE) 63 | sandpaper:::ci_deploy(reset = reset) 64 | shell: Rscript {0} 65 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![GitHub release][shields_release]][swc_sql_novice_survey_releases] 2 | [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://slack-invite.carpentries.org/) 3 | [![Slack Status](https://img.shields.io/badge/Slack_Channel-swc--sql-E01563.svg)](https://carpentries.slack.com/messages/C9X3YNVNY) 4 | [![DOI][zenodo_badge_DOI]][all_releases_DOI] 5 | 6 | # sql-novice-survey 7 | 8 | 
An introduction to databases and SQL using Antarctic survey data. 9 | Please see [https://swcarpentry.github.io/sql-novice-survey/](https://swcarpentry.github.io/sql-novice-survey/) for a rendered version of this material, 10 | [the lesson template documentation][lesson-example] 11 | for instructions on formatting, building, and submitting material, 12 | or run `make` in this directory for a list of helpful commands. 13 | 14 | ## Authors 15 | 16 | A list of contributors to the lesson can be found in [AUTHORS](AUTHORS). 17 | 18 | ## License 19 | 20 | Instructional material from this lesson is made available under the Creative 21 | Commons Attribution (CC BY 4.0) license. Except where otherwise noted, example 22 | programs and software included as part of this lesson are made available under 23 | the MIT license. For more information, see [LICENSE.md](LICENSE.md). 24 | 25 | ## Citation 26 | 27 | To cite this lesson, please consult [CITATION](CITATION). 28 | To use a particular version's DOI, refer to [all Zenodo released versions][all_zenodo_versions]. 29 | 30 | Maintainer(s): 31 | 32 | - [Henry Senyondo](https://carpentries.org/instructors/#henrykironde) 33 | - [Kellie Templeman](https://github.com/kltempleman) 34 | - [Novica Nakov](https://github.com/novica) 35 | 36 | [swc_sql_novice_survey_releases]: https://github.com/swcarpentry/sql-novice-survey/releases 37 | [shields_release]: https://img.shields.io/github/v/release/swcarpentry/sql-novice-survey 38 | [all_releases_DOI]: https://doi.org/10.5281/zenodo.3265270 39 | [zenodo_badge_DOI]: https://zenodo.org/badge/DOI/10.5281/zenodo.3265271.svg 40 | [lesson-example]: https://carpentries.github.io/lesson-example/ 41 | [all_zenodo_versions]: https://zenodo.org/search?page=1&size=20&q=3265271&sort=-version&all_versions=True 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /.github/workflows/update-workflows.yaml:
-------------------------------------------------------------------------------- 1 | name: "02 Maintain: Update Workflow Files" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'weekly run' 10 | clean: 11 | description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)' 12 | required: false 13 | default: '.yaml' 14 | schedule: 15 | # Run every Tuesday 16 | - cron: '0 0 * * 2' 17 | 18 | jobs: 19 | check_token: 20 | name: "Check SANDPAPER_WORKFLOW token" 21 | runs-on: ubuntu-22.04 22 | outputs: 23 | workflow: ${{ steps.validate.outputs.wf }} 24 | repo: ${{ steps.validate.outputs.repo }} 25 | steps: 26 | - name: "validate token" 27 | id: validate 28 | uses: carpentries/actions/check-valid-credentials@main 29 | with: 30 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 31 | 32 | update_workflow: 33 | name: "Update Workflow" 34 | runs-on: ubuntu-22.04 35 | needs: check_token 36 | if: ${{ needs.check_token.outputs.workflow == 'true' }} 37 | steps: 38 | - name: "Checkout Repository" 39 | uses: actions/checkout@v4 40 | 41 | - name: Update Workflows 42 | id: update 43 | uses: carpentries/actions/update-workflows@main 44 | with: 45 | clean: ${{ github.event.inputs.clean }} 46 | 47 | - name: Create Pull Request 48 | id: cpr 49 | if: "${{ steps.update.outputs.new }}" 50 | uses: carpentries/create-pull-request@main 51 | with: 52 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 53 | delete-branch: true 54 | branch: "update/workflows" 55 | commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}" 56 | title: "Update Workflows to Version ${{ steps.update.outputs.new }}" 57 | body: | 58 | :robot: This is an automated build 59 | 60 | Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }} 61 | 62 | - Auto-generated by [create-pull-request][1] on ${{ 
steps.update.outputs.date }} 63 | 64 | [1]: https://github.com/carpentries/create-pull-request/tree/main 65 | labels: "type: template and tools" 66 | draft: false 67 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | #------------------------------------------------------------ 2 | # Values for this lesson. 3 | #------------------------------------------------------------ 4 | 5 | # Which carpentry is this (swc, dc, lc, or cp)? 6 | # swc: Software Carpentry 7 | # dc: Data Carpentry 8 | # lc: Library Carpentry 9 | # cp: Carpentries (to use for instructor training for instance) 10 | # incubator: The Carpentries Incubator 11 | carpentry: 'swc' 12 | 13 | # Overall title for pages. 14 | title: 'Databases and SQL' 15 | 16 | # Date the lesson was created (YYYY-MM-DD, this is empty by default) 17 | created: '2014-11-21' 18 | 19 | # Comma-separated list of keywords for the lesson 20 | keywords: 'software, data, lesson, The Carpentries' 21 | 22 | # Life cycle stage of the lesson 23 | # possible values: pre-alpha, alpha, beta, stable 24 | life_cycle: 'stable' 25 | 26 | # License of the lesson materials (recommended CC-BY 4.0) 27 | license: 'CC-BY 4.0' 28 | 29 | # Link to the source repository for this lesson 30 | source: 'https://github.com/swcarpentry/sql-novice-survey' 31 | 32 | # Default branch of your lesson 33 | branch: 'main' 34 | 35 | # Who to contact if there are any issues 36 | contact: 'team@carpentries.org' 37 | 38 | # Navigation ------------------------------------------------ 39 | # 40 | # Use the following menu items to specify the order of 41 | # individual pages in each dropdown section. Leave blank to 42 | # include all pages in the folder. 
43 | # 44 | # Example ------------- 45 | # 46 | # episodes: 47 | # - introduction.md 48 | # - first-steps.md 49 | # 50 | # learners: 51 | # - setup.md 52 | # 53 | # instructors: 54 | # - instructor-notes.md 55 | # 56 | # profiles: 57 | # - one-learner.md 58 | # - another-learner.md 59 | 60 | # Order of episodes in your lesson 61 | episodes: 62 | - 01-select.md 63 | - 02-sort-dup.md 64 | - 03-filter.md 65 | - 04-calc.md 66 | - 05-null.md 67 | - 06-agg.md 68 | - 07-join.md 69 | - 08-hygiene.md 70 | - 09-create.md 71 | - 10-prog.md 72 | - 11-prog-R.md 73 | 74 | # Information for Learners 75 | learners: 76 | 77 | # Information for Instructors 78 | instructors: 79 | 80 | # Learner Profiles 81 | profiles: 82 | 83 | # Customisation --------------------------------------------- 84 | # 85 | # This space below is where custom yaml items (e.g. pinning 86 | # sandpaper and varnish versions) should live 87 | 88 | 89 | url: 'https://swcarpentry.github.io/sql-novice-survey' 90 | analytics: carpentries 91 | lang: en 92 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Licenses" 3 | --- 4 | 5 | ## Instructional Material 6 | 7 | All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) 8 | instructional material is made available under the [Creative Commons 9 | Attribution license][cc-by-human]. The following is a human-readable summary of 10 | (and not a substitute for) the [full legal text of the CC BY 4.0 11 | license][cc-by-legal]. 12 | 13 | You are free: 14 | 15 | - to **Share**---copy and redistribute the material in any medium or format 16 | - to **Adapt**---remix, transform, and build upon the material 17 | 18 | for any purpose, even commercially. 19 | 20 | The licensor cannot revoke these freedoms as long as you follow the license 21 | terms. 
22 | 23 | Under the following terms: 24 | 25 | - **Attribution**---You must give appropriate credit (mentioning that your work 26 | is derived from work that is Copyright (c) The Carpentries and, where 27 | practical, linking to <https://carpentries.org/>), provide a [link to the 28 | license][cc-by-human], and indicate if changes were made. You may do so in 29 | any reasonable manner, but not in any way that suggests the licensor endorses 30 | you or your use. 31 | 32 | - **No additional restrictions**---You may not apply legal terms or 33 | technological measures that legally restrict others from doing anything the 34 | license permits. With the understanding that: 35 | 36 | Notices: 37 | 38 | * You do not have to comply with the license for elements of the material in 39 | the public domain or where your use is permitted by an applicable exception 40 | or limitation. 41 | * No warranties are given. The license may not give you all of the permissions 42 | necessary for your intended use. For example, other rights such as publicity, 43 | privacy, or moral rights may limit how you use the material. 44 | 45 | ## Software 46 | 47 | Except where otherwise noted, the example programs and other software provided 48 | by The Carpentries are made available under the [OSI][osi]-approved [MIT 49 | license][mit-license]. 50 | 51 | Permission is hereby granted, free of charge, to any person obtaining a copy of 52 | this software and associated documentation files (the "Software"), to deal in 53 | the Software without restriction, including without limitation the rights to 54 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 55 | of the Software, and to permit persons to whom the Software is furnished to do 56 | so, subject to the following conditions: 57 | 58 | The above copyright notice and this permission notice shall be included in all 59 | copies or substantial portions of the Software.
60 | 61 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 62 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 63 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 64 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 65 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 66 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 67 | SOFTWARE. 68 | 69 | ## Trademark 70 | 71 | "The Carpentries", "Software Carpentry", "Data Carpentry", and "Library 72 | Carpentry" and their respective logos are registered trademarks of 73 | [The Carpentries, Inc.][carpentries]. 74 | 75 | [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ 76 | [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode 77 | [mit-license]: https://opensource.org/licenses/mit-license.html 78 | [carpentries]: https://carpentries.org 79 | [osi]: https://opensource.org 80 | -------------------------------------------------------------------------------- /learners/reference.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Glossary' 3 | --- 4 | 5 | ## Glossary 6 | 7 | [aggregation function]{#aggregation-function} 8 | : A function that combines multiple values to produce a single new value (e.g. sum, mean, median). 9 | 10 | [atomic]{#atomic} 11 | : Describes a value *not* divisible into parts that one might want to 12 | work with separately. For example, if one wanted to work with 13 | first and last names separately, the values "Ada" and "Lovelace" 14 | would be atomic, but the value "Ada Lovelace" would not. 15 | 16 | [cascading delete]{#cascading-delete} 17 | : An [SQL](#sql) constraint requiring that if a given [record](#record) is deleted, 18 | all records referencing it (via [foreign key](#foreign-key)) in other [tables](#table) 19 | must also be deleted. 
20 | 21 | [case insensitive]{#case-insensitive} 22 | : Treating text as if upper and lower case characters were the same. 23 | See also: [case sensitive](#case-sensitive). 24 | 25 | [case sensitive]{#case-sensitive} 26 | : Treating upper and lower case characters as different. See also: [case insensitive](#case-insensitive). 27 | 28 | [comma-separated values (CSV)]{#comma-separated-values-csv} 29 | : A common textual representation for tables in which the values in each row are separated by commas. 30 | 31 | [cross product]{#cross-product} 32 | : A pairing of all elements of one set with all elements of another. 33 | 34 | [cursor]{#cursor} 35 | : A pointer into a database that keeps track of outstanding operations. 36 | 37 | [database manager]{#database-manager} 38 | : A program that manages a database, such as SQLite. 39 | 40 | [fields]{#fields} 41 | : A set of data values of a particular type, one for each [record](#record) in a [table](#table). 42 | 43 | [filter]{#filter} 44 | : To select only the records that meet certain conditions. 45 | 46 | [foreign key]{#foreign-key} 47 | : One or more values in a [database table](#table) that identify 48 | [records](#record) in another table. 49 | 50 | [prepared statement]{#prepared-statement} 51 | : A template for an [SQL](#sql) query in which some values can be filled in. 52 | 53 | [primary key]{#primary-key} 54 | : One or more [fields](#fields) in a [database table](#table) whose values are 55 | guaranteed to be unique for each [record](#record), i.e., whose values 56 | uniquely identify the entry. 57 | 58 | [query]{#query} 59 | : A textual description of a database operation. Queries are expressed in 60 | a special-purpose language called [SQL](#sql), and despite the name "query", 61 | they may modify or delete data as well as interrogate it. 62 | 63 | [record]{#record} 64 | : A set of related values making up a single entry in a [database table](#table), 65 | typically shown as a row. See also: [fields](#fields). 
66 | 67 | [referential integrity]{#referential-integrity} 68 | : The internal consistency of values in a database. If an entry in one table 69 | contains a [foreign key](#foreign-key), but the corresponding [records](#record) 70 | don't exist, referential integrity has been violated. 71 | 72 | [relational database]{#relational-database} 73 | : A collection of data organized into [tables](#table). 74 | 75 | [sentinel value]{#sentinel-value} 76 | : A value in a collection that has a special meaning, such as 999 to mean "age unknown". 77 | 78 | [SQL]{#sql} 79 | : A special-purpose language for describing operations on [relational databases](#relational-database). 80 | 81 | [SQL injection attack]{#sql-injection-attack} 82 | : An attack on a program in which the user's input contains malicious SQL statements. 83 | If this text is copied directly into an SQL statement, it will be executed in the database. 84 | 85 | [table]{#table} 86 | : A set of data in a [relational database](#relational-database) organized into a set 87 | of [records](#record), each having the same named [fields](#fields). 88 | 89 | [wildcard]{#wildcard} 90 | : A character used in pattern matching. In SQL's `like` operator, the wildcard "%" 91 | matches zero or more characters, so that `%able%` matches "fixable" and "tablets". 
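Several of these glossary terms (prepared statement, SQL injection, wildcard, primary key) can be illustrated together with a short sketch using Python's standard-library `sqlite3` module, which the lesson's programming episode also uses. The in-memory table mirrors the lesson's `Person` table; the snippet is an illustration, not part of the lesson's database setup.

```python
import sqlite3

# In-memory scratch database mirroring the lesson's Person table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id TEXT PRIMARY KEY, personal TEXT, family TEXT)")
conn.execute("INSERT INTO Person VALUES ('dyer', 'William', 'Dyer')")

# Prepared statement: the "?" placeholder is filled in by the database
# driver, so user input is treated as data, never as SQL (no injection).
family = "Dyer"
rows = conn.execute("SELECT id FROM Person WHERE family = ?", (family,)).fetchall()
print(rows)  # [('dyer',)]

# Wildcard: in LIKE patterns, "%" matches zero or more characters.
rows = conn.execute("SELECT family FROM Person WHERE family LIKE 'Dy%'").fetchall()
print(rows)  # [('Dyer',)]
```

Passing values through placeholders rather than pasting them into the query string is exactly what distinguishes a prepared statement from the injection-prone alternative.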
92 | 93 | 94 | -------------------------------------------------------------------------------- /.github/workflows/pr-receive.yaml: -------------------------------------------------------------------------------- 1 | name: "Receive Pull Request" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [opened, synchronize, reopened] 7 | 8 | concurrency: 9 | group: ${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | test-pr: 14 | name: "Record PR number" 15 | if: ${{ github.event.action != 'closed' }} 16 | runs-on: ubuntu-22.04 17 | outputs: 18 | is_valid: ${{ steps.check-pr.outputs.VALID }} 19 | steps: 20 | - name: "Record PR number" 21 | id: record 22 | if: ${{ always() }} 23 | run: | 24 | echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR 25 | - name: "Upload PR number" 26 | id: upload 27 | if: ${{ always() }} 28 | uses: actions/upload-artifact@v4 29 | with: 30 | name: pr 31 | path: ${{ github.workspace }}/NR 32 | - name: "Get Invalid Hashes File" 33 | id: hash 34 | run: | 35 | echo "json<<EOF" >> $GITHUB_OUTPUT 36 | curl "https://files.carpentries.org/invalid-hashes.json" >> $GITHUB_OUTPUT 37 | echo "EOF" >> $GITHUB_OUTPUT 38 | - name: "echo output" 39 | run: | 40 | echo "${{ steps.hash.outputs.json }}" 41 | - name: "Check PR" 42 | id: check-pr 43 | uses: carpentries/actions/check-valid-pr@main 44 | with: 45 | pr: ${{ github.event.number }} 46 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 47 | 48 | build-md-source: 49 | name: "Build markdown source files if valid" 50 | needs: test-pr 51 | runs-on: ubuntu-22.04 52 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 53 | env: 54 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 55 | RENV_PATHS_ROOT: ~/.local/share/renv/ 56 | CHIVE: ${{ github.workspace }}/site/chive 57 | PR: ${{ github.workspace }}/site/pr 58 | MD: ${{ github.workspace }}/site/built 59 | steps: 60 | - name: "Check Out Main Branch" 61 | uses: actions/checkout@v4 62 | 63 | - name: "Check Out Staging Branch" 64 | uses: actions/checkout@v4 65 | with: 66 | ref: md-outputs 67 | path: ${{ env.MD
}} 68 | 69 | - name: "Set up R" 70 | uses: r-lib/actions/setup-r@v2 71 | with: 72 | use-public-rspm: true 73 | install-r: false 74 | 75 | - name: "Set up Pandoc" 76 | uses: r-lib/actions/setup-pandoc@v2 77 | 78 | - name: "Setup Lesson Engine" 79 | uses: carpentries/actions/setup-sandpaper@main 80 | with: 81 | cache-version: ${{ secrets.CACHE_VERSION }} 82 | 83 | - name: "Setup Package Cache" 84 | uses: carpentries/actions/setup-lesson-deps@main 85 | with: 86 | cache-version: ${{ secrets.CACHE_VERSION }} 87 | 88 | - name: "Validate and Build Markdown" 89 | id: build-site 90 | run: | 91 | sandpaper::package_cache_trigger(TRUE) 92 | sandpaper::validate_lesson(path = '${{ github.workspace }}') 93 | sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE) 94 | shell: Rscript {0} 95 | 96 | - name: "Generate Artifacts" 97 | id: generate-artifacts 98 | run: | 99 | sandpaper:::ci_bundle_pr_artifacts( 100 | repo = '${{ github.repository }}', 101 | pr_number = '${{ github.event.number }}', 102 | path_md = '${{ env.MD }}', 103 | path_pr = '${{ env.PR }}', 104 | path_archive = '${{ env.CHIVE }}', 105 | branch = 'md-outputs' 106 | ) 107 | shell: Rscript {0} 108 | 109 | - name: "Upload PR" 110 | uses: actions/upload-artifact@v4 111 | with: 112 | name: pr 113 | path: ${{ env.PR }} 114 | overwrite: true 115 | 116 | - name: "Upload Diff" 117 | uses: actions/upload-artifact@v4 118 | with: 119 | name: diff 120 | path: ${{ env.CHIVE }} 121 | retention-days: 1 122 | 123 | - name: "Upload Build" 124 | uses: actions/upload-artifact@v4 125 | with: 126 | name: built 127 | path: ${{ env.MD }} 128 | retention-days: 1 129 | 130 | - name: "Teardown" 131 | run: sandpaper::reset_site() 132 | shell: Rscript {0} 133 | -------------------------------------------------------------------------------- /.github/workflows/update-cache.yaml: -------------------------------------------------------------------------------- 1 | name: "03 Maintain: Update Package Cache" 2 | 3 | on: 4 | 
workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'monthly run' 10 | schedule: 11 | # Run every tuesday 12 | - cron: '0 0 * * 2' 13 | 14 | jobs: 15 | preflight: 16 | name: "Preflight Check" 17 | runs-on: ubuntu-22.04 18 | outputs: 19 | ok: ${{ steps.check.outputs.ok }} 20 | steps: 21 | - id: check 22 | run: | 23 | if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then 24 | echo "ok=true" >> $GITHUB_OUTPUT 25 | echo "Running on request" 26 | # using single brackets here to avoid 08 being interpreted as octal 27 | # https://github.com/carpentries/sandpaper/issues/250 28 | elif [ `date +%d` -le 7 ]; then 29 | # If the Tuesday lands in the first week of the month, run it 30 | echo "ok=true" >> $GITHUB_OUTPUT 31 | echo "Running on schedule" 32 | else 33 | echo "ok=false" >> $GITHUB_OUTPUT 34 | echo "Not Running Today" 35 | fi 36 | 37 | check_renv: 38 | name: "Check if We Need {renv}" 39 | runs-on: ubuntu-22.04 40 | needs: preflight 41 | if: ${{ needs.preflight.outputs.ok == 'true'}} 42 | outputs: 43 | needed: ${{ steps.renv.outputs.exists }} 44 | steps: 45 | - name: "Checkout Lesson" 46 | uses: actions/checkout@v4 47 | - id: renv 48 | run: | 49 | if [[ -d renv ]]; then 50 | echo "exists=true" >> $GITHUB_OUTPUT 51 | fi 52 | 53 | check_token: 54 | name: "Check SANDPAPER_WORKFLOW token" 55 | runs-on: ubuntu-22.04 56 | needs: check_renv 57 | if: ${{ needs.check_renv.outputs.needed == 'true' }} 58 | outputs: 59 | workflow: ${{ steps.validate.outputs.wf }} 60 | repo: ${{ steps.validate.outputs.repo }} 61 | steps: 62 | - name: "validate token" 63 | id: validate 64 | uses: carpentries/actions/check-valid-credentials@main 65 | with: 66 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 67 | 68 | update_cache: 69 | name: "Update Package Cache" 70 | needs: check_token 71 | if: ${{ needs.check_token.outputs.repo== 'true' }} 72 | runs-on: ubuntu-22.04 73 | env: 74 | 
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 75 | RENV_PATHS_ROOT: ~/.local/share/renv/ 76 | steps: 77 | 78 | - name: "Checkout Lesson" 79 | uses: actions/checkout@v4 80 | 81 | - name: "Set up R" 82 | uses: r-lib/actions/setup-r@v2 83 | with: 84 | use-public-rspm: true 85 | install-r: false 86 | 87 | - name: "Update {renv} deps and determine if a PR is needed" 88 | id: update 89 | uses: carpentries/actions/update-lockfile@main 90 | with: 91 | cache-version: ${{ secrets.CACHE_VERSION }} 92 | 93 | - name: Create Pull Request 94 | id: cpr 95 | if: ${{ steps.update.outputs.n > 0 }} 96 | uses: carpentries/create-pull-request@main 97 | with: 98 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 99 | delete-branch: true 100 | branch: "update/packages" 101 | commit-message: "[actions] update ${{ steps.update.outputs.n }} packages" 102 | title: "Update ${{ steps.update.outputs.n }} packages" 103 | body: | 104 | :robot: This is an automated build 105 | 106 | This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions: 107 | 108 | ``` 109 | ${{ steps.update.outputs.report }} 110 | ``` 111 | 112 | :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates. 
113 | 114 | If you want to inspect these changes locally, you can use the following code to check out a new branch: 115 | 116 | ```bash 117 | git fetch origin update/packages 118 | git checkout update/packages 119 | ``` 120 | 121 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 122 | 123 | [1]: https://github.com/carpentries/create-pull-request/tree/main 124 | labels: "type: package cache" 125 | draft: false 126 | -------------------------------------------------------------------------------- /instructors/instructor-notes.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Instructor Notes 3 | --- 4 | 5 | > **database** (dā'tə-bās') noun: 6 | > "A collection of data arranged for ease and speed of search and retrieval by a computer" 7 | > 8 | > — The American Heritage® Science Dictionary 9 | > {: .quotation} 10 | 11 | - Three common options for storing data 12 | - Text 13 | - Easy to create, work well with version control 14 | - But then we have to build search and analysis tools ourselves 15 | - Spreadsheets 16 | - Good for simple analyses 17 | - But don't handle large or complex data sets well 18 | - Databases 19 | - Include powerful tools for search and analysis 20 | - Can handle large, complex data sets. 21 | 22 | ## Overall 23 | 24 | Relational databases are not as widely used in science as in business, 25 | but they are still a common way to store large data sets with complex structure. 26 | Even when the data itself isn't in a database, 27 | the metadata could be: 28 | for example, 29 | meteorological data might be stored in files on disk, 30 | but data about when and where observations were made, 31 | data ranges, 32 | and so on could be in a database 33 | to make it easier for scientists to find what they want to. 34 | 35 | - The first few sections (up to "Missing Data") usually go very quickly. 
36 | The pace usually slows down a bit when null values are discussed 37 | mostly because learners have a lot of details to keep straight by this point. 38 | Things *really* slow down during the discussion of joins, 39 | but this is the key idea in the whole lesson: 40 | important ideas like primary keys and referential integrity 41 | only make sense once learners have seen how they're used in joins. 42 | It's worth going over things a couple of times if necessary (with lots of examples). 43 | 44 | - The sections on creating and modifying data, 45 | and programming with databases, 46 | can be dropped if time is short. 47 | Of the two, 48 | people seem to care most about how to add data (which only takes a few minutes to demonstrate). 49 | 50 | - Simple calculations are actually easier to do in a spreadsheet; the 51 | advantages of using a database become clear as soon as filtering 52 | and joins are needed. Instructors may therefore want to show a 53 | spreadsheet with the information from the four database tables 54 | consolidated into a single sheet, and demonstrate what's needed in 55 | both systems to answer questions like, "What was the average 56 | radiation reading in 1931?" 57 | 58 | - Some advanced learners may have heard that NoSQL databases 59 | (i.e., ones that don't use the relational model) 60 | are the next big thing, 61 | and ask why we're not teaching those. 62 | The answers are: 63 | 64 | 1. Relational databases are far more widely used than NoSQL databases. 65 | 2. We have far more experience with relational databases than with any other kind, 66 | so we have a better idea of what to teach and how to teach it. 67 | 3. NoSQL databases are as different from each other as they are from relational databases. 68 | Until a leader emerges, it isn't clear *which* NoSQL database we should teach. 69 | 70 | ## Resources 71 | 72 | - `data/*.csv`: CSV versions of data in sample survey database. 
73 | - `bin/create-db.sql`: generates the survey database used in the examples from the CSV files. 74 | 75 | ## SQLite Setup 76 | 77 | In order to execute the following lessons interactively, 78 | please install SQLite as mentioned in the setup instructions for your workshop. 79 | Then: 80 | 81 | ```bash 82 | $ git clone https://github.com/swcarpentry/sql-novice-survey.git 83 | $ cd sql-novice-survey 84 | ``` 85 | 86 | Next, 87 | create the database that will be used: 88 | 89 | ```bash 90 | $ sqlite3 survey.sqlite '.read bin/create-db.sql' 91 | ``` 92 | 93 | This reads commands from `bin/create-db.sql`, 94 | which sets up the tables and loads data from the CSV files in the `data` directory. 95 | 96 | To run commands interactively, 97 | run SQLite on `survey.sqlite`: 98 | 99 | ```bash 100 | $ sqlite3 survey.sqlite 101 | SQLite version 3.8.5 2014-08-15 22:37:57 102 | Enter ".help" for usage hints. 103 | sqlite> 104 | ``` 105 | 106 | ## Troubleshooting 107 | 108 | The command history and line editing features provided by `readline` are 109 | invaluable with a command-line tool like `sqlite3`. Participants should be 110 | strongly encouraged to start with a simple SQL statement and then use the 111 | up-arrow key to go back and add clauses one at a time, or fix problems, rather 112 | than typing each command from scratch. Unfortunately, on some Linux and Mac OS X 113 | systems participants have found that the arrow keys do not scroll through the 114 | command history as expected.
115 | 116 | A workaround for this is to use the [rlwrap](https://github.com/hanslub42/rlwrap) 117 | (readline wrapper) command when starting SQLite: 118 | 119 | ```bash 120 | $ rlwrap sqlite3 survey.sqlite 121 | ``` 122 | 123 | The `rlwrap` package is available in the standard Fedora repository 124 | (but wasn't needed when I [@benwaugh] taught this) and appears to be 125 | available in [Ubuntu](https://packages.ubuntu.com/precise/rlwrap) too, 126 | and in [OS X using Homebrew](https://news.ycombinator.com/item?id=5087790). 127 | 128 | 129 | -------------------------------------------------------------------------------- /episodes/08-hygiene.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Data Hygiene 3 | teaching: 15 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain what an atomic value is. 10 | - Distinguish between atomic and non-atomic values. 11 | - Explain why every value in a database should be atomic. 12 | - Explain what a primary key is and why every record should have one. 13 | - Identify primary keys in database tables. 14 | - Explain why database entries should not contain redundant information. 15 | - Identify redundant information in databases. 16 | 17 | :::::::::::::::::::::::::::::::::::::::::::::::::: 18 | 19 | :::::::::::::::::::::::::::::::::::::::: questions 20 | 21 | - How should I format data in a database, and why? 22 | 23 | :::::::::::::::::::::::::::::::::::::::::::::::::: 24 | 25 | Now that we have seen how joins work, we can see why the relational 26 | model is so useful and how best to use it. The first rule is that 27 | every value should be [atomic](../learners/reference.md#atomic), i.e., not 28 | contain parts that we might want to work with separately.
We store 29 | personal and family names in separate columns instead of putting the 30 | entire name in one column so that we don't have to use substring 31 | operations to get the name's components. More importantly, we store 32 | the two parts of the name separately because splitting on spaces is 33 | unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel 34 | Steubart". 35 | 36 | The second rule is that every record should have a unique primary key. 37 | This can be a serial number that has no intrinsic meaning, 38 | one of the values in the record (like the `id` field in the `Person` table), 39 | or even a combination of values: 40 | the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement. 41 | 42 | The third rule is that there should be no redundant information. 43 | For example, 44 | we could get rid of the `Site` table and rewrite the `Visited` table like this: 45 | 46 | | id | lat | long | dated | 47 | | -------- | --------- | ---------- | ----------- | 48 | | 619 | \-49.85 | \-128.57 | 1927-02-08 | 49 | | 622 | \-49.85 | \-128.57 | 1927-02-10 | 50 | | 734 | \-47.15 | \-126.72 | 1930-01-07 | 51 | | 735 | \-47.15 | \-126.72 | 1930-01-12 | 52 | | 751 | \-47.15 | \-126.72 | 1930-02-26 | 53 | | 752 | \-47.15 | \-126.72 | \-null- | 54 | | 837 | \-48.87 | \-123.40 | 1932-01-14 | 55 | | 844 | \-49.85 | \-128.57 | 1932-03-22 | 56 | 57 | In fact, 58 | we could use a single table that recorded all the information about each reading in each row, 59 | just as a spreadsheet would. 60 | The problem is that it's very hard to keep data that is organized this way consistent: 61 | if we realize that the date of a particular visit to a particular site is wrong, 62 | we have to change multiple records in the database. 63 | What's worse, 64 | we may have to guess which records to change, 65 | since other sites may also have been visited on that date.
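The cost of that redundancy can be made concrete with a short sketch using Python's built-in `sqlite3` module (the same module used in the later episode on programming with databases). It fills a flattened table like the one shown, then counts how many rows a single coordinate correction has to touch; in the normalized layout the same correction changes one row. The `VisitedFlat` name is invented for this illustration, though the data values come from the lesson's tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flattened layout: site coordinates are repeated in every visit record.
conn.execute("CREATE TABLE VisitedFlat (id INTEGER PRIMARY KEY, lat REAL, long REAL, dated TEXT)")
conn.executemany("INSERT INTO VisitedFlat VALUES (?, ?, ?, ?)",
                 [(619, -49.85, -128.57, "1927-02-08"),
                  (622, -49.85, -128.57, "1927-02-10"),
                  (844, -49.85, -128.57, "1932-03-22")])
# Correcting one site's latitude must touch every duplicated row.
flat = conn.execute("UPDATE VisitedFlat SET lat = -49.84 WHERE lat = -49.85")
print(flat.rowcount)  # 3

# Normalized layout: coordinates are stored once in Site,
# and Visited refers to the site by its primary key.
conn.execute("CREATE TABLE Site (name TEXT PRIMARY KEY, lat REAL, long REAL)")
conn.execute("CREATE TABLE Visited (id INTEGER PRIMARY KEY, site TEXT REFERENCES Site(name), dated TEXT)")
conn.execute("INSERT INTO Site VALUES ('DR-1', -49.85, -128.57)")
conn.executemany("INSERT INTO Visited VALUES (?, ?, ?)",
                 [(619, "DR-1", "1927-02-08"),
                  (622, "DR-1", "1927-02-10"),
                  (844, "DR-1", "1932-03-22")])
# The same correction is now a single-row change.
norm = conn.execute("UPDATE Site SET lat = -49.84 WHERE name = 'DR-1'")
print(norm.rowcount)  # 1
```

Note that the flattened update also relied on matching the old (wrong) value everywhere, which is exactly the guessing problem described above.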
66 | 67 | The fourth rule is that the units for every value should be stored explicitly. 68 | Our database doesn't do this, 69 | and that's a problem: 70 | Roerich's salinity measurements are several orders of magnitude larger than anyone else's, 71 | but we don't know if that means she was using parts per million instead of parts per thousand, 72 | or whether there actually was a saline anomaly at that site in 1932. 73 | 74 | Stepping back, 75 | data and the tools used to store it have a symbiotic relationship: 76 | we use tables and joins because it's efficient, 77 | provided our data is organized a certain way, 78 | but organize our data that way because we have tools to manipulate it efficiently. 79 | As anthropologists say, 80 | the tool shapes the hand that shapes the tool. 81 | 82 | ::::::::::::::::::::::::::::::::::::::: challenge 83 | 84 | ## Identifying Atomic Values 85 | 86 | Which of the following are atomic values? Which are not? Why? 87 | 88 | - New Zealand 89 | - 87 Turing Avenue 90 | - January 25, 1971 91 | - the XY coordinate (0.5, 3.3) 92 | 93 | ::::::::::::::: solution 94 | 95 | ## Solution 96 | 97 | New Zealand is the only clear-cut atomic value. 98 | 99 | The address and the XY coordinate contain more than one piece of information 100 | which should be stored separately: 101 | 102 | - House number, street name 103 | - X coordinate, Y coordinate 104 | 105 | The date entry is less clear cut, because it contains month, day, and year elements. 106 | However, there is a `DATE` datatype in SQL, and dates should be stored using this format. 107 | If we need to work with the month, day, or year separately, we can use the SQL functions available for our database software 108 | (for example [`EXTRACT`](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm) or [`STRFTIME`](https://www.sqlite.org/lang_datefunc.html) for SQLite). 
109 | 110 | 111 | 112 | ::::::::::::::::::::::::: 113 | 114 | :::::::::::::::::::::::::::::::::::::::::::::::::: 115 | 116 | ::::::::::::::::::::::::::::::::::::::: challenge 117 | 118 | ## Identifying a Primary Key 119 | 120 | What is the primary key in this table? 121 | I.e., what value or combination of values uniquely identifies a record? 122 | 123 | | latitude | longitude | date | temperature | 124 | | -------- | --------- | ---------- | ----------- | 125 | | 57\.3 | \-22.5 | 2015-01-09 | \-14.2 | 126 | 127 | ::::::::::::::: solution 128 | 129 | ## Solution 130 | 131 | Latitude, longitude, and date are all required to uniquely identify the temperature record. 132 | 133 | 134 | 135 | ::::::::::::::::::::::::: 136 | 137 | :::::::::::::::::::::::::::::::::::::::::::::::::: 138 | 139 | :::::::::::::::::::::::::::::::::::::::: keypoints 140 | 141 | - Every value in a database should be atomic. 142 | - Every record should have a unique primary key. 143 | - A database should not contain redundant information. 144 | - Units and similar metadata should be stored with the data. 
145 | 146 | :::::::::::::::::::::::::::::::::::::::::::::::::: 147 | 148 | 149 | -------------------------------------------------------------------------------- /episodes/fig/sql-join-structure.svg: -------------------------------------------------------------------------------- 1 | 2 | 5 | ]> 6 | 8 | 9 | 11 | 12 | G 13 | 14 | 15 | Person 16 | 17 | Person 18 | 19 | id 20 | 21 | text 22 | 23 | personal 24 | 25 | text 26 | 27 | family 28 | 29 | text 30 | 31 | 32 | Survey 33 | 34 | Survey 35 | 36 | taken 37 | 38 | integer 39 | 40 | person 41 | 42 | text 43 | 44 | quant 45 | 46 | text 47 | 48 | reading 49 | 50 | real 51 | 52 | 53 | Person->Survey 54 | 55 | 56 | 57 | 58 | Site 59 | 60 | Site 61 | 62 | name 63 | 64 | text 65 | 66 | lat 67 | 68 | real 69 | 70 | long 71 | 72 | real 73 | 74 | 75 | Visited 76 | 77 | Visited 78 | 79 | id 80 | 81 | integer 82 | 83 | site 84 | 85 | text 86 | 87 | dated 88 | 89 | text 90 | 91 | 92 | Site->Visited 93 | 94 | 95 | 96 | 97 | Visited->Survey 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing 2 | 3 | [The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data 4 | Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source 5 | projects, and we welcome contributions of all kinds: new lessons, fixes to 6 | existing material, bug reports, and reviews of proposed changes are all 7 | welcome. 8 | 9 | ### Contributor Agreement 10 | 11 | By contributing, you agree that we may redistribute your work under [our 12 | license](LICENSE.md). In exchange, we will address your issues and/or assess 13 | your change proposal as promptly as we can, and help you become a member of our 14 | community. Everyone involved in [The Carpentries][cp-site] agrees to abide by 15 | our [code of conduct](CODE_OF_CONDUCT.md). 
16 | 17 | ### How to Contribute 18 | 19 | The easiest way to get started is to file an issue to tell us about a spelling 20 | mistake, some awkward wording, or a factual error. This is a good way to 21 | introduce yourself and to meet some of our community members. 22 | 23 | 1. If you do not have a [GitHub][github] account, you can [send us comments by 24 | email][contact]. However, we will be able to respond more quickly if you use 25 | one of the other methods described below. 26 | 27 | 2. If you have a [GitHub][github] account, or are willing to [create 28 | one][github-join], but do not know how to use Git, you can report problems 29 | or suggest improvements by [creating an issue][repo-issues]. This allows us 30 | to assign the item to someone and to respond to it in a threaded discussion. 31 | 32 | 3. If you are comfortable with Git, and would like to add or change material, 33 | you can submit a pull request (PR). Instructions for doing this are 34 | [included below](#using-github). For inspiration about changes that need to 35 | be made, check out the [list of open issues][issues] across the Carpentries. 36 | 37 | Note: if you want to build the website locally, please refer to [The Workbench 38 | documentation][template-doc]. 39 | 40 | ### Where to Contribute 41 | 42 | 1. If you wish to change this lesson, add issues and pull requests here. 43 | 2. If you wish to change the template used for workshop websites, please refer 44 | to [The Workbench documentation][template-doc]. 45 | 46 | 47 | ### What to Contribute 48 | 49 | There are many ways to contribute, from writing new exercises and improving 50 | existing ones to updating or filling in the documentation and submitting [bug 51 | reports][issues] about things that do not work, are not clear, or are missing. 
52 | If you are looking for ideas, please see [the list of issues for this 53 | repository][repo-issues], or the issues for [Data Carpentry][dc-issues], 54 | [Library Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. 55 | 56 | Comments on issues and reviews of pull requests are just as welcome: we are 57 | smarter together than we are on our own. **Reviews from novices and newcomers 58 | are particularly valuable**: it's easy for people who have been using these 59 | lessons for a while to forget how impenetrable some of this material can be, so 60 | fresh eyes are always welcome. 61 | 62 | ### What *Not* to Contribute 63 | 64 | Our lessons already contain more material than we can cover in a typical 65 | workshop, so we are usually *not* looking for more concepts or tools to add to 66 | them. As a rule, if you want to introduce a new idea, you must (a) estimate how 67 | long it will take to teach and (b) explain what you would take out to make room 68 | for it. The first encourages contributors to be honest about requirements; the 69 | second, to think hard about priorities. 70 | 71 | We are also not looking for exercises or other material that only run on one 72 | platform. Our workshops typically contain a mixture of Windows, macOS, and 73 | Linux users; in order to be usable, our lessons must run equally well on all 74 | three. 75 | 76 | ### Using GitHub 77 | 78 | If you choose to contribute via GitHub, you may want to look at [How to 79 | Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we 80 | use [GitHub flow][github-flow] to manage changes: 81 | 82 | 1. Create a new branch in your desktop copy of this repository for each 83 | significant change. 84 | 2. Commit the change in that branch. 85 | 3. Push that branch to your fork of this repository on GitHub. 86 | 4. Submit a pull request from that branch to the [upstream repository][repo]. 87 | 5. 
If you receive feedback, make changes on your desktop and push to your 88 | branch on GitHub: the pull request will update automatically. 89 | 90 | NB: The published copy of the lesson is usually in the `main` branch. 91 | 92 | Each lesson has a team of maintainers who review issues and pull requests or 93 | encourage others to do so. The maintainers are community volunteers, and have 94 | final say over what gets merged into the lesson. 95 | 96 | ### Other Resources 97 | 98 | The Carpentries is a global organisation with volunteers and learners all over 99 | the world. We share values of inclusivity and a passion for sharing knowledge, 100 | teaching and learning. There are several ways to connect with The Carpentries 101 | community listed at including via social 102 | media, slack, newsletters, and email lists. You can also [reach us by 103 | email][contact]. 104 | 105 | [repo]: https://github.com/swcarpentry/sql-novice-survey 106 | [repo-issues]: https://github.com/swcarpentry/sql-novice-survey/issues 107 | [contact]: mailto:team@carpentries.org 108 | [cp-site]: https://carpentries.org/ 109 | [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry 110 | [dc-lessons]: https://datacarpentry.org/lessons/ 111 | [dc-site]: https://datacarpentry.org/ 112 | [discuss-list]: https://lists.software-carpentry.org/listinfo/discuss 113 | [github]: https://github.com 114 | [github-flow]: https://guides.github.com/introduction/flow/ 115 | [github-join]: https://github.com/join 116 | [how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github 117 | [issues]: https://carpentries.org/help-wanted-issues/ 118 | [lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry 119 | [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry 120 | [swc-lessons]: https://software-carpentry.org/lessons/ 121 | [swc-site]: https://software-carpentry.org/ 122 | [lc-site]: https://librarycarpentry.org/ 123 | [template-doc]: 
https://carpentries.github.io/workbench/ 124 | -------------------------------------------------------------------------------- /.github/workflows/pr-comment.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Comment on the Pull Request" 2 | 3 | # read-write repo token 4 | # access to secrets 5 | on: 6 | workflow_run: 7 | workflows: ["Receive Pull Request"] 8 | types: 9 | - completed 10 | 11 | concurrency: 12 | group: pr-${{ github.event.workflow_run.pull_requests[0].number }} 13 | cancel-in-progress: true 14 | 15 | 16 | jobs: 17 | # Pull requests are valid if: 18 | # - they match the sha of the workflow run head commit 19 | # - they are open 20 | # - no .github files were committed 21 | test-pr: 22 | name: "Test if pull request is valid" 23 | runs-on: ubuntu-22.04 24 | if: > 25 | github.event.workflow_run.event == 'pull_request' && 26 | github.event.workflow_run.conclusion == 'success' 27 | outputs: 28 | is_valid: ${{ steps.check-pr.outputs.VALID }} 29 | payload: ${{ steps.check-pr.outputs.payload }} 30 | number: ${{ steps.get-pr.outputs.NUM }} 31 | msg: ${{ steps.check-pr.outputs.MSG }} 32 | steps: 33 | - name: 'Download PR artifact' 34 | id: dl 35 | uses: carpentries/actions/download-workflow-artifact@main 36 | with: 37 | run: ${{ github.event.workflow_run.id }} 38 | name: 'pr' 39 | 40 | - name: "Get PR Number" 41 | if: ${{ steps.dl.outputs.success == 'true' }} 42 | id: get-pr 43 | run: | 44 | unzip pr.zip 45 | echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT 46 | 47 | - name: "Fail if PR number was not present" 48 | id: bad-pr 49 | if: ${{ steps.dl.outputs.success != 'true' }} 50 | run: | 51 | echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.' 
52 | exit 1 53 | - name: "Get Invalid Hashes File" 54 | id: hash 55 | run: | 56 | echo "json<<EOF" >> $GITHUB_OUTPUT 57 | echo "$(curl -sL https://files.carpentries.org/invalid-hashes.json)" >> $GITHUB_OUTPUT 58 | echo "EOF" >> $GITHUB_OUTPUT 59 | - name: "Check PR" 60 | id: check-pr 61 | if: ${{ steps.dl.outputs.success == 'true' }} 62 | uses: carpentries/actions/check-valid-pr@main 63 | with: 64 | pr: ${{ steps.get-pr.outputs.NUM }} 65 | sha: ${{ github.event.workflow_run.head_sha }} 66 | headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire 67 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 68 | fail_on_error: true 69 | 70 | # Create an orphan branch on this repository with two commits 71 | # - the current HEAD of the md-outputs branch 72 | # - the output from running the current HEAD of the pull request through 73 | # the md generator 74 | create-branch: 75 | name: "Create Git Branch" 76 | needs: test-pr 77 | runs-on: ubuntu-22.04 78 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 79 | env: 80 | NR: ${{ needs.test-pr.outputs.number }} 81 | permissions: 82 | contents: write 83 | steps: 84 | - name: 'Checkout md outputs' 85 | uses: actions/checkout@v4 86 | with: 87 | ref: md-outputs 88 | path: built 89 | fetch-depth: 1 90 | 91 | - name: 'Download built markdown' 92 | id: dl 93 | uses: carpentries/actions/download-workflow-artifact@main 94 | with: 95 | run: ${{ github.event.workflow_run.id }} 96 | name: 'built' 97 | 98 | - if: ${{ steps.dl.outputs.success == 'true' }} 99 | run: unzip built.zip 100 | 101 | - name: "Create orphan and push" 102 | if: ${{ steps.dl.outputs.success == 'true' }} 103 | run: | 104 | cd built/ 105 | git config --local user.email "actions@github.com" 106 | git config --local user.name "GitHub Actions" 107 | CURR_HEAD=$(git rev-parse HEAD) 108 | git checkout --orphan md-outputs-PR-${NR} 109 | git add -A 110 | git commit -m "source commit: ${CURR_HEAD}" 111 | ls -A | grep -v '^.git$' | xargs -I _ rm -r '_' 112 | cd .. 
113 | unzip -o -d built built.zip 114 | cd built 115 | git add -A 116 | git commit --allow-empty -m "differences for PR #${NR}" 117 | git push -u --force --set-upstream origin md-outputs-PR-${NR} 118 | 119 | # Comment on the Pull Request with a link to the branch and the diff 120 | comment-pr: 121 | name: "Comment on Pull Request" 122 | needs: [test-pr, create-branch] 123 | runs-on: ubuntu-22.04 124 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 125 | env: 126 | NR: ${{ needs.test-pr.outputs.number }} 127 | permissions: 128 | pull-requests: write 129 | steps: 130 | - name: 'Download comment artifact' 131 | id: dl 132 | uses: carpentries/actions/download-workflow-artifact@main 133 | with: 134 | run: ${{ github.event.workflow_run.id }} 135 | name: 'diff' 136 | 137 | - if: ${{ steps.dl.outputs.success == 'true' }} 138 | run: unzip ${{ github.workspace }}/diff.zip 139 | 140 | - name: "Comment on PR" 141 | id: comment-diff 142 | if: ${{ steps.dl.outputs.success == 'true' }} 143 | uses: carpentries/actions/comment-diff@main 144 | with: 145 | pr: ${{ env.NR }} 146 | path: ${{ github.workspace }}/diff.md 147 | 148 | # Comment if the PR is open and matches the SHA, but the workflow files have 149 | # changed 150 | comment-changed-workflow: 151 | name: "Comment if workflow files have changed" 152 | needs: test-pr 153 | runs-on: ubuntu-22.04 154 | if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }} 155 | env: 156 | NR: ${{ github.event.workflow_run.pull_requests[0].number }} 157 | body: ${{ needs.test-pr.outputs.msg }} 158 | permissions: 159 | pull-requests: write 160 | steps: 161 | - name: 'Check for spoofing' 162 | id: dl 163 | uses: carpentries/actions/download-workflow-artifact@main 164 | with: 165 | run: ${{ github.event.workflow_run.id }} 166 | name: 'built' 167 | 168 | - name: 'Alert if spoofed' 169 | id: spoof 170 | if: ${{ steps.dl.outputs.success == 'true' }} 171 | run: | 172 | echo 'body<<EOF' >> $GITHUB_ENV 173 | echo '' >> $GITHUB_ENV 174 | echo '## 
:x: DANGER :x:' >> $GITHUB_ENV 175 | echo 'This pull request has modified workflows that created output. Close this now.' >> $GITHUB_ENV 176 | echo '' >> $GITHUB_ENV 177 | echo 'EOF' >> $GITHUB_ENV 178 | 179 | - name: "Comment on PR" 180 | id: comment-diff 181 | uses: carpentries/actions/comment-diff@main 182 | with: 183 | pr: ${{ env.NR }} 184 | body: ${{ env.body }} 185 | -------------------------------------------------------------------------------- /episodes/04-calc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Calculating New Values 3 | teaching: 5 4 | exercises: 5 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that calculate new values for each selected record. 10 | 11 | :::::::::::::::::::::::::::::::::::::::::::::::::: 12 | 13 | :::::::::::::::::::::::::::::::::::::::: questions 14 | 15 | - How can I calculate new values on the fly? 16 | 17 | :::::::::::::::::::::::::::::::::::::::::::::::::: 18 | 19 | After carefully re-reading the expedition logs, 20 | we realize that the radiation measurements they report 21 | may need to be corrected upward by 5%. 22 | Rather than modifying the stored data, 23 | we can do this calculation on the fly 24 | as part of our query: 25 | 26 | ```sql 27 | SELECT 1.05 * reading FROM Survey WHERE quant = 'rad'; 28 | ``` 29 | 30 | | 1\.05 \* reading | 31 | | ------------------------- | 32 | | 10\.311 | 33 | | 8\.19 | 34 | | 8\.8305 | 35 | | 7\.581 | 36 | | 4\.5675 | 37 | | 2\.2995 | 38 | | 1\.533 | 39 | | 11\.8125 | 40 | 41 | When we run the query, 42 | the expression `1.05 * reading` is evaluated for each row. 43 | Expressions can use any of the fields, 44 | all of the usual arithmetic operators, 45 | and a variety of common functions. 46 | (Exactly which ones depends on which database manager is being used.)
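As an aside (not part of the lesson's `survey.db`), you can check that such calculated values are computed on the fly and never written back to the table, for example with Python's built-in `sqlite3` module and a throwaway in-memory table seeded with the first two radiation readings:

```python
import sqlite3

# Throwaway in-memory database with a miniature Survey table,
# standing in for the lesson's survey.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (taken INTEGER, quant TEXT, reading REAL)")
conn.executemany("INSERT INTO Survey VALUES (?, ?, ?)",
                 [(619, "rad", 9.82), (622, "rad", 7.80)])

# The correction is part of the query, not the data ...
corrected = conn.execute(
    "SELECT 1.05 * reading FROM Survey WHERE quant = 'rad'").fetchall()

# ... so the stored readings are unchanged afterwards.
stored = conn.execute("SELECT reading FROM Survey").fetchall()
print(corrected)
print(stored)  # still the original readings
```

The same holds in the `sqlite3` shell: a calculated column exists only in the query's output.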
47 | For example, 48 | we can convert temperature readings from Fahrenheit to Celsius 49 | and round to two decimal places: 50 | 51 | ```sql 52 | SELECT taken, round(5 * (reading - 32) / 9, 2) FROM Survey WHERE quant = 'temp'; 53 | ``` 54 | 55 | | taken | round(5\*(reading-32)/9, 2) | 56 | | ------------------------- | -------------------------- | 57 | | 734 | \-29.72 | 58 | | 735 | \-32.22 | 59 | | 751 | \-28.06 | 60 | | 752 | \-26.67 | 61 | 62 | As you can see from this example, though, the string describing our 63 | new field (generated from the equation) can become quite unwieldy. SQL 64 | allows us to rename our fields, any field for that matter, whether it 65 | was calculated or one of the existing fields in our database, for 66 | succinctness and clarity. For example, we could write the previous 67 | query as: 68 | 69 | ```sql 70 | SELECT taken, round(5 * (reading - 32) / 9, 2) as Celsius FROM Survey WHERE quant = 'temp'; 71 | ``` 72 | 73 | | taken | Celsius | 74 | | ------------------------- | -------------------------- | 75 | | 734 | \-29.72 | 76 | | 735 | \-32.22 | 77 | | 751 | \-28.06 | 78 | | 752 | \-26.67 | 79 | 80 | We can also combine values from different fields, 81 | for example by using the string concatenation operator `||`: 82 | 83 | ```sql 84 | SELECT personal || ' ' || family FROM Person; 85 | ``` 86 | 87 | | personal || ' ' || family | 88 | | ------------------------- | 89 | | William Dyer | 90 | | Frank Pabodie | 91 | | Anderson Lake | 92 | | Valentina Roerich | 93 | | Frank Danforth | 94 | 95 | ::::::::::::::::::::::::::::::::::::::: challenge 96 | 97 | ## Fixing Salinity Readings 98 | 99 | After further reading, 100 | we realize that Valentina Roerich 101 | was reporting salinity as percentages. 102 | Write a query that returns all of her salinity measurements 103 | from the `Survey` table 104 | with the values divided by 100. 
105 | 106 | ::::::::::::::: solution 107 | 108 | ## Solution 109 | 110 | ```sql 111 | SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal'; 112 | ``` 113 | 114 | | taken | reading / 100 | 115 | | ------------------------- | -------------------------- | 116 | | 752 | 0\.416 | 117 | | 837 | 0\.225 | 118 | 119 | ::::::::::::::::::::::::: 120 | 121 | :::::::::::::::::::::::::::::::::::::::::::::::::: 122 | 123 | ::::::::::::::::::::::::::::::::::::::: challenge 124 | 125 | ## Unions 126 | 127 | The `UNION` operator combines the results of two queries: 128 | 129 | ```sql 130 | SELECT * FROM Person WHERE id = 'dyer' UNION SELECT * FROM Person WHERE id = 'roe'; 131 | ``` 132 | 133 | | id | personal | family | 134 | | ------------------------- | -------------------------- | ------- | 135 | | dyer | William | Dyer | 136 | | roe | Valentina | Roerich | 137 | 138 | The `UNION ALL` operator is equivalent to `UNION`, 139 | except that it does not eliminate duplicate rows. 140 | Instead, `UNION ALL` simply appends all rows from both queries 141 | and combines them into a single result table. 142 | `UNION`, by contrast, performs a `SELECT DISTINCT` on the result set. 143 | If all the records returned by your union are already unique, 144 | use `UNION ALL` instead; it gives faster results since it skips the `DISTINCT` step. 145 | 146 | For this section, we shall use `UNION`. 147 | 148 | Use `UNION` to create a consolidated list of salinity measurements 149 | in which Valentina Roerich's, and only Valentina's, 150 | have been corrected as described in the previous challenge. 
151 | The output should be something like: 152 | 153 | | taken | reading | 154 | | ------------------------- | -------------------------- | 155 | | 619 | 0\.13 | 156 | | 622 | 0\.09 | 157 | | 734 | 0\.05 | 158 | | 751 | 0\.1 | 159 | | 752 | 0\.09 | 160 | | 752 | 0\.416 | 161 | | 837 | 0\.21 | 162 | | 837 | 0\.225 | 163 | 164 | ::::::::::::::: solution 165 | 166 | ## Solution 167 | 168 | ```sql 169 | SELECT taken, reading FROM Survey WHERE person != 'roe' AND quant = 'sal' UNION SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal' ORDER BY taken ASC; 170 | ``` 171 | 172 | ::::::::::::::::::::::::: 173 | 174 | :::::::::::::::::::::::::::::::::::::::::::::::::: 175 | 176 | ::::::::::::::::::::::::::::::::::::::: challenge 177 | 178 | ## Selecting Major Site Identifiers 179 | 180 | The site identifiers in the `Visited` table have two parts 181 | separated by a '-': 182 | 183 | ```sql 184 | SELECT DISTINCT site FROM Visited; 185 | ``` 186 | 187 | | site | 188 | | ------------------------- | 189 | | DR-1 | 190 | | DR-3 | 191 | | MSK-4 | 192 | 193 | Some major site identifiers (i.e. the letter codes) are two letters long and some are three. 194 | The "in string" function `instr(X, Y)` 195 | returns the 1-based index of the first occurrence of string Y in string X, 196 | or 0 if Y does not exist in X. 197 | The substring function `substr(X, I, [L])` 198 | returns the substring of X starting at index I, with an optional length L. 199 | Use these two functions to produce a list of unique major site identifiers. 200 | (For this data, 201 | the list should contain only "DR" and "MSK"). 
202 | 203 | ::::::::::::::: solution 204 | 205 | ## Solution 206 | 207 | ```sql 208 | SELECT DISTINCT substr(site, 1, instr(site, '-') - 1) AS MajorSite FROM Visited; 209 | ``` 210 | 211 | ::::::::::::::::::::::::: 212 | 213 | :::::::::::::::::::::::::::::::::::::::::::::::::: 214 | 215 | :::::::::::::::::::::::::::::::::::::::: keypoints 216 | 217 | - Queries can do the usual arithmetic operations on values. 218 | - Use UNION to combine the results of two or more queries. 219 | 220 | :::::::::::::::::::::::::::::::::::::::::::::::::: 221 | 222 | 223 | -------------------------------------------------------------------------------- /episodes/02-sort-dup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Sorting and Removing Duplicates 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that display results in a particular order. 10 | - Write queries that eliminate duplicate values from data. 11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I sort a query's results? 17 | - How can I remove duplicate values from a query's results? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | In beginning our examination of the Antarctic data, we want to know: 22 | 23 | - what kind of quantity measurements were taken at each site; 24 | - which scientists took measurements on the expedition; 25 | 26 | To determine which measurements were taken at each site, 27 | we can examine the `Survey` table. 28 | Data is often redundant, 29 | so queries often return redundant information. 
30 | For example, 31 | if we select the quantities that have been measured 32 | from the `Survey` table, 33 | we get this: 34 | 35 | ```sql 36 | SELECT quant FROM Survey; 37 | ``` 38 | 39 | | quant | 40 | | ---------- | 41 | | rad | 42 | | sal | 43 | | rad | 44 | | sal | 45 | | rad | 46 | | sal | 47 | | temp | 48 | | rad | 49 | | sal | 50 | | temp | 51 | | rad | 52 | | temp | 53 | | sal | 54 | | rad | 55 | | sal | 56 | | temp | 57 | | sal | 58 | | rad | 59 | | sal | 60 | | sal | 61 | | rad | 62 | 63 | This result makes it difficult to see all of the different types of 64 | `quant` in the Survey table. We can eliminate the redundant output to 65 | make the result more readable by adding the `DISTINCT` keyword to our 66 | query: 67 | 68 | ```sql 69 | SELECT DISTINCT quant FROM Survey; 70 | ``` 71 | 72 | | quant | 73 | | ---------- | 74 | | rad | 75 | | sal | 76 | | temp | 77 | 78 | If we want to determine which visits (stored in the `taken` column) 79 | have which `quant` measurements, 80 | we can use the `DISTINCT` keyword on multiple columns. 81 | If we select more than one column, 82 | distinct *sets* of values are returned 83 | (in this case *pairs*, because we are selecting two columns): 84 | 85 | ```sql 86 | SELECT DISTINCT taken, quant FROM Survey; 87 | ``` 88 | 89 | | taken | quant | 90 | | ---------- | --------- | 91 | | 619 | rad | 92 | | 619 | sal | 93 | | 622 | rad | 94 | | 622 | sal | 95 | | 734 | rad | 96 | | 734 | sal | 97 | | 734 | temp | 98 | | 735 | rad | 99 | | 735 | sal | 100 | | 735 | temp | 101 | | 751 | rad | 102 | | 751 | temp | 103 | | 751 | sal | 104 | | 752 | rad | 105 | | 752 | sal | 106 | | 752 | temp | 107 | | 837 | rad | 108 | | 837 | sal | 109 | | 844 | rad | 110 | 111 | Notice in both cases that duplicates are removed 112 | even if the rows they come from didn't appear to be adjacent in the database table. 113 | 114 | Our next task is to identify the scientists on the expedition by looking at the `Person` table. 
115 | As we mentioned earlier, 116 | database records are not stored in any particular order. 117 | This means that query results aren't necessarily sorted, 118 | and even if they are, 119 | we often want to sort them in a different way, 120 | e.g., by their identifier instead of by their personal name. 121 | We can do this in SQL by adding an `ORDER BY` clause to our query: 122 | 123 | ```sql 124 | SELECT * FROM Person ORDER BY id; 125 | ``` 126 | 127 | | id | personal | family | 128 | | ---------- | --------- | -------- | 129 | | danfort | Frank | Danforth | 130 | | dyer | William | Dyer | 131 | | lake | Anderson | Lake | 132 | | pb | Frank | Pabodie | 133 | | roe | Valentina | Roerich | 134 | 135 | By default, when we use `ORDER BY`, 136 | results are sorted in ascending order of the column we specify 137 | (i.e., 138 | from least to greatest). 139 | 140 | We can sort in the opposite order using `DESC` (for "descending"): 141 | 142 | ::::::::::::::::::::::::::::::::::::::::: callout 143 | 144 | ## A note on ordering 145 | 146 | While it may look as though the records are consistent every time we ask for them in this lesson, that is because no one has changed or modified any of the data so far. Remember to use `ORDER BY` if you want the rows returned to have any sort of consistent or predictable order. 147 | 148 | 149 | :::::::::::::::::::::::::::::::::::::::::::::::::: 150 | 151 | ```sql 152 | SELECT * FROM person ORDER BY id DESC; 153 | ``` 154 | 155 | | id | personal | family | 156 | | ---------- | --------- | -------- | 157 | | roe | Valentina | Roerich | 158 | | pb | Frank | Pabodie | 159 | | lake | Anderson | Lake | 160 | | dyer | William | Dyer | 161 | | danfort | Frank | Danforth | 162 | 163 | (And if we want to make it clear that we're sorting in ascending order, 164 | we can use `ASC` instead of `DESC`.) 165 | 166 | In order to look at which scientist measured quantities during each visit, 167 | we can look again at the `Survey` table. 
168 | We can also sort on several fields at once. 169 | For example, 170 | this query sorts results first in ascending order by `taken`, 171 | and then in descending order by `person` 172 | within each group of equal `taken` values: 173 | 174 | ```sql 175 | SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC; 176 | ``` 177 | 178 | | taken | person | quant | 179 | | ---------- | --------- | -------- | 180 | | 619 | dyer | rad | 181 | | 619 | dyer | sal | 182 | | 622 | dyer | rad | 183 | | 622 | dyer | sal | 184 | | 734 | pb | rad | 185 | | 734 | pb | temp | 186 | | 734 | lake | sal | 187 | | 735 | pb | rad | 188 | | 735 | \-null- | sal | 189 | | 735 | \-null- | temp | 190 | | 751 | pb | rad | 191 | | 751 | pb | temp | 192 | | 751 | lake | sal | 193 | | 752 | roe | sal | 194 | | 752 | lake | rad | 195 | | 752 | lake | sal | 196 | | 752 | lake | temp | 197 | | 837 | roe | sal | 198 | | 837 | lake | rad | 199 | | 837 | lake | sal | 200 | | 844 | roe | rad | 201 | 202 | This query gives us a good idea of which scientist was involved in which visit, 203 | and what measurements they performed during the visit. 204 | 205 | Looking at the table, it seems like some scientists specialized in 206 | certain kinds of measurements. We can examine which scientists 207 | performed which measurements by selecting the appropriate columns and 208 | removing duplicates. 209 | 210 | ```sql 211 | SELECT DISTINCT quant, person FROM Survey ORDER BY quant ASC; 212 | ``` 213 | 214 | | quant | person | 215 | | ---------- | --------- | 216 | | rad | dyer | 217 | | rad | pb | 218 | | rad | lake | 219 | | rad | roe | 220 | | sal | dyer | 221 | | sal | lake | 222 | | sal | \-null- | 223 | | sal | roe | 224 | | temp | pb | 225 | | temp | \-null- | 226 | | temp | lake | 227 | 228 | ::::::::::::::::::::::::::::::::::::::: challenge 229 | 230 | ## Finding Distinct Dates 231 | 232 | Write a query that selects distinct dates from the `Visited` table. 
233 | 234 | ::::::::::::::: solution 235 | 236 | ## Solution 237 | 238 | ```sql 239 | SELECT DISTINCT dated FROM Visited; 240 | ``` 241 | 242 | | dated | 243 | | ---------- | 244 | | 1927-02-08 | 245 | | 1927-02-10 | 246 | | 1930-01-07 | 247 | | 1930-01-12 | 248 | | 1930-02-26 | 249 | |   | 250 | | 1932-01-14 | 251 | | 1932-03-22 | 252 | 253 | ::::::::::::::::::::::::: 254 | 255 | :::::::::::::::::::::::::::::::::::::::::::::::::: 256 | 257 | ::::::::::::::::::::::::::::::::::::::: challenge 258 | 259 | ## Displaying Full Names 260 | 261 | Write a query that displays the full names of the scientists in the `Person` table, 262 | ordered by family name. 263 | 264 | ::::::::::::::: solution 265 | 266 | ## Solution 267 | 268 | ```sql 269 | SELECT personal, family FROM Person ORDER BY family ASC; 270 | ``` 271 | 272 | | personal | family | 273 | | ---------- | --------- | 274 | | Frank | Danforth | 275 | | William | Dyer | 276 | | Anderson | Lake | 277 | | Frank | Pabodie | 278 | | Valentina | Roerich | 279 | 280 | ::::::::::::::::::::::::: 281 | 282 | :::::::::::::::::::::::::::::::::::::::::::::::::: 283 | 284 | :::::::::::::::::::::::::::::::::::::::: keypoints 285 | 286 | - The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY. 287 | - The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT. 288 | 289 | :::::::::::::::::::::::::::::::::::::::::::::::::: 290 | 291 | 292 | -------------------------------------------------------------------------------- /episodes/05-null.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Missing Data 3 | teaching: 15 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain how databases represent missing information. 
10 | - Explain the three-valued logic databases use when manipulating missing information. 11 | - Write queries that handle missing information correctly. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How do databases represent missing information? 18 | - What special handling does missing information require? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | Real-world data is never complete --- there are always holes. 23 | Databases represent these holes using a special value called `null`. 24 | `null` is not zero, `False`, or the empty string; 25 | it is a one-of-a-kind value that means "nothing here". 26 | Dealing with `null` requires a few special tricks 27 | and some careful thinking. 28 | 29 | By default, SQLite does not display NULL values in its output. The `.nullvalue` 30 | command causes SQLite to display the value you specify for NULLs. We will use 31 | the value `-null-` to make the NULLs easier to see: 32 | 33 | ```sql 34 | .nullvalue -null- 35 | ``` 36 | 37 | To start, 38 | let's have a look at the `Visited` table. 39 | There are eight records, 40 | but #752 doesn't have a date --- or rather, 41 | its date is null: 42 | 43 | ```sql 44 | SELECT * FROM Visited; 45 | ``` 46 | 47 | | id | site | dated | 48 | | ----- | ------ | ---------- | 49 | | 619 | DR-1 | 1927-02-08 | 50 | | 622 | DR-1 | 1927-02-10 | 51 | | 734 | DR-3 | 1930-01-07 | 52 | | 735 | DR-3 | 1930-01-12 | 53 | | 751 | DR-3 | 1930-02-26 | 54 | | 752 | DR-3 | \-null- | 55 | | 837 | MSK-4 | 1932-01-14 | 56 | | 844 | DR-1 | 1932-03-22 | 57 | 58 | Null doesn't behave like other values. 
59 | If we select the records that come before 1930: 60 | 61 | ```sql 62 | SELECT * FROM Visited WHERE dated < '1930-01-01'; 63 | ``` 64 | 65 | | id | site | dated | 66 | | ----- | ------ | ---------- | 67 | | 619 | DR-1 | 1927-02-08 | 68 | | 622 | DR-1 | 1927-02-10 | 69 | 70 | we get two results, 71 | and if we select the ones that come during or after 1930: 72 | 73 | ```sql 74 | SELECT * FROM Visited WHERE dated >= '1930-01-01'; 75 | ``` 76 | 77 | | id | site | dated | 78 | | ----- | ------ | ---------- | 79 | | 734 | DR-3 | 1930-01-07 | 80 | | 735 | DR-3 | 1930-01-12 | 81 | | 751 | DR-3 | 1930-02-26 | 82 | | 837 | MSK-4 | 1932-01-14 | 83 | | 844 | DR-1 | 1932-03-22 | 84 | 85 | we get five, 86 | but record #752 isn't in either set of results. 87 | The reason is that 88 | `null<'1930-01-01'` 89 | is neither true nor false: 90 | null means, "We don't know," 91 | and if we don't know the value on the left side of a comparison, 92 | we don't know whether the comparison is true or false. 93 | Since databases represent "don't know" as null, 94 | the value of `null<'1930-01-01'` 95 | is actually `null`. 96 | `null>='1930-01-01'` is also null 97 | because we can't answer that question either. 98 | And since the only records kept by a `WHERE` 99 | are those for which the test is true, 100 | record #752 isn't included in either set of results. 101 | 102 | Comparisons aren't the only operations that behave this way with nulls. 103 | `1+null` is `null`, 104 | `5*null` is `null`, 105 | `log(null)` is `null`, 106 | and so on. 
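If you want to convince yourself of this propagation, it is quick to check from any SQL prompt; here is one way, as an aside, using Python's built-in `sqlite3` module (SQL `NULL` comes back to Python as `None`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Arithmetic with NULL yields NULL, which Python reports as None.
print(conn.execute("SELECT 1 + NULL, 5 * NULL").fetchone())

# Comparisons with NULL are NULL too -- neither true nor false --
# which is why a WHERE clause drops such rows.
print(conn.execute(
    "SELECT NULL < '1930-01-01', NULL >= '1930-01-01'").fetchone())
```

Both queries print `(None, None)`: the null swallows every expression it touches.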
107 | In particular, 108 | comparing things to null with = and != produces null: 109 | 110 | ```sql 111 | SELECT * FROM Visited WHERE dated = NULL; 112 | ``` 113 | 114 | produces no output, and neither does: 115 | 116 | ```sql 117 | SELECT * FROM Visited WHERE dated != NULL; 118 | ``` 119 | 120 | To check whether a value is `null` or not, 121 | we must use a special test `IS NULL`: 122 | 123 | ```sql 124 | SELECT * FROM Visited WHERE dated IS NULL; 125 | ``` 126 | 127 | | id | site | dated | 128 | | ----- | ------ | ---------- | 129 | | 752 | DR-3 | \-null- | 130 | 131 | or its inverse `IS NOT NULL`: 132 | 133 | ```sql 134 | SELECT * FROM Visited WHERE dated IS NOT NULL; 135 | ``` 136 | 137 | | id | site | dated | 138 | | ----- | ------ | ---------- | 139 | | 619 | DR-1 | 1927-02-08 | 140 | | 622 | DR-1 | 1927-02-10 | 141 | | 734 | DR-3 | 1930-01-07 | 142 | | 735 | DR-3 | 1930-01-12 | 143 | | 751 | DR-3 | 1930-02-26 | 144 | | 837 | MSK-4 | 1932-01-14 | 145 | | 844 | DR-1 | 1932-03-22 | 146 | 147 | Null values can cause headaches wherever they appear. 148 | For example, 149 | suppose we want to find all the salinity measurements 150 | that weren't taken by Lake. 151 | It's natural to write the query like this: 152 | 153 | ```sql 154 | SELECT * FROM Survey WHERE quant = 'sal' AND person != 'lake'; 155 | ``` 156 | 157 | | taken | person | quant | reading | 158 | | ----- | ------ | ---------- | ------- | 159 | | 619 | dyer | sal | 0\.13 | 160 | | 622 | dyer | sal | 0\.09 | 161 | | 752 | roe | sal | 41\.6 | 162 | | 837 | roe | sal | 22\.5 | 163 | 164 | but this query omits the records 165 | where we don't know who took the measurement. 166 | Once again, 167 | the reason is that when `person` is `null`, 168 | the `!=` comparison produces `null`, 169 | so the record isn't kept in our results. 
170 | If we want to keep these records 171 | we need to add an explicit check: 172 | 173 | ```sql 174 | SELECT * FROM Survey WHERE quant = 'sal' AND (person != 'lake' OR person IS NULL); 175 | ``` 176 | 177 | | taken | person | quant | reading | 178 | | ----- | ------ | ---------- | ------- | 179 | | 619 | dyer | sal | 0\.13 | 180 | | 622 | dyer | sal | 0\.09 | 181 | | 735 | \-null- | sal | 0\.06 | 182 | | 752 | roe | sal | 41\.6 | 183 | | 837 | roe | sal | 22\.5 | 184 | 185 | We still have to decide whether this is the right thing to do or not. 186 | If we want to be absolutely sure that 187 | we aren't including any measurements by Lake in our results, 188 | we need to exclude all the records for which we don't know who did the work. 189 | 190 | In contrast to arithmetic or Boolean operators, aggregation functions 191 | that combine multiple values, such as `min`, `max` or `avg`, *ignore* 192 | `null` values. In the majority of cases, this is desirable: 193 | for example, unknown values then do not affect the result when we 194 | are averaging our data. Aggregation functions will be addressed in more 195 | detail in [the next section](06-agg.md). 196 | 197 | ::::::::::::::::::::::::::::::::::::::: challenge 198 | 199 | ## Sorting by Known Date 200 | 201 | Write a query that sorts the records in `Visited` by date, 202 | omitting entries for which the date is not known 203 | (i.e., is null). 
204 | 205 | ::::::::::::::: solution 206 | 207 | ## Solution 208 | 209 | ```sql 210 | SELECT * FROM Visited WHERE dated IS NOT NULL ORDER BY dated ASC; 211 | ``` 212 | 213 | | id | site | dated | 214 | | ----- | ------ | ---------- | 215 | | 619 | DR-1 | 1927-02-08 | 216 | | 622 | DR-1 | 1927-02-10 | 217 | | 734 | DR-3 | 1930-01-07 | 218 | | 735 | DR-3 | 1930-01-12 | 219 | | 751 | DR-3 | 1930-02-26 | 220 | | 837 | MSK-4 | 1932-01-14 | 221 | | 844 | DR-1 | 1932-03-22 | 222 | 223 | ::::::::::::::::::::::::: 224 | 225 | :::::::::::::::::::::::::::::::::::::::::::::::::: 226 | 227 | ::::::::::::::::::::::::::::::::::::::: challenge 228 | 229 | ## NULL in a Set 230 | 231 | What do you expect the following query to produce? 232 | 233 | ```sql 234 | SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL); 235 | ``` 236 | 237 | What does it actually produce? 238 | 239 | ::::::::::::::: solution 240 | 241 | ## Solution 242 | 243 | You might expect the above query to return rows where dated is either '1927-02-08' or NULL. 244 | Instead it only returns rows where dated is '1927-02-08', the same as you would get from this 245 | simpler query: 246 | 247 | ```sql 248 | SELECT * FROM Visited WHERE dated IN ('1927-02-08'); 249 | ``` 250 | 251 | The reason is that the `IN` operator works with a set of *values*, but NULL is by definition 252 | not a value and is therefore simply ignored. 253 | 254 | If we wanted to actually include NULL, we would have to rewrite the query to use the IS NULL condition: 255 | 256 | ```sql 257 | SELECT * FROM Visited WHERE dated = '1927-02-08' OR dated IS NULL; 258 | ``` 259 | 260 | ::::::::::::::::::::::::: 261 | 262 | :::::::::::::::::::::::::::::::::::::::::::::::::: 263 | 264 | ::::::::::::::::::::::::::::::::::::::: challenge 265 | 266 | ## Pros and Cons of Sentinels 267 | 268 | Some database designers prefer to use 269 | a [sentinel value](../learners/reference.md#sentinel-value) 270 | to mark missing data rather than `null`. 
271 | For example, 272 | they will use the date "0000-00-00" to mark a missing date, 273 | or -1.0 to mark a missing salinity or radiation reading 274 | (since actual readings cannot be negative). 275 | What does this simplify? 276 | What burdens or risks does it introduce? 277 | 278 | 279 | :::::::::::::::::::::::::::::::::::::::::::::::::: 280 | 281 | :::::::::::::::::::::::::::::::::::::::: keypoints 282 | 283 | - Databases use a special value called NULL to represent missing information. 284 | - Almost all operations on NULL produce NULL. 285 | - Queries can test for NULLs using IS NULL and IS NOT NULL. 286 | 287 | :::::::::::::::::::::::::::::::::::::::::::::::::: 288 | 289 | 290 | -------------------------------------------------------------------------------- /.github/workflows/README.md: -------------------------------------------------------------------------------- 1 | # Carpentries Workflows 2 | 3 | This directory contains workflows to be used for Lessons using the {sandpaper} 4 | lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml` 5 | and `pr-receive.yaml`) and the rest are bots to handle pull request management. 6 | 7 | These workflows will likely change as {sandpaper} evolves, so it is important to 8 | keep them up-to-date. To do this in your lesson you can do the following in your 9 | R console: 10 | 11 | ```r 12 | # Install/Update sandpaper 13 | options(repos = c(carpentries = "https://carpentries.r-universe.dev/", 14 | CRAN = "https://cloud.r-project.org")) 15 | install.packages("sandpaper") 16 | 17 | # update the workflows in your lesson 18 | library("sandpaper") 19 | update_github_workflows() 20 | ``` 21 | 22 | Inside this folder, you will find a file called `sandpaper-version.txt`, which 23 | will contain a version number for sandpaper. This will be used in the future to 24 | alert you if a workflow update is needed. 
25 | 26 | What follows are the descriptions of the workflow files: 27 | 28 | ## Deployment 29 | 30 | ### 01 Build and Deploy (sandpaper-main.yaml) 31 | 32 | This is the main driver that will only act on the main branch of the repository. 33 | This workflow does the following: 34 | 35 | 1. checks out the lesson 36 | 2. provisions the following resources 37 | - R 38 | - pandoc 39 | - lesson infrastructure (stored in a cache) 40 | - lesson dependencies if needed (stored in a cache) 41 | 3. builds the lesson via `sandpaper:::ci_deploy()` 42 | 43 | #### Caching 44 | 45 | This workflow has two caches; one cache is for the lesson infrastructure and 46 | the other is for the lesson dependencies if the lesson contains rendered 47 | content. These caches are invalidated by new versions of the infrastructure and 48 | the `renv.lock` file, respectively. If there is a problem with the cache, 49 | manual invalidation is necessary. You will need maintainer access to the repository, 50 | and you can either go to the actions tab and [click on the caches button to find 51 | and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/) 52 | or set the `CACHE_VERSION` secret to the current date (which will 53 | invalidate all of the caches). 54 | 55 | ## Updates 56 | 57 | ### Setup Information 58 | 59 | These workflows run on a schedule and at the maintainer's request. Because they 60 | create pull requests that update workflows/require the downstream actions to run, 61 | they need a special repository/organization secret token called 62 | `SANDPAPER_WORKFLOW` and it must have the `public_repo` and `workflow` scope. 63 | 64 | This can be an individual user token, OR it can be a trusted bot account. 
If you 65 | have a repository in one of the official Carpentries accounts, then you do not 66 | need to worry about this token being present because the Carpentries Core Team 67 | will take care of supplying this token. 68 | 69 | If you want to use your personal account: you can go to 70 | 71 | to create a token. Once you have created your token, you should copy it to your 72 | clipboard and then go to your repository's settings > secrets > actions and 73 | create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token. 74 | 75 | If you do not specify your token correctly, the runs will not fail; instead, they will 76 | give you instructions for providing the token for your repository. 77 | 78 | ### 02 Maintain: Update Workflow Files (update-workflows.yaml) 79 | 80 | The {sandpaper} repository was designed to do as much as possible to separate 81 | the tools from the content. For local builds, this is absolutely true, but 82 | there is a minor issue when it comes to workflow files: they must live inside 83 | the repository. 84 | 85 | This workflow ensures that the workflow files are up-to-date. The way it works is 86 | to download the update-workflows.sh script from GitHub and run it. The script 87 | will do the following: 88 | 89 | 1. check the recorded version of sandpaper against the current version on GitHub 90 | 2. update the files if there is a difference in versions 91 | 92 | After the files are updated, if there are any changes, they are pushed to a 93 | branch called `update/workflows` and a pull request is created. Maintainers are 94 | encouraged to review the changes and accept the pull request if the outputs 95 | are okay. 96 | 97 | This update is run weekly or on demand. 98 | 99 | ### 03 Maintain: Update Package Cache (update-cache.yaml) 100 | 101 | For lessons that have generated content, we use {renv} to ensure that the output 102 | is stable.
This is controlled by a single lockfile which documents the packages 103 | needed for the lesson and the version numbers. This workflow is skipped in 104 | lessons that do not have generated content. 105 | 106 | Because the lessons need to remain current with the package ecosystem, it's a 107 | good idea to make sure these packages can be updated periodically. The 108 | update cache workflow will do this by checking for updates, applying them in a 109 | branch called `updates/packages` and creating a pull request with _only the 110 | lockfile changed_. 111 | 112 | From here, the markdown documents will be rebuilt and you can inspect what has 113 | changed based on how the packages have updated. 114 | 115 | ## Pull Request and Review Management 116 | 117 | Because our lessons execute code, pull requests are a security risk for any 118 | lesson and thus have security measures associated with them. **Do not merge any 119 | pull requests that do not pass checks or that do not have bot comments on them.** 120 | 121 | These workflows all go together and are described in the following 122 | diagram and the sections below: 123 | 124 | ![Graph representation of a pull request](https://carpentries.github.io/sandpaper/articles/img/pr-flow.dot.svg) 125 | 126 | ### Pre Flight Pull Request Validation (pr-preflight.yaml) 127 | 128 | This workflow runs every time a pull request is created and its purpose is to 129 | validate that the pull request is okay to run. This means the following things: 130 | 131 | 1. The pull request does not contain modified workflow files 132 | 2. If the pull request contains modified workflow files, it does not contain 133 | modified content files (such as a situation where @carpentries-bot will 134 | make an automated pull request) 135 | 3. The pull request does not contain an invalid commit hash (e.g. from a fork 136 | that was made before a lesson was transitioned from styles to use the 137 | workbench).
138 | 139 | Once the checks are finished, a comment is issued to the pull request, which 140 | will allow maintainers to determine if it is safe to run the 141 | "Receive Pull Request" workflow from new contributors. 142 | 143 | ### Receive Pull Request (pr-receive.yaml) 144 | 145 | **Note of caution:** This workflow runs arbitrary code submitted by anyone who creates a 146 | pull request. GitHub has safeguarded the token used in this workflow to have no 147 | privileges in the repository, but we have taken precautions to protect against 148 | spoofing. 149 | 150 | This workflow is triggered with every push to a pull request. If this workflow 151 | is already running and a new push is sent to the pull request, the workflow 152 | running from the previous push will be cancelled and a new workflow run will be 153 | started. 154 | 155 | The first step of this workflow is to check if it is valid (e.g. that no 156 | workflow files have been modified). If there are workflow files that have been 157 | modified, a comment is made that indicates that the workflow is not run. If 158 | both a workflow file and lesson content are modified, an error will occur. 159 | 160 | The second step (if valid) is to build the generated content from the pull 161 | request. This builds the content and uploads three artifacts: 162 | 163 | 1. The pull request number (pr) 164 | 2. A summary of changes after the rendering process (diff) 165 | 3. The rendered files (build) 166 | 167 | Because this workflow builds generated content, it follows the same general 168 | process as the `sandpaper-main` workflow with the same caching mechanisms. 169 | 170 | The artifacts produced are used by the next workflow. 171 | 172 | ### Comment on Pull Request (pr-comment.yaml) 173 | 174 | This workflow is triggered if the `pr-receive.yaml` workflow is successful. 175 | The steps in this workflow are: 176 | 177 | 1. Test if the workflow is valid and comment the validity of the workflow to the 178 | pull request. 179 | 2.
If it is valid: create an orphan branch with two commits: the current state 180 | of the repository and the proposed changes. 181 | 3. If it is valid: update the pull request comment with the summary of changes. 182 | 183 | Importantly: if the pull request is invalid, the branch is not created so any 184 | malicious code is not published. 185 | 186 | From here, the maintainer can request changes from the author and eventually 187 | either merge or reject the PR. When this happens, if the PR was valid, the 188 | preview branch needs to be deleted. 189 | 190 | ### Send Close PR Signal (pr-close-signal.yaml) 191 | 192 | Triggered any time a pull request is closed. This emits an artifact that is the 193 | pull request number for the next action. 194 | 195 | ### Remove Pull Request Branch (pr-post-remove-branch.yaml) 196 | 197 | Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with 198 | the pull request (if it was created). 199 | -------------------------------------------------------------------------------- /episodes/11-prog-R.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Programming with Databases - R 3 | teaching: 30 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write short programs that execute SQL queries. 10 | - Trace the execution of a program that contains an SQL query. 11 | - Explain why most database applications are written in a general-purpose language rather than in SQL. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I access databases from programs written in R? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | To close, 22 | let's have a look at how to access a database from 23 | a data analysis language like R.
24 | Other languages use almost exactly the same model: 25 | library and function names may differ, 26 | but the concepts are the same. 27 | 28 | Here's a short R program that selects latitudes and longitudes 29 | from an SQLite database stored in a file called `survey.db`: 30 | 31 | ```r 32 | library(RSQLite) 33 | connection <- dbConnect(SQLite(), "survey.db") 34 | results <- dbGetQuery(connection, "SELECT Site.lat, Site.long FROM Site;") 35 | print(results) 36 | dbDisconnect(connection) 37 | ``` 38 | 39 | ```output 40 | lat long 41 | 1 -49.85 -128.57 42 | 2 -47.15 -126.72 43 | 3 -48.87 -123.40 44 | ``` 45 | 46 | The program starts by importing the `RSQLite` library. 47 | If we were connecting to MySQL, DB2, or some other database, 48 | we would import a different library, 49 | but all of them provide the same functions, 50 | so that the rest of our program does not have to change 51 | (at least, not much) 52 | if we switch from one database to another. 53 | 54 | Line 2 establishes a connection to the database. 55 | Since we're using SQLite, 56 | all we need to specify is the name of the database file. 57 | Other systems may require us to provide a username and password as well. 58 | 59 | On line 3, we retrieve the results from an SQL query. 60 | It's our job to make sure that the SQL is properly formatted; 61 | if it isn't, 62 | or if something goes wrong when it is being executed, 63 | the database will report an error. 64 | The result is a dataframe with one row for each record returned and one column for each column selected by the query. 65 | 66 | Finally, the last line closes our connection, 67 | since the database can only keep a limited number of these open at one time. 68 | Since establishing a connection takes time, 69 | though, 70 | we shouldn't open a connection, 71 | do one operation, 72 | then close the connection, 73 | only to reopen it a few microseconds later to do another operation.
74 | Instead, 75 | it's normal to create one connection that stays open for the lifetime of the program. 76 | 77 | Queries in real applications will often depend on values provided by users. 78 | For example, 79 | this function takes a user's ID as a parameter and returns their name: 80 | 81 | ```r 82 | library(RSQLite) 83 | 84 | connection <- dbConnect(SQLite(), "survey.db") 85 | 86 | getName <- function(personID) { 87 | query <- paste0("SELECT personal || ' ' || family FROM Person WHERE id = '", 88 | personID, "';") 89 | return(dbGetQuery(connection, query)) 90 | } 91 | 92 | print(paste("full name for dyer:", getName('dyer'))) 93 | 94 | dbDisconnect(connection) 95 | ``` 96 | 97 | ```output 98 | full name for dyer: William Dyer 99 | ``` 100 | 101 | We use string concatenation on the first line of this function 102 | to construct a query containing the user ID we have been given. 103 | This seems simple enough, 104 | but what happens if someone gives us this string as input? 105 | 106 | ```sql 107 | dyer'; DROP TABLE Survey; SELECT ' 108 | ``` 109 | 110 | It looks like there's garbage after the user's ID, 111 | but it is very carefully chosen garbage. 112 | If we insert this string into our query, 113 | the result is: 114 | 115 | ```sql 116 | SELECT personal || ' ' || family FROM Person WHERE id = 'dyer'; DROP TABLE Survey; SELECT ''; 117 | ``` 118 | 119 | If we execute this, 120 | it will erase one of the tables in our database. 121 | 122 | This is called an [SQL injection attack](../learners/reference.md#sql-injection-attack), 123 | and it has been used to attack thousands of programs over the years. 124 | In particular, 125 | many web sites that take data from users insert values directly into queries 126 | without checking them carefully first.
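The attack is easy to reproduce safely against a throwaway database. The sketch below uses Python's built-in `sqlite3` module rather than R (the idea is identical in any language), and recreates the lesson's tables in memory so nothing of value is harmed. Note that `executescript()` is needed here because `execute()` refuses to run more than one statement at a time, which is itself one of the protections a standard database API gives you.

```python
import sqlite3

# Build a throwaway in-memory database with the lesson's table names.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Person(id text, personal text, family text)")
connection.execute("CREATE TABLE Survey(taken integer, person text, quant text, reading real)")
connection.execute("INSERT INTO Person VALUES ('dyer', 'William', 'Dyer')")

malicious = "dyer'; DROP TABLE Survey; SELECT '"

# String concatenation: the attacker's input becomes part of the SQL itself.
query = "SELECT personal || ' ' || family FROM Person WHERE id = '" + malicious + "';"
connection.executescript(query)  # runs *all* three statements, including the DROP

# List the tables that survived.
tables = [row[0] for row in
          connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['Person'] -- Survey is gone
```

Running the concatenated query with a parameterized call instead (as shown later in this episode) would have searched for the literal, harmless ID `"dyer'; DROP TABLE Survey; SELECT '"` and found nothing.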
127 | A very [relevant XKCD](https://xkcd.com/327/) explains the 128 | dangers of using raw input in queries a little more succinctly: 129 | 130 | ![](https://imgs.xkcd.com/comics/exploits_of_a_mom.png){alt='relevant XKCD'} 131 | 132 | Since an unscrupulous parent might try to smuggle commands into our queries in many different ways, 133 | the safest way to deal with this threat is 134 | to replace characters like quotes with their escaped equivalents, 135 | so that we can safely put whatever the user gives us inside a string. 136 | We can do this by using a [prepared statement](../learners/reference.md#prepared-statement) 137 | instead of formatting our statements as strings. 138 | Here's what our example program looks like if we do this: 139 | 140 | ```r 141 | library(RSQLite) 142 | connection <- dbConnect(SQLite(), "survey.db") 143 | 144 | getName <- function(personID) { 145 | query <- "SELECT personal || ' ' || family FROM Person WHERE id = ?" 146 | return(dbGetQuery(connection, query, params = list(personID))) 147 | } 148 | 149 | print(paste("full name for dyer:", getName('dyer'))) 150 | 151 | dbDisconnect(connection) 152 | ``` 153 | 154 | ```output 155 | full name for dyer: William Dyer 156 | ``` 157 | 158 | The key changes are in the query string and the `dbGetQuery` call, which now receives a `params` argument. 159 | Instead of formatting the query ourselves, 160 | we put question marks in the query template where we want to insert values. 161 | When we call `dbGetQuery` with `params`, 162 | we provide a list 163 | that contains as many values as there are question marks in the query. 164 | The library matches values to question marks in order, 165 | and translates any special characters in the values 166 | into their escaped equivalents 167 | so that they are safe to use. 168 | (Older versions of RSQLite provided a separate `dbGetPreparedQuery` function for this, but current versions use the `params` argument instead.) 169 | ::::::::::::::::::::::::::::::::::::::: challenge 170 | 171 | ## Filling a Table vs.
Printing Values 172 | 173 | Write an R program that creates a new database in a file called 174 | `original.db` containing a single table called `Pressure`, with a 175 | single field called `reading`, and inserts 100,000 random numbers 176 | between 10.0 and 25.0. How long does it take this program to run? 177 | How long does it take to run a program that simply writes those 178 | random numbers to a file? 179 | 180 | 181 | :::::::::::::::::::::::::::::::::::::::::::::::::: 182 | 183 | ::::::::::::::::::::::::::::::::::::::: challenge 184 | 185 | ## Filtering in SQL vs. Filtering in R 186 | 187 | Write an R program that creates a new database called 188 | `backup.db` with the same structure as `original.db` and copies all 189 | the values greater than 20.0 from `original.db` to `backup.db`. 190 | Which is faster: filtering values in the query, or reading 191 | everything into memory and filtering in R? 192 | 193 | 194 | :::::::::::::::::::::::::::::::::::::::::::::::::: 195 | 196 | ## Database helper functions in R 197 | 198 | R's database interface packages (like `RSQLite`) all share 199 | a common set of helper functions useful for exploring databases and 200 | reading/writing entire tables at once. 
201 | 202 | To view all tables in a database, we can use `dbListTables()`: 203 | 204 | ```r 205 | connection <- dbConnect(SQLite(), "survey.db") 206 | dbListTables(connection) 207 | ``` 208 | 209 | ```output 210 | "Person" "Site" "Survey" "Visited" 211 | ``` 212 | 213 | To view all column names of a table, use `dbListFields()`: 214 | 215 | ```r 216 | dbListFields(connection, "Survey") 217 | ``` 218 | 219 | ```output 220 | "taken" "person" "quant" "reading" 221 | ``` 222 | 223 | To read an entire table as a dataframe, use `dbReadTable()`: 224 | 225 | ```r 226 | dbReadTable(connection, "Person") 227 | ``` 228 | 229 | ```output 230 | id personal family 231 | 1 dyer William Dyer 232 | 2 pb Frank Pabodie 233 | 3 lake Anderson Lake 234 | 4 roe Valentina Roerich 235 | 5 danforth Frank Danforth 236 | ``` 237 | 238 | Finally to write an entire table to a database, you can use `dbWriteTable()`. 239 | Note that we will always want to use the `row.names = FALSE` argument or R 240 | will write the row names as a separate column. 241 | In this example we will write R's built-in `iris` dataset as a table in `survey.db`. 242 | 243 | ```r 244 | dbWriteTable(connection, "iris", iris, row.names = FALSE) 245 | head(dbReadTable(connection, "iris")) 246 | ``` 247 | 248 | ```output 249 | Sepal.Length Sepal.Width Petal.Length Petal.Width Species 250 | 1 5.1 3.5 1.4 0.2 setosa 251 | 2 4.9 3.0 1.4 0.2 setosa 252 | 3 4.7 3.2 1.3 0.2 setosa 253 | 4 4.6 3.1 1.5 0.2 setosa 254 | 5 5.0 3.6 1.4 0.2 setosa 255 | 6 5.4 3.9 1.7 0.4 setosa 256 | ``` 257 | 258 | And as always, remember to close the database connection when done! 259 | 260 | ```r 261 | dbDisconnect(connection) 262 | ``` 263 | 264 | :::::::::::::::::::::::::::::::::::::::: keypoints 265 | 266 | - Data analysis languages have libraries for accessing databases. 267 | - To connect to a database, a program must use a library specific to that database manager. 268 | - R's libraries can be used to directly query or read from a database. 
269 | - Programs can read query results in batches or all at once. 270 | - Queries should be written using parameter substitution, not string formatting. 271 | - R has multiple helper functions to make working with databases easier. 272 | 273 | :::::::::::::::::::::::::::::::::::::::::::::::::: 274 | 275 | 276 | -------------------------------------------------------------------------------- /episodes/09-create.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Creating and Modifying Data 3 | teaching: 15 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write statements that create tables. 10 | - Write statements to insert, modify, and delete records. 11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I create, modify, and delete tables and data? 17 | 18 | :::::::::::::::::::::::::::::::::::::::::::::::::: 19 | 20 | So far we have only looked at how to get information out of a database, 21 | both because that is more frequent than adding information, 22 | and because most other operations only make sense 23 | once queries are understood. 24 | 25 | The `Person`, `Survey`, `Site`, and `Visited` tables from the `survey.db` database were 26 | used during the earlier episodes. We're going to build a new database over the course 27 | of the upcoming episodes. Exit the `SQLite` interactive session if you're still in it. 28 | 29 | ```sql 30 | .exit 31 | ``` 32 | 33 | Launch `SQLite3` and create a new database, let's call it `newsurvey.db`. 34 | We use a different name to avoid confusion with the currently existing `survey.db` database. 35 | 36 | ``` 37 | $ sqlite3 newsurvey.db 38 | ``` 39 | 40 | Run the `.mode column` and `.header on` commands again if you aren't using the `.sqliterc` file.
41 | (Note: if you exited and restarted SQLite3, your settings will change back to the default.) 42 | 43 | ```sql 44 | .mode column 45 | .header on 46 | ``` 47 | 48 | If we want to create and modify data, 49 | we need to know two other sets of commands. 50 | 51 | The first pair are [`CREATE TABLE`][create-table] and [`DROP TABLE`][drop-table]. 52 | While they are written as two words, 53 | they are actually single commands. 54 | The first one creates a new table; 55 | its arguments are the names and types of the table's columns. 56 | For example, 57 | the following statements create the four tables in our survey database: 58 | 59 | ```sql 60 | CREATE TABLE Person(id text, personal text, family text); 61 | CREATE TABLE Site(name text, lat real, long real); 62 | CREATE TABLE Visited(id integer, site text, dated text); 63 | CREATE TABLE Survey(taken integer, person text, quant text, reading real); 64 | ``` 65 | 66 | We can get rid of one of our tables using: 67 | 68 | ```sql 69 | DROP TABLE Survey; 70 | ``` 71 | 72 | Be very careful when doing this: 73 | if you drop the wrong table, hope that the person maintaining the database has a backup, 74 | but it's better not to have to rely on it. 75 | 76 | Different database systems support different data types for table columns, 77 | but most provide the following: 78 | 79 | | data type | use | 80 | | --------- | ----------------------------------------- | 81 | | INTEGER | a signed integer | 82 | | REAL | a floating point number | 83 | | TEXT | a character string | 84 | | BLOB | a "binary large object", such as an image | 85 | 86 | Most databases also support Booleans and date/time values; 87 | SQLite uses the integers 0 and 1 for the former, 88 | and represents the latter as discussed [earlier](03-filter.md). 89 | An increasing number of databases also support geographic data types, 90 | such as latitude and longitude.
91 | Keeping track of what particular systems do or do not offer, 92 | and what names they give different data types, 93 | is an unending portability headache. 94 | 95 | When we create a table, 96 | we can specify several kinds of constraints on its columns. 97 | For example, 98 | a better definition for the `Survey` table would be: 99 | 100 | ```sql 101 | CREATE TABLE Survey( 102 | taken integer not null, -- where reading taken 103 | person text, -- may not know who took it 104 | quant text not null, -- the quantity measured 105 | reading real not null, -- the actual reading 106 | primary key(taken, person, quant), -- key is taken + person + quant 107 | foreign key(taken) references Visited(id), 108 | foreign key(person) references Person(id) 109 | ); 110 | ``` 111 | 112 | Once again, 113 | exactly what constraints are available 114 | and what they're called 115 | depends on which database manager we are using. 116 | 117 | Once tables have been created, 118 | we can add, change, and remove records using our other set of commands, 119 | `INSERT`, `UPDATE`, and `DELETE`. 120 | 121 | Here is an example of inserting rows into the `Site` table: 122 | 123 | ```sql 124 | INSERT INTO Site (name, lat, long) VALUES ('DR-1', -49.85, -128.57); 125 | INSERT INTO Site (name, lat, long) VALUES ('DR-3', -47.15, -126.72); 126 | INSERT INTO Site (name, lat, long) VALUES ('MSK-4', -48.87, -123.40); 127 | ``` 128 | 129 | We can also insert values into one table directly from another: 130 | 131 | ```sql 132 | CREATE TABLE JustLatLong(lat real, long real); 133 | INSERT INTO JustLatLong SELECT lat, long FROM Site; 134 | ``` 135 | 136 | Modifying existing records is done using the `UPDATE` statement. 137 | To do this we tell the database which table we want to update, 138 | what we want to change the values to for any or all of the fields, 139 | and under what conditions we should update the values. 
140 | 141 | For example, if we made a mistake when entering the lat and long values 142 | of the last `INSERT` statement above, we can correct it with an update: 143 | 144 | ```sql 145 | UPDATE Site SET lat = -47.87, long = -122.40 WHERE name = 'MSK-4'; 146 | ``` 147 | 148 | Be careful to not forget the `WHERE` clause or the update statement will 149 | modify *all* of the records in the `Site` table. 150 | 151 | Deleting records can be a bit trickier, 152 | because we have to ensure that the database remains internally consistent. 153 | If all we care about is a single table, 154 | we can use the `DELETE` command with a `WHERE` clause 155 | that matches the records we want to discard. 156 | For example, 157 | once we realize that Frank Danforth didn't take any measurements, 158 | we can remove him from the `Person` table like this: 159 | 160 | ```sql 161 | DELETE FROM Person WHERE id = 'danforth'; 162 | ``` 163 | 164 | But what if we removed Anderson Lake instead? 165 | Our `Survey` table would still contain seven records 166 | of measurements he'd taken, 167 | but that's never supposed to happen: 168 | `Survey.person` is a foreign key into the `Person` table, 169 | and all our queries assume there will be a row in the latter 170 | matching every value in the former. 171 | 172 | This problem is called [referential integrity](../learners/reference.md#referential-integrity): 173 | we need to ensure that all references between tables can always be resolved correctly. 174 | One way to do this is to delete all the records 175 | that use `'lake'` as a foreign key 176 | before deleting the record that uses it as a primary key. 177 | If our database manager supports it, 178 | we can automate this 179 | using [cascading delete](../learners/reference.md#cascading-delete). 180 | However, 181 | this technique is outside the scope of this chapter. 
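Although cascading delete is beyond the scope of this chapter, it is worth seeing once. The sketch below is in Python (its built-in `sqlite3` module, used here purely for illustration) against a throwaway in-memory database. The important parts are the `PRAGMA foreign_keys = ON` line, since SQLite does not enforce foreign keys by default, and the `ON DELETE CASCADE` clause in the table definition:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys by default

connection.execute("CREATE TABLE Person(id text primary key)")
connection.execute("""CREATE TABLE Survey(
    taken integer,
    person text REFERENCES Person(id) ON DELETE CASCADE,
    reading real)""")

connection.execute("INSERT INTO Person VALUES ('lake')")
connection.execute("INSERT INTO Survey VALUES (734, 'lake', 0.05)")
connection.execute("INSERT INTO Survey VALUES (751, 'lake', 0.1)")

# Deleting the Person row automatically removes the Survey rows referring to it,
# so no dangling references are left behind.
connection.execute("DELETE FROM Person WHERE id = 'lake'")
remaining = connection.execute("SELECT COUNT(*) FROM Survey").fetchone()[0]
print(remaining)  # 0
```

Without the `PRAGMA`, the same `DELETE` would succeed but leave two orphaned `Survey` rows behind, which is exactly the referential integrity problem described above.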
182 | 183 | ::::::::::::::::::::::::::::::::::::::::: callout 184 | 185 | ## Hybrid Storage Models 186 | 187 | Many applications use a hybrid storage model 188 | instead of putting everything into a database: 189 | the actual data (such as astronomical images) is stored in files, 190 | while the database stores the files' names, 191 | their modification dates, 192 | the region of the sky they cover, 193 | their spectral characteristics, 194 | and so on. 195 | This is also how most music player software is built: 196 | the database inside the application keeps track of the MP3 files, 197 | but the files themselves live on disk. 198 | 199 | 200 | :::::::::::::::::::::::::::::::::::::::::::::::::: 201 | 202 | ::::::::::::::::::::::::::::::::::::::: challenge 203 | 204 | ## Replacing NULL 205 | 206 | Write an SQL statement to replace all uses of `null` in 207 | `Survey.person` with the string `'unknown'`. 208 | 209 | ::::::::::::::: solution 210 | 211 | ## Solution 212 | 213 | ```sql 214 | UPDATE Survey SET person = 'unknown' WHERE person IS NULL; 215 | ``` 216 | 217 | ::::::::::::::::::::::::: 218 | 219 | :::::::::::::::::::::::::::::::::::::::::::::::::: 220 | 221 | ::::::::::::::::::::::::::::::::::::::: challenge 222 | 223 | ## Backing Up with SQL 224 | 225 | SQLite has several administrative commands that aren't part of the 226 | SQL standard. One of them is `.dump`, which prints the SQL commands 227 | needed to re-create the database. Another is `.read`, which reads a 228 | file created by `.dump` and restores the database. A colleague of 229 | yours thinks that storing dump files (which are text) in version 230 | control is a good way to track and manage changes to the database. 231 | What are the pros and cons of this approach? (Hint: records aren't 232 | stored in any particular order.) 
233 | 234 | ::::::::::::::: solution 235 | 236 | ## Solution 237 | 238 | #### Advantages 239 | 240 | - A version control system will be able to show differences between versions 241 | of the dump file, something it can't do for binary files like databases 242 | - A VCS only saves changes between versions, rather than a complete copy of 243 | each version (saving disk space) 244 | - The version control log will explain the reason for the changes in each version 245 | of the database 246 | 247 | #### Disadvantages 248 | 249 | - Artificial differences between commits because records don't have a fixed order 250 | 251 | 252 | 253 | ::::::::::::::::::::::::: 254 | 255 | :::::::::::::::::::::::::::::::::::::::::::::::::: 256 | 257 | [create-table]: https://www.sqlite.org/lang_createtable.html 258 | [drop-table]: https://www.sqlite.org/lang_droptable.html 259 | 260 | 261 | :::::::::::::::::::::::::::::::::::::::: keypoints 262 | 263 | - Use CREATE and DROP to create and delete tables. 264 | - Use INSERT to add data. 265 | - Use UPDATE to modify existing data. 266 | - Use DELETE to remove data. 267 | - It is simpler and safer to modify data when every record has a unique primary key. 268 | - Do not create dangling references by deleting records that other records refer to. 269 | 270 | :::::::::::::::::::::::::::::::::::::::::::::::::: 271 | 272 | 273 | -------------------------------------------------------------------------------- /episodes/03-filter.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Filtering 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that select records that satisfy user-specified conditions. 10 | - Explain the order in which the clauses in a query are executed.
11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I select subsets of data? 17 | 18 | :::::::::::::::::::::::::::::::::::::::::::::::::: 19 | 20 | One of the most powerful features of a database is 21 | the ability to [filter](../learners/reference.md#filter) data, 22 | i.e., 23 | to select only those records that match certain criteria. 24 | For example, 25 | suppose we want to see when a particular site was visited. 26 | We can select these records from the `Visited` table 27 | by using a `WHERE` clause in our query: 28 | 29 | ```sql 30 | SELECT * FROM Visited WHERE site = 'DR-1'; 31 | ``` 32 | 33 | | id | site | dated | 34 | | ------ | ------ | ---------- | 35 | | 619 | DR-1 | 1927-02-08 | 36 | | 622 | DR-1 | 1927-02-10 | 37 | | 844 | DR-1 | 1932-03-22 | 38 | 39 | The database manager executes this query in two stages. 40 | First, 41 | it checks each row in the `Visited` table 42 | to see which ones satisfy the `WHERE`. 43 | It then uses the column names following the `SELECT` keyword 44 | to determine which columns to display. 45 | 46 | This processing order means that 47 | we can filter records using `WHERE` 48 | based on values in columns that aren't then displayed: 49 | 50 | ```sql 51 | SELECT id FROM Visited WHERE site = 'DR-1'; 52 | ``` 53 | 54 | | id | 55 | | ------ | 56 | | 619 | 57 | | 622 | 58 | | 844 | 59 | 60 | ![](fig/sql-filter.svg){alt='SQL Filtering in Action'} 61 | 62 | We can use many other Boolean operators to filter our data.
63 | For example, 64 | we can ask for all information from the DR-1 site collected before 1930: 65 | 66 | ```sql 67 | SELECT * FROM Visited WHERE site = 'DR-1' AND dated < '1930-01-01'; 68 | ``` 69 | 70 | | id | site | dated | 71 | | ------ | ------ | ---------- | 72 | | 619 | DR-1 | 1927-02-08 | 73 | | 622 | DR-1 | 1927-02-10 | 74 | 75 | ::::::::::::::::::::::::::::::::::::::::: callout 76 | 77 | ## Date Types 78 | 79 | Most database managers have a special data type for dates. 80 | In fact, many have two: 81 | one for dates, 82 | such as "May 31, 1971", 83 | and one for durations, 84 | such as "31 days". 85 | SQLite doesn't: 86 | instead, 87 | it stores dates as either text 88 | (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"), 89 | real numbers 90 | ([Julian days](https://en.wikipedia.org/wiki/Julian_day), the number of days since November 24, 4714 BCE), 91 | or integers 92 | ([Unix time](https://en.wikipedia.org/wiki/Unix_time), the number of seconds since midnight, January 1, 1970). 93 | If this sounds complicated, 94 | it is, 95 | but not nearly as complicated as figuring out 96 | [historical dates in Sweden](https://en.wikipedia.org/wiki/Swedish_calendar). 
97 | 98 | 99 | :::::::::::::::::::::::::::::::::::::::::::::::::: 100 | 101 | If we want to find out what measurements were taken by either Lake or Roerich, 102 | we can combine the tests on their names using `OR`: 103 | 104 | ```sql 105 | SELECT * FROM Survey WHERE person = 'lake' OR person = 'roe'; 106 | ``` 107 | 108 | | taken | person | quant | reading | 109 | | ------ | ------ | ---------- | ------- | 110 | | 734 | lake | sal | 0\.05 | 111 | | 751 | lake | sal | 0\.1 | 112 | | 752 | lake | rad | 2\.19 | 113 | | 752 | lake | sal | 0\.09 | 114 | | 752 | lake | temp | \-16.0 | 115 | | 752 | roe | sal | 41\.6 | 116 | | 837 | lake | rad | 1\.46 | 117 | | 837 | lake | sal | 0\.21 | 118 | | 837 | roe | sal | 22\.5 | 119 | | 844 | roe | rad | 11\.25 | 120 | 121 | Alternatively, 122 | we can use `IN` to see if a value is in a specific set: 123 | 124 | ```sql 125 | SELECT * FROM Survey WHERE person IN ('lake', 'roe'); 126 | ``` 127 | 128 | | taken | person | quant | reading | 129 | | ------ | ------ | ---------- | ------- | 130 | | 734 | lake | sal | 0\.05 | 131 | | 751 | lake | sal | 0\.1 | 132 | | 752 | lake | rad | 2\.19 | 133 | | 752 | lake | sal | 0\.09 | 134 | | 752 | lake | temp | \-16.0 | 135 | | 752 | roe | sal | 41\.6 | 136 | | 837 | lake | rad | 1\.46 | 137 | | 837 | lake | sal | 0\.21 | 138 | | 837 | roe | sal | 22\.5 | 139 | | 844 | roe | rad | 11\.25 | 140 | 141 | We can combine `AND` with `OR`, 142 | but we need to be careful about which operator is executed first. 
143 | If we *don't* use parentheses, 144 | we get this: 145 | 146 | ```sql 147 | SELECT * FROM Survey WHERE quant = 'sal' AND person = 'lake' OR person = 'roe'; 148 | ``` 149 | 150 | | taken | person | quant | reading | 151 | | ------ | ------ | ---------- | ------- | 152 | | 734 | lake | sal | 0\.05 | 153 | | 751 | lake | sal | 0\.1 | 154 | | 752 | lake | sal | 0\.09 | 155 | | 752 | roe | sal | 41\.6 | 156 | | 837 | lake | sal | 0\.21 | 157 | | 837 | roe | sal | 22\.5 | 158 | | 844 | roe | rad | 11\.25 | 159 | 160 | which is salinity measurements by Lake, 161 | and *any* measurement by Roerich. 162 | We probably want this instead: 163 | 164 | ```sql 165 | SELECT * FROM Survey WHERE quant = 'sal' AND (person = 'lake' OR person = 'roe'); 166 | ``` 167 | 168 | | taken | person | quant | reading | 169 | | ------ | ------ | ---------- | ------- | 170 | | 734 | lake | sal | 0\.05 | 171 | | 751 | lake | sal | 0\.1 | 172 | | 752 | lake | sal | 0\.09 | 173 | | 752 | roe | sal | 41\.6 | 174 | | 837 | lake | sal | 0\.21 | 175 | | 837 | roe | sal | 22\.5 | 176 | 177 | We can also filter by partial matches. For example, if we want to 178 | know something just about the site names beginning with "DR" we can 179 | use the `LIKE` keyword. The percent symbol acts as a 180 | [wildcard](../learners/reference.md#wildcard), matching any characters in that 181 | place. 
It can be used at the beginning, middle, or end of the string: 182 | 183 | ```sql 184 | SELECT * FROM Visited WHERE site LIKE 'DR%'; 185 | ``` 186 | 187 | | id | site | dated | 188 | | ------ | ------ | ---------- | 189 | | 619 | DR-1 | 1927-02-08 | 190 | | 622 | DR-1 | 1927-02-10 | 191 | | 734 | DR-3 | 1930-01-07 | 192 | | 735 | DR-3 | 1930-01-12 | 193 | | 751 | DR-3 | 1930-02-26 | 194 | | 752 | DR-3 | | 195 | | 844 | DR-1 | 1932-03-22 | 196 | 197 | Finally, 198 | we can use `DISTINCT` with `WHERE` 199 | to give a second level of filtering: 200 | 201 | ```sql 202 | SELECT DISTINCT person, quant FROM Survey WHERE person = 'lake' OR person = 'roe'; 203 | ``` 204 | 205 | | person | quant | 206 | | ------ | ------ | 207 | | lake | sal | 208 | | lake | rad | 209 | | lake | temp | 210 | | roe | sal | 211 | | roe | rad | 212 | 213 | But remember: 214 | `DISTINCT` is applied to the values displayed in the chosen columns, 215 | not to the entire rows as they are being processed. 216 | 217 | ::::::::::::::::::::::::::::::::::::::::: callout 218 | 219 | ## Growing Queries 220 | 221 | What we have just done is how most people "grow" their SQL queries. 222 | We started with something simple that did part of what we wanted, 223 | then added more clauses one by one, 224 | testing their effects as we went. 225 | This is a good strategy --- in fact, 226 | for complex queries it's often the *only* strategy --- but 227 | it depends on quick turnaround, 228 | and on us recognizing the right answer when we get it. 229 | 230 | The best way to achieve a quick turnaround is often 231 | to put a subset of data in a temporary database 232 | and run our queries against that, 233 | or to fill a small database with synthesized records. 
234 | For example,
235 | instead of trying our queries against an actual database of 20 million Australians,
236 | we could run them against a sample of ten thousand,
237 | or write a small program to generate ten thousand random (but plausible) records
238 | and use that.
239 | 
240 | 
241 | ::::::::::::::::::::::::::::::::::::::::::::::::::
242 | 
243 | ::::::::::::::::::::::::::::::::::::::: challenge
244 | 
245 | ## Fix This Query
246 | 
247 | Suppose we want to select all sites that lie within 48 degrees of the equator.
248 | Our first query is:
249 | 
250 | ```sql
251 | SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);
252 | ```
253 | 
254 | Explain why this is wrong,
255 | and rewrite the query so that it is correct.
256 | 
257 | ::::::::::::::: solution
258 | 
259 | ## Solution
260 | 
261 | Because we used `OR`, a site at the South Pole, for example, would still meet
262 | the second criterion and thus be included. Instead, we want to restrict this
263 | to sites that meet *both* criteria:
264 | 
265 | ```sql
266 | SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);
267 | ```
268 | 
269 | :::::::::::::::::::::::::
270 | 
271 | ::::::::::::::::::::::::::::::::::::::::::::::::::
272 | 
273 | ::::::::::::::::::::::::::::::::::::::: challenge
274 | 
275 | ## Finding Outliers
276 | 
277 | Normalized salinity readings are supposed to be between 0.0 and 1.0.
278 | Write a query that selects all records from `Survey`
279 | with salinity values outside this range.
280 | 281 | ::::::::::::::: solution 282 | 283 | ## Solution 284 | 285 | ```sql 286 | SELECT * FROM Survey WHERE quant = 'sal' AND ((reading > 1.0) OR (reading < 0.0)); 287 | ``` 288 | 289 | | taken | person | quant | reading | 290 | | ------ | ------ | ---------- | ------- | 291 | | 752 | roe | sal | 41\.6 | 292 | | 837 | roe | sal | 22\.5 | 293 | 294 | ::::::::::::::::::::::::: 295 | 296 | :::::::::::::::::::::::::::::::::::::::::::::::::: 297 | 298 | ::::::::::::::::::::::::::::::::::::::: challenge 299 | 300 | ## Matching Patterns 301 | 302 | Which of these expressions are true? 303 | 304 | 1. `'a' LIKE 'a'` 305 | 2. `'a' LIKE '%a'` 306 | 3. `'beta' LIKE '%a'` 307 | 4. `'alpha' LIKE 'a%%'` 308 | 5. `'alpha' LIKE 'a%p%'` 309 | 310 | ::::::::::::::: solution 311 | 312 | ## Solution 313 | 314 | 1. True because these are the same character. 315 | 2. True because the wildcard can match *zero* or more characters. 316 | 3. True because the `%` matches `bet` and the `a` matches the `a`. 317 | 4. True because the first wildcard matches `lpha` and the second wildcard matches zero characters (or vice versa). 318 | 5. True because the first wildcard matches `l` and the second wildcard matches `ha`. 319 | 320 | 321 | 322 | ::::::::::::::::::::::::: 323 | 324 | :::::::::::::::::::::::::::::::::::::::::::::::::: 325 | 326 | :::::::::::::::::::::::::::::::::::::::: keypoints 327 | 328 | - Use WHERE to specify conditions that records must meet in order to be included in a query's results. 329 | - Use AND, OR, and NOT to combine tests. 330 | - Filtering is done on whole records, so conditions can use fields that are not actually displayed. 331 | - Write queries incrementally. 
332 | 333 | :::::::::::::::::::::::::::::::::::::::::::::::::: 334 | 335 | 336 | -------------------------------------------------------------------------------- /episodes/10-prog.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Programming with Databases - Python 3 | teaching: 20 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write short programs that execute SQL queries. 10 | - Trace the execution of a program that contains an SQL query. 11 | - Explain why most database applications are written in a general-purpose language rather than in SQL. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I access databases from programs written in Python? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | To close, 22 | let's have a look at how to access a database from 23 | a general-purpose programming language like Python. 24 | Other languages use almost exactly the same model: 25 | library and function names may differ, 26 | but the concepts are the same. 27 | 28 | Here's a short Python program that selects latitudes and longitudes 29 | from an SQLite database stored in a file called `survey.db`: 30 | 31 | ```python 32 | import sqlite3 33 | 34 | connection = sqlite3.connect("survey.db") 35 | cursor = connection.cursor() 36 | cursor.execute("SELECT Site.lat, Site.long FROM Site;") 37 | results = cursor.fetchall() 38 | for r in results: 39 | print(r) 40 | cursor.close() 41 | connection.close() 42 | ``` 43 | 44 | ```output 45 | (-49.85, -128.57) 46 | (-47.15, -126.72) 47 | (-48.87, -123.4) 48 | ``` 49 | 50 | The program starts by importing the `sqlite3` library. 
51 | If we were connecting to MySQL, DB2, or some other database, 52 | we would import a different library, 53 | but all of them provide the same functions, 54 | so that the rest of our program does not have to change 55 | (at least, not much) 56 | if we switch from one database to another. 57 | 58 | Line 2 establishes a connection to the database. 59 | Since we're using SQLite, 60 | all we need to specify is the name of the database file. 61 | Other systems may require us to provide a username and password as well. 62 | Line 3 then uses this connection to create a [cursor](../learners/reference.md#cursor). 63 | Just like the cursor in an editor, 64 | its role is to keep track of where we are in the database. 65 | 66 | On line 4, we use that cursor to ask the database to execute a query for us. 67 | The query is written in SQL, 68 | and passed to `cursor.execute` as a string. 69 | It's our job to make sure that SQL is properly formatted; 70 | if it isn't, 71 | or if something goes wrong when it is being executed, 72 | the database will report an error. 73 | 74 | The database returns the results of the query to us 75 | in response to the `cursor.fetchall` call on line 5. 76 | This result is a list with one entry for each record in the result set; 77 | if we loop over that list (line 6) and print those list entries (line 7), 78 | we can see that each one is a tuple 79 | with one element for each field we asked for. 80 | 81 | Finally, lines 8 and 9 close our cursor and our connection, 82 | since the database can only keep a limited number of these open at one time. 83 | Since establishing a connection takes time, 84 | though, 85 | we shouldn't open a connection, 86 | do one operation, 87 | then close the connection, 88 | only to reopen it a few microseconds later to do another operation. 89 | Instead, 90 | it's normal to create one connection that stays open for the lifetime of the program. 
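The single-connection pattern described above can be sketched like this, using a throwaway in-memory copy of the `Site` table rather than `survey.db`, so it is safe to experiment with:

```python
import sqlite3

# One connection for the whole program, opened once and closed once.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Set up a small stand-in for the Site table from the lesson.
cursor.execute("CREATE TABLE Site (name text, lat real, long real)")
cursor.executemany("INSERT INTO Site VALUES (?, ?, ?)",
                   [("DR-1", -49.85, -128.57),
                    ("DR-3", -47.15, -126.72),
                    ("MSK-4", -48.87, -123.4)])

# The same cursor can then be reused for as many queries as we need.
cursor.execute("SELECT count(*) FROM Site;")
print(cursor.fetchone()[0])    # 3
cursor.execute("SELECT name FROM Site WHERE lat < -48;")
print(cursor.fetchall())       # [('DR-1',), ('MSK-4',)]

cursor.close()
connection.close()
```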
91 | 92 | Queries in real applications will often depend on values provided by users. 93 | For example, 94 | this function takes a user's ID as a parameter and returns their name: 95 | 96 | ```python 97 | import sqlite3 98 | 99 | def get_name(database_file, person_id): 100 | query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';" 101 | 102 | connection = sqlite3.connect(database_file) 103 | cursor = connection.cursor() 104 | cursor.execute(query) 105 | results = cursor.fetchall() 106 | cursor.close() 107 | connection.close() 108 | 109 | return results[0][0] 110 | 111 | print("Full name for dyer:", get_name('survey.db', 'dyer')) 112 | ``` 113 | 114 | ```output 115 | Full name for dyer: William Dyer 116 | ``` 117 | 118 | We use string concatenation on the first line of this function 119 | to construct a query containing the user ID we have been given. 120 | This seems simple enough, 121 | but what happens if someone gives us this string as input? 122 | 123 | ```source 124 | dyer'; DROP TABLE Survey; SELECT ' 125 | ``` 126 | 127 | It looks like there's garbage after the user's ID, 128 | but it is very carefully chosen garbage. 129 | If we insert this string into our query, 130 | the result is: 131 | 132 | ```sql 133 | SELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT ''; 134 | ``` 135 | 136 | If we execute this, 137 | it will erase one of the tables in our database. 138 | 139 | This is called an [SQL injection attack](../learners/reference.md#sql-injection-attack), 140 | and it has been used to attack thousands of programs over the years. 141 | In particular, 142 | many web sites that take data from users insert values directly into queries 143 | without checking them carefully first. 
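One nuance worth knowing: Python's `sqlite3` module happens to refuse more than one statement per `execute()` call, so this particular payload is rejected rather than run. That is a property of this one library (its `executescript()` method has no such guard, and many other database libraries will happily run the whole string), so it is no substitute for handling user input safely. A sketch with a throwaway in-memory database:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Person (id text, personal text, family text)")

# The carefully chosen "garbage" from above, spliced in by string concatenation.
user_input = "dyer'; DROP TABLE Person; SELECT '"
query = "SELECT personal FROM Person WHERE id='" + user_input + "';"

try:
    cursor.execute(query)
except (sqlite3.Warning, sqlite3.ProgrammingError) as err:
    # Older Pythons raise sqlite3.Warning here, newer ones ProgrammingError;
    # either way the multi-statement string is rejected.
    print("rejected:", err)
```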
144 | 145 | Since a villain might try to smuggle commands into our queries in many different ways, 146 | the safest way to deal with this threat is 147 | to replace characters like quotes with their escaped equivalents, 148 | so that we can safely put whatever the user gives us inside a string. 149 | We can do this by using a [prepared statement](../learners/reference.md#prepared-statement) 150 | instead of formatting our statements as strings. 151 | Here's what our example program looks like if we do this: 152 | 153 | ```python 154 | import sqlite3 155 | 156 | def get_name(database_file, person_id): 157 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 158 | 159 | connection = sqlite3.connect(database_file) 160 | cursor = connection.cursor() 161 | cursor.execute(query, [person_id]) 162 | results = cursor.fetchall() 163 | cursor.close() 164 | connection.close() 165 | 166 | return results[0][0] 167 | 168 | print("Full name for dyer:", get_name('survey.db', 'dyer')) 169 | ``` 170 | 171 | ```output 172 | Full name for dyer: William Dyer 173 | ``` 174 | 175 | The key changes are in the query string and the `execute` call. 176 | Instead of formatting the query ourselves, 177 | we put question marks in the query template where we want to insert values. 178 | When we call `execute`, 179 | we provide a list 180 | that contains as many values as there are question marks in the query. 181 | The library matches values to question marks in order, 182 | and translates any special characters in the values 183 | into their escaped equivalents 184 | so that they are safe to use. 185 | 186 | We can also use `sqlite3`'s cursor to make changes to our database, 187 | such as inserting a new name. 
188 | For instance, we can define a new function called `add_name` like so: 189 | 190 | ```python 191 | import sqlite3 192 | 193 | def add_name(database_file, new_person): 194 | query = "INSERT INTO Person (id, personal, family) VALUES (?, ?, ?);" 195 | 196 | connection = sqlite3.connect(database_file) 197 | cursor = connection.cursor() 198 | cursor.execute(query, list(new_person)) 199 | cursor.close() 200 | connection.close() 201 | 202 | 203 | def get_name(database_file, person_id): 204 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 205 | 206 | connection = sqlite3.connect(database_file) 207 | cursor = connection.cursor() 208 | cursor.execute(query, [person_id]) 209 | results = cursor.fetchall() 210 | cursor.close() 211 | connection.close() 212 | 213 | return results[0][0] 214 | 215 | # Insert a new name 216 | add_name('survey.db', ('barrett', 'Mary', 'Barrett')) 217 | # Check it exists 218 | print("Full name for barrett:", get_name('survey.db', 'barrett')) 219 | ``` 220 | 221 | ```output 222 | IndexError: list index out of range 223 | ``` 224 | 225 | Note that in versions of sqlite3 >= 2.5, the `get_name` function described 226 | above will fail with an `IndexError: list index out of range`, 227 | even though we added Mary's 228 | entry into the table using `add_name`. 229 | This is because we must perform a `connection.commit()` before closing 230 | the connection, in order to save our changes to the database. 
231 | 232 | ```python 233 | import sqlite3 234 | 235 | def add_name(database_file, new_person): 236 | query = "INSERT INTO Person (id, personal, family) VALUES (?, ?, ?);" 237 | 238 | connection = sqlite3.connect(database_file) 239 | cursor = connection.cursor() 240 | cursor.execute(query, list(new_person)) 241 | cursor.close() 242 | connection.commit() 243 | connection.close() 244 | 245 | 246 | def get_name(database_file, person_id): 247 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 248 | 249 | connection = sqlite3.connect(database_file) 250 | cursor = connection.cursor() 251 | cursor.execute(query, [person_id]) 252 | results = cursor.fetchall() 253 | cursor.close() 254 | connection.close() 255 | 256 | return results[0][0] 257 | 258 | # Insert a new name 259 | add_name('survey.db', ('barrett', 'Mary', 'Barrett')) 260 | # Check it exists 261 | print("Full name for barrett:", get_name('survey.db', 'barrett')) 262 | ``` 263 | 264 | ```output 265 | Full name for barrett: Mary Barrett 266 | ``` 267 | 268 | ::::::::::::::::::::::::::::::::::::::: challenge 269 | 270 | ## Filling a Table vs. Printing Values 271 | 272 | Write a Python program that creates a new database in a file called 273 | `original.db` containing a single table called `Pressure`, with a 274 | single field called `reading`, and inserts 100,000 random numbers 275 | between 10.0 and 25.0. How long does it take this program to run? 276 | How long does it take to run a program that simply writes those 277 | random numbers to a file? 
278 | 279 | ::::::::::::::: solution 280 | 281 | ## Solution 282 | 283 | ```python 284 | import sqlite3 285 | # import random number generator 286 | from numpy.random import uniform 287 | 288 | random_numbers = uniform(low=10.0, high=25.0, size=100000) 289 | 290 | connection = sqlite3.connect("original.db") 291 | cursor = connection.cursor() 292 | cursor.execute("CREATE TABLE Pressure (reading float not null)") 293 | query = "INSERT INTO Pressure (reading) VALUES (?);" 294 | 295 | for number in random_numbers: 296 | cursor.execute(query, [number]) 297 | 298 | cursor.close() 299 | # save changes to file for next exercise 300 | connection.commit() 301 | connection.close() 302 | ``` 303 | 304 | For comparison, the following program writes the random numbers 305 | into the file `random_numbers.txt`: 306 | 307 | ```python 308 | from numpy.random import uniform 309 | 310 | random_numbers = uniform(low=10.0, high=25.0, size=100000) 311 | with open('random_numbers.txt', 'w') as outfile: 312 | for number in random_numbers: 313 | # need to add linebreak \n 314 | outfile.write("{}\n".format(number)) 315 | ``` 316 | 317 | ::::::::::::::::::::::::: 318 | 319 | :::::::::::::::::::::::::::::::::::::::::::::::::: 320 | 321 | ::::::::::::::::::::::::::::::::::::::: challenge 322 | 323 | ## Filtering in SQL vs. Filtering in Python 324 | 325 | Write a Python program that creates a new database called 326 | `backup.db` with the same structure as `original.db` and copies all 327 | the values greater than 20.0 from `original.db` to `backup.db`. 328 | Which is faster: filtering values in the query, or reading 329 | everything into memory and filtering in Python? 330 | 331 | ::::::::::::::: solution 332 | 333 | ## Solution 334 | 335 | The first example reads all the data into memory and filters the 336 | numbers using the if statement in Python. 
337 | 
338 | ```python
339 | import sqlite3
340 | 
341 | connection_original = sqlite3.connect("original.db")
342 | cursor_original = connection_original.cursor()
343 | cursor_original.execute("SELECT * FROM Pressure;")
344 | results = cursor_original.fetchall()
345 | cursor_original.close()
346 | connection_original.close()
347 | 
348 | connection_backup = sqlite3.connect("backup.db")
349 | cursor_backup = connection_backup.cursor()
350 | cursor_backup.execute("CREATE TABLE Pressure (reading float not null)")
351 | query = "INSERT INTO Pressure (reading) VALUES (?);"
352 | 
353 | for entry in results:
354 |     # number is saved in first column of the table
355 |     if entry[0] > 20.0:
356 |         cursor_backup.execute(query, entry)
357 | 
358 | cursor_backup.close()
359 | connection_backup.commit()
360 | connection_backup.close()
361 | ```
362 | 
363 | In contrast, the following example uses a `WHERE` clause in the `SELECT` statement
364 | to filter the numbers in SQL.
365 | The only lines that changed are line 5, where only the matching values are fetched
366 | from `original.db`, and the `for` loop starting on line 15, which inserts
367 | the numbers into `backup.db`.
368 | Note how this version does not require the use of Python's `if` statement.
369 | 370 | ```python 371 | import sqlite3 372 | 373 | connection_original = sqlite3.connect("original.db") 374 | cursor_original = connection_original.cursor() 375 | cursor_original.execute("SELECT * FROM Pressure WHERE reading > 20.0;") 376 | results = cursor_original.fetchall() 377 | cursor_original.close() 378 | connection_original.close() 379 | 380 | connection_backup = sqlite3.connect("backup.db") 381 | cursor_backup = connection_backup.cursor() 382 | cursor_backup.execute("CREATE TABLE Pressure (reading float not null)") 383 | query = "INSERT INTO Pressure (reading) VALUES (?);" 384 | 385 | for entry in results: 386 | cursor_backup.execute(query, entry) 387 | 388 | cursor_backup.close() 389 | connection_backup.commit() 390 | connection_backup.close() 391 | ``` 392 | 393 | ::::::::::::::::::::::::: 394 | 395 | :::::::::::::::::::::::::::::::::::::::::::::::::: 396 | 397 | ::::::::::::::::::::::::::::::::::::::: challenge 398 | 399 | ## Generating Insert Statements 400 | 401 | One of our colleagues has sent us a 402 | [CSV](../learners/reference.md#comma-separated-values-csv) 403 | file containing 404 | temperature readings by Robert Olmstead, which is formatted like 405 | this: 406 | 407 | ```output 408 | Taken,Temp 409 | 619,-21.5 410 | 622,-15.5 411 | ``` 412 | 413 | Write a small Python program that reads this file in and INSERTs these 414 | records into the survey database. 415 | Note: you will need to add an entry for Olmstead 416 | to the `Person` table. If you are testing your program repeatedly, 417 | you may want to investigate SQL's `INSERT or REPLACE` command. 418 | 419 | 420 | :::::::::::::::::::::::::::::::::::::::::::::::::: 421 | 422 | :::::::::::::::::::::::::::::::::::::::: keypoints 423 | 424 | - General-purpose languages have libraries for accessing databases. 425 | - To connect to a database, a program must use a library specific to that database manager. 426 | - These libraries use a connection-and-cursor model. 
427 | - Programs can read query results in batches or all at once. 428 | - Queries should be written using parameter substitution, not string formatting. 429 | 430 | :::::::::::::::::::::::::::::::::::::::::::::::::: 431 | 432 | 433 | -------------------------------------------------------------------------------- /episodes/01-select.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Selecting Data 3 | teaching: 10 4 | exercises: 5 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain the difference between a table, a record, and a field. 10 | - Explain the difference between a database and a database manager. 11 | - Write a query to select all values for specific fields from a single table. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I get data from a database? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | A [relational database](../learners/reference.md#relational-database) 22 | is a way to store and manipulate information. 23 | Databases are arranged as [tables](../learners/reference.md#table). 24 | Each table has columns (also known as [fields](../learners/reference.md#fields)) that describe the data, 25 | and rows (also known as [records](../learners/reference.md#record)) which contain the data. 26 | 27 | When we are using a spreadsheet, 28 | we put formulas into cells to calculate new values based on old ones. 29 | When we are using a database, 30 | we send commands 31 | (usually called [queries](../learners/reference.md#query\)) 32 | to a [database manager](../learners/reference.md#database-manager): 33 | a program that manipulates the database for us. 34 | The database manager does whatever lookups and calculations the query specifies, 35 | returning the results in a tabular form 36 | that we can then use as a starting point for further queries. 
37 | 38 | Queries are written in a language called [SQL](../learners/reference.md#sql), 39 | which stands for "Structured Query Language". 40 | SQL provides hundreds of different ways to analyze and recombine data. 41 | We will only look at a handful of queries, 42 | but that handful accounts for most of what scientists do. 43 | 44 | ::::::::::::::::::::::::::::::::::::::::: callout 45 | 46 | ## Changing Database Managers 47 | 48 | Many database managers --- Oracle, 49 | IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite --- understand 50 | SQL but each stores data in a different way, 51 | so a database created with one cannot be used directly by another. 52 | However, every database manager 53 | can import and export data in a variety of formats like .csv, SQL, 54 | so it *is* possible to move information from one to another. 55 | 56 | 57 | :::::::::::::::::::::::::::::::::::::::::::::::::: 58 | 59 | ::::::::::::::::::::::::::::::::::::::::: callout 60 | 61 | ## Getting Into and Out Of SQLite 62 | 63 | In order to use the SQLite commands *interactively*, we need to 64 | enter into the SQLite console. So, open up a terminal, and run 65 | 66 | ```bash 67 | $ cd /path/to/survey/data/ 68 | $ sqlite3 survey.db 69 | ``` 70 | 71 | The SQLite command is `sqlite3` and you are telling SQLite to open up 72 | the `survey.db`. You need to specify the `.db` file, otherwise SQLite 73 | will open up a temporary, empty database. 74 | 75 | To get out of SQLite, type out `.exit` or `.quit`. For some 76 | terminals, `Ctrl-D` can also work. If you forget any SQLite `.` (dot) 77 | command, type `.help`. 78 | 79 | 80 | :::::::::::::::::::::::::::::::::::::::::::::::::: 81 | 82 | Before we get into using SQLite to select the data, let's take a look at the tables of the database we will use in our examples: 83 | 84 |
85 | 86 |
87 | 88 | **Person**: People who took readings, `id` being the unique identifier for that person. 89 | 90 | | id | personal | family | 91 | | --------- | --------- | ---------- | 92 | | dyer | William | Dyer | 93 | | pb | Frank | Pabodie | 94 | | lake | Anderson | Lake | 95 | | roe | Valentina | Roerich | 96 | | danforth | Frank | Danforth | 97 | 98 | **Site**: Locations of the `sites` where readings were taken. 99 | 100 | | name | lat | long | 101 | | --------- | --------- | ---------- | 102 | | DR-1 | \-49.85 | \-128.57 | 103 | | DR-3 | \-47.15 | \-126.72 | 104 | | MSK-4 | \-48.87 | \-123.4 | 105 | 106 | **Visited**: Specific identification `id` of the precise locations where readings were taken at the sites and dates. 107 | 108 | | id | site | dated | 109 | | --------- | --------- | ---------- | 110 | | 619 | DR-1 | 1927-02-08 | 111 | | 622 | DR-1 | 1927-02-10 | 112 | | 734 | DR-3 | 1930-01-07 | 113 | | 735 | DR-3 | 1930-01-12 | 114 | | 751 | DR-3 | 1930-02-26 | 115 | | 752 | DR-3 | \-null- | 116 | | 837 | MSK-4 | 1932-01-14 | 117 | | 844 | DR-1 | 1932-03-22 | 118 | 119 |
120 | 121 |
122 | 123 | **Survey**: The measurements taken at each precise location on these sites. They are identified as `taken`. The field `quant` is short for quantity and indicates what is being measured.  The values are `rad`, `sal`, and `temp` referring to 'radiation', 'salinity' and 'temperature', respectively. 124 | 125 | | taken | person | quant | reading | 126 | | --------- | --------- | ---------- | ------- | 127 | | 619 | dyer | rad | 9\.82 | 128 | | 619 | dyer | sal | 0\.13 | 129 | | 622 | dyer | rad | 7\.8 | 130 | | 622 | dyer | sal | 0\.09 | 131 | | 734 | pb | rad | 8\.41 | 132 | | 734 | lake | sal | 0\.05 | 133 | | 734 | pb | temp | \-21.5 | 134 | | 735 | pb | rad | 7\.22 | 135 | | 735 | \-null- | sal | 0\.06 | 136 | | 735 | \-null- | temp | \-26.0 | 137 | | 751 | pb | rad | 4\.35 | 138 | | 751 | pb | temp | \-18.5 | 139 | | 751 | lake | sal | 0\.1 | 140 | | 752 | lake | rad | 2\.19 | 141 | | 752 | lake | sal | 0\.09 | 142 | | 752 | lake | temp | \-16.0 | 143 | | 752 | roe | sal | 41\.6 | 144 | | 837 | lake | rad | 1\.46 | 145 | | 837 | lake | sal | 0\.21 | 146 | | 837 | roe | sal | 22\.5 | 147 | | 844 | roe | rad | 11\.25 | 148 | 149 |
150 | 151 |
152 | 
153 | Notice that three entries --- one in the `Visited` table,
154 | and two in the `Survey` table --- don't contain any actual
155 | data, but instead have a special `-null-` entry:
156 | we'll return to these missing values
157 | [later](05-null.md).
158 | 
159 | ::::::::::::::::::::::::::::::::::::::::: callout
160 | 
161 | ## Checking If Data is Available
162 | 
163 | On the shell command line,
164 | change the working directory to the one where you saved `survey.db`.
165 | If you saved it on your Desktop, you should use
166 | 
167 | ```bash
168 | $ cd Desktop
169 | $ ls | grep survey.db
170 | ```
171 | 
172 | ```output
173 | survey.db
174 | ```
175 | 
176 | If you get the same output, you can run
177 | 
178 | ```bash
179 | $ sqlite3 survey.db
180 | ```
181 | 
182 | ```output
183 | SQLite version 3.8.8 2015-01-16 12:08:06
184 | Enter ".help" for usage hints.
185 | sqlite>
186 | ```
187 | 
188 | which instructs SQLite to load the database in the `survey.db` file.
189 | 
190 | For a list of useful system commands, enter `.help`.
191 | 
192 | All SQLite-specific commands are prefixed with a `.` to distinguish them from SQL commands.
193 | 
194 | Type `.tables` to list the tables in the database.
195 | 
196 | ```sql
197 | .tables
198 | ```
199 | 
200 | ```output
201 | Person Site Survey Visited
202 | ```
203 | 
204 | If you had the above tables, you might be curious what information was stored in each table.
205 | To get more information on the tables, type `.schema` to see the SQL statements used to create the tables in the database. The statements will have a list of the columns and the data types each column stores.
206 | 
207 | ```sql
208 | .schema
209 | ```
210 | 
211 | ```output
212 | CREATE TABLE Person (id text, personal text, family text);
213 | CREATE TABLE Site (name text, lat real, long real);
214 | CREATE TABLE Survey (taken integer, person text, quant text, reading real);
215 | CREATE TABLE Visited (id integer, site text, dated text);
216 | ```
217 | 
218 | The output is formatted as \<**columnName** *dataType*\>. Thus we can see from the first line that the table **Person** has three columns:
219 | 
220 | - **id** with type *text*
221 | - **personal** with type *text*
222 | - **family** with type *text*
223 | 
224 | Note: The available data types vary based on the database manager --- you can search online for what data types are supported.
225 | 
226 | You can change some SQLite settings to make the output easier to read.
227 | First,
228 | set the output mode to display left-aligned columns.
229 | Then turn on the display of column headers.
230 | 
231 | ```sql
232 | .mode column
233 | .header on
234 | ```
235 | 
236 | Alternatively, you can get the settings automatically by creating a `.sqliterc` file.
237 | Add the commands above and reopen SQLite.
238 | For Windows, save it as `C:\Users\<username>\.sqliterc`.
239 | For Linux/macOS, save it as `.sqliterc` in your home directory (e.g. `/home/<username>/` or `/Users/<username>/`).
240 | 
241 | To exit SQLite and return to the shell command line,
242 | you can use either `.quit` or `.exit`.
243 | 
244 | 
245 | ::::::::::::::::::::::::::::::::::::::::::::::::::
246 | 
247 | For now,
248 | let's write an SQL query that displays scientists' names.
249 | We do this using the SQL command `SELECT`,
250 | giving it the names of the columns we want and the table we want them from.
251 | Our query and its output look like this:
252 | 
253 | ```sql
254 | SELECT family, personal FROM Person;
255 | ```
256 | 
257 | | family | personal |
258 | | --------- | --------- |
259 | | Dyer | William |
260 | | Pabodie | Frank |
261 | | Lake | Anderson |
262 | | Roerich | Valentina |
263 | | Danforth | Frank |
264 | 
265 | The semicolon at the end of the query
266 | tells the database manager that the query is complete and ready to run.
267 | We have written our commands in upper case and the names for the table and columns
268 | in lower case,
269 | but we don't have to:
270 | as the example below shows,
271 | SQL is [case insensitive](../learners/reference.md#case-insensitive).
272 | 
273 | ```sql
274 | SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;
275 | ```
276 | 
277 | | family | personal |
278 | | --------- | --------- |
279 | | Dyer | William |
280 | | Pabodie | Frank |
281 | | Lake | Anderson |
282 | | Roerich | Valentina |
283 | | Danforth | Frank |
284 | 
285 | You can use SQL's case insensitivity
286 | to distinguish between different parts of an SQL statement.
287 | In this lesson, we use the convention of using UPPER CASE for SQL keywords
288 | (such as `SELECT` and `FROM`),
289 | Title Case for table names, and lower case for field names.
290 | Whatever casing
291 | convention you choose, please be consistent: complex queries are hard
292 | enough to read without the extra cognitive load of random
293 | capitalization.
294 | 
295 | While we are on the topic of syntax, one aspect of SQL
296 | that can frustrate novices and experts alike is forgetting to finish a
297 | command with `;` (semicolon). When you press enter for a command
298 | without adding the `;` to the end, it can look something like this:
299 | 
300 | ```sql
301 | SELECT id FROM Person
302 | ...>
303 | ...>
304 | ```
305 | 
306 | This is SQLite's continuation prompt, where it is waiting for additional commands or
307 | for a `;` to tell SQLite that the command is finished. This is easy to fix!
Just type
308 | `;` and press enter!
309 | 
310 | Now, going back to our query,
311 | it's important to understand that
312 | the rows and columns in a database table aren't actually stored in any particular order.
313 | They will always be *displayed* in some order,
314 | but we can control that in various ways.
315 | For example,
316 | we could swap the columns in the output by writing our query as:
317 | 
318 | ```sql
319 | SELECT personal, family FROM Person;
320 | ```
321 | 
322 | | personal | family |
323 | | --------- | --------- |
324 | | William | Dyer |
325 | | Frank | Pabodie |
326 | | Anderson | Lake |
327 | | Valentina | Roerich |
328 | | Frank | Danforth |
329 | 
330 | or even repeat columns:
331 | 
332 | ```sql
333 | SELECT id, id, id FROM Person;
334 | ```
335 | 
336 | | id | id | id |
337 | | --------- | --------- | ---------- |
338 | | dyer | dyer | dyer |
339 | | pb | pb | pb |
340 | | lake | lake | lake |
341 | | roe | roe | roe |
342 | | danforth | danforth | danforth |
343 | 
344 | As a shortcut,
345 | we can select all of the columns in a table using `*`:
346 | 
347 | ```sql
348 | SELECT * FROM Person;
349 | ```
350 | 
351 | | id | personal | family |
352 | | --------- | --------- | ---------- |
353 | | dyer | William | Dyer |
354 | | pb | Frank | Pabodie |
355 | | lake | Anderson | Lake |
356 | | roe | Valentina | Roerich |
357 | | danforth | Frank | Danforth |
358 | 
359 | ::::::::::::::::::::::::::::::::::::::: challenge
360 | 
361 | ## Understanding CREATE statements
362 | 
363 | Use the `.schema` command to identify the columns that contain integers.
364 | 
365 | ::::::::::::::: solution
366 | 
367 | ## Solution
368 | 
369 | ```sql
370 | .schema
371 | ```
372 | 
373 | ```output
374 | CREATE TABLE Person (id text, personal text, family text);
375 | CREATE TABLE Site (name text, lat real, long real);
376 | CREATE TABLE Survey (taken integer, person text, quant text, reading real);
377 | CREATE TABLE Visited (id integer, site text, dated text);
378 | ```
379 | 
380 | From the output, we see that the **taken** column in the **Survey** table (3rd line) and the **id** column in the **Visited** table (4th line) are composed of integers.
381 | 
382 | 
383 | 
384 | :::::::::::::::::::::::::
385 | 
386 | ::::::::::::::::::::::::::::::::::::::::::::::::::
387 | 
388 | ::::::::::::::::::::::::::::::::::::::: challenge
389 | 
390 | ## Selecting Site Names
391 | 
392 | Write a query that selects only the `name` column from the `Site` table.
393 | 
394 | ::::::::::::::: solution
395 | 
396 | ## Solution
397 | 
398 | ```sql
399 | SELECT name FROM Site;
400 | ```
401 | 
402 | | name |
403 | | --------- |
404 | | DR-1 |
405 | | DR-3 |
406 | | MSK-4 |
407 | 
408 | :::::::::::::::::::::::::
409 | 
410 | ::::::::::::::::::::::::::::::::::::::::::::::::::
411 | 
412 | ::::::::::::::::::::::::::::::::::::::: challenge
413 | 
414 | ## Query Style
415 | 
416 | Many people format queries as:
417 | 
418 | ```sql
419 | SELECT personal, family FROM person;
420 | ```
421 | 
422 | or as:
423 | 
424 | ```sql
425 | select Personal, Family from PERSON;
426 | ```
427 | 
428 | What style do you find easiest to read, and why?
429 | 
430 | 
431 | ::::::::::::::::::::::::::::::::::::::::::::::::::
432 | 
433 | :::::::::::::::::::::::::::::::::::::::: keypoints
434 | 
435 | - A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
436 | - A database manager is a program that manipulates information stored in a database.
437 | - We write queries in a specialized language called SQL to extract information from databases.
438 | - Use SELECT... FROM...
to get values from a database table. 439 | - SQL is case-insensitive (but data is case-sensitive). 440 | 441 | :::::::::::::::::::::::::::::::::::::::::::::::::: 442 | 443 | 444 | -------------------------------------------------------------------------------- /episodes/07-join.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Combining Data 3 | teaching: 20 4 | exercises: 20 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain the operation of a query that joins two tables. 10 | - Explain how to restrict the output of a query containing a join to only include meaningful combinations of values. 11 | - Write queries that join tables on equal keys. 12 | - Explain what primary and foreign keys are, and why they are useful. 13 | 14 | :::::::::::::::::::::::::::::::::::::::::::::::::: 15 | 16 | :::::::::::::::::::::::::::::::::::::::: questions 17 | 18 | - How can I combine data from multiple tables? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | In order to submit our data to a web site 23 | that aggregates historical meteorological data, 24 | we might need to format it as 25 | latitude, longitude, date, quantity, and reading. 26 | However, 27 | our latitudes and longitudes are in the `Site` table, 28 | while the dates of measurements are in the `Visited` table 29 | and the readings themselves are in the `Survey` table. 30 | We need to combine these tables somehow. 31 | 32 | This figure shows the relations between the tables: 33 | 34 | ![](fig/sql-join-structure.svg){alt='Survey Database Structure'} 35 | 36 | The SQL command to do this is `JOIN`. 
37 | To see how it works, 38 | let's start by joining the `Site` and `Visited` tables: 39 | 40 | ```sql 41 | SELECT * FROM Site JOIN Visited; 42 | ``` 43 | 44 | | name | lat | long | id | site | dated | 45 | | ------- | -------- | ---------- | --------- | ------- | ---------- | 46 | | DR-1 | \-49.85 | \-128.57 | 619 | DR-1 | 1927-02-08 | 47 | | DR-1 | \-49.85 | \-128.57 | 622 | DR-1 | 1927-02-10 | 48 | | DR-1 | \-49.85 | \-128.57 | 734 | DR-3 | 1930-01-07 | 49 | | DR-1 | \-49.85 | \-128.57 | 735 | DR-3 | 1930-01-12 | 50 | | DR-1 | \-49.85 | \-128.57 | 751 | DR-3 | 1930-02-26 | 51 | | DR-1 | \-49.85 | \-128.57 | 752 | DR-3 | \-null- | 52 | | DR-1 | \-49.85 | \-128.57 | 837 | MSK-4 | 1932-01-14 | 53 | | DR-1 | \-49.85 | \-128.57 | 844 | DR-1 | 1932-03-22 | 54 | | DR-3 | \-47.15 | \-126.72 | 619 | DR-1 | 1927-02-08 | 55 | | DR-3 | \-47.15 | \-126.72 | 622 | DR-1 | 1927-02-10 | 56 | | DR-3 | \-47.15 | \-126.72 | 734 | DR-3 | 1930-01-07 | 57 | | DR-3 | \-47.15 | \-126.72 | 735 | DR-3 | 1930-01-12 | 58 | | DR-3 | \-47.15 | \-126.72 | 751 | DR-3 | 1930-02-26 | 59 | | DR-3 | \-47.15 | \-126.72 | 752 | DR-3 | \-null- | 60 | | DR-3 | \-47.15 | \-126.72 | 837 | MSK-4 | 1932-01-14 | 61 | | DR-3 | \-47.15 | \-126.72 | 844 | DR-1 | 1932-03-22 | 62 | | MSK-4 | \-48.87 | \-123.4 | 619 | DR-1 | 1927-02-08 | 63 | | MSK-4 | \-48.87 | \-123.4 | 622 | DR-1 | 1927-02-10 | 64 | | MSK-4 | \-48.87 | \-123.4 | 734 | DR-3 | 1930-01-07 | 65 | | MSK-4 | \-48.87 | \-123.4 | 735 | DR-3 | 1930-01-12 | 66 | | MSK-4 | \-48.87 | \-123.4 | 751 | DR-3 | 1930-02-26 | 67 | | MSK-4 | \-48.87 | \-123.4 | 752 | DR-3 | \-null- | 68 | | MSK-4 | \-48.87 | \-123.4 | 837 | MSK-4 | 1932-01-14 | 69 | | MSK-4 | \-48.87 | \-123.4 | 844 | DR-1 | 1932-03-22 | 70 | 71 | `JOIN` creates 72 | the [cross product](../learners/reference.md#cross-product) 73 | of two tables, 74 | i.e., 75 | it joins each record of one table with each record of the other table 76 | to give all possible combinations. 
77 | Since there are three records in `Site` 78 | and eight in `Visited`, 79 | the join's output has 24 records (3 \* 8 = 24) . 80 | And since each table has three fields, 81 | the output has six fields (3 + 3 = 6). 82 | 83 | What the join *hasn't* done is 84 | figure out if the records being joined have anything to do with each other. 85 | It has no way of knowing whether they do or not until we tell it how. 86 | To do that, 87 | we add a clause specifying that 88 | we're only interested in combinations that have the same site name, 89 | thus we need to use a filter: 90 | 91 | ```sql 92 | SELECT 93 | * 94 | FROM 95 | Site 96 | JOIN Visited ON Site.name = Visited.site; 97 | ``` 98 | 99 | | name | lat | long | id | site | dated | 100 | | ------- | -------- | ---------- | --------- | ------- | ---------- | 101 | | DR-1 | \-49.85 | \-128.57 | 619 | DR-1 | 1927-02-08 | 102 | | DR-1 | \-49.85 | \-128.57 | 622 | DR-1 | 1927-02-10 | 103 | | DR-1 | \-49.85 | \-128.57 | 844 | DR-1 | 1932-03-22 | 104 | | DR-3 | \-47.15 | \-126.72 | 734 | DR-3 | 1930-01-07 | 105 | | DR-3 | \-47.15 | \-126.72 | 735 | DR-3 | 1930-01-12 | 106 | | DR-3 | \-47.15 | \-126.72 | 751 | DR-3 | 1930-02-26 | 107 | | DR-3 | \-47.15 | \-126.72 | 752 | DR-3 | \-null- | 108 | | MSK-4 | \-48.87 | \-123.4 | 837 | MSK-4 | 1932-01-14 | 109 | 110 | `ON` is very similar to `WHERE`, 111 | and for all the queries in this lesson you can use them interchangeably. 112 | There are differences in how they affect [outer joins][outer], 113 | but that's beyond the scope of this lesson. 114 | Once we add this to our query, 115 | the database manager throws away records 116 | that combined information about two different sites, 117 | leaving us with just the ones we want. 118 | 119 | Notice that we used `Table.field` to specify field names 120 | in the output of the join. 121 | We do this because tables can have fields with the same name, 122 | and we need to be specific which ones we're talking about. 
123 | For example, 124 | if we joined the `Person` and `Visited` tables, 125 | the result would inherit a field called `id` 126 | from each of the original tables. 127 | 128 | We can now use the same dotted notation 129 | to select the three columns we actually want 130 | out of our join: 131 | 132 | ```sql 133 | SELECT 134 | Site.lat, 135 | Site.long, 136 | Visited.dated 137 | FROM 138 | Site 139 | JOIN Visited ON Site.name = Visited.site; 140 | ``` 141 | 142 | | lat | long | dated | 143 | | ------- | -------- | ---------- | 144 | | \-49.85 | \-128.57 | 1927-02-08 | 145 | | \-49.85 | \-128.57 | 1927-02-10 | 146 | | \-49.85 | \-128.57 | 1932-03-22 | 147 | | \-47.15 | \-126.72 | \-null- | 148 | | \-47.15 | \-126.72 | 1930-01-12 | 149 | | \-47.15 | \-126.72 | 1930-02-26 | 150 | | \-47.15 | \-126.72 | 1930-01-07 | 151 | | \-48.87 | \-123.4 | 1932-01-14 | 152 | 153 | If joining two tables is good, 154 | joining many tables must be better. 155 | In fact, 156 | we can join any number of tables 157 | simply by adding more `JOIN` clauses to our query, 158 | and more `ON` tests to filter out combinations of records 159 | that don't make sense: 160 | 161 | ```sql 162 | SELECT 163 | Site.lat, 164 | Site.long, 165 | Visited.dated, 166 | Survey.quant, 167 | Survey.reading 168 | FROM 169 | Site 170 | JOIN Visited 171 | JOIN Survey ON Site.name = Visited.site 172 | AND Visited.id = Survey.taken 173 | AND Visited.dated IS NOT NULL; 174 | ``` 175 | 176 | | lat | long | dated | quant | reading | 177 | | ------- | -------- | ---------- | --------- | ------- | 178 | | \-49.85 | \-128.57 | 1927-02-08 | rad | 9\.82 | 179 | | \-49.85 | \-128.57 | 1927-02-08 | sal | 0\.13 | 180 | | \-49.85 | \-128.57 | 1927-02-10 | rad | 7\.8 | 181 | | \-49.85 | \-128.57 | 1927-02-10 | sal | 0\.09 | 182 | | \-47.15 | \-126.72 | 1930-01-07 | rad | 8\.41 | 183 | | \-47.15 | \-126.72 | 1930-01-07 | sal | 0\.05 | 184 | | \-47.15 | \-126.72 | 1930-01-07 | temp | \-21.5 | 185 | | \-47.15 | \-126.72 | 1930-01-12 
| rad | 7\.22 | 186 | | \-47.15 | \-126.72 | 1930-01-12 | sal | 0\.06 | 187 | | \-47.15 | \-126.72 | 1930-01-12 | temp | \-26.0 | 188 | | \-47.15 | \-126.72 | 1930-02-26 | rad | 4\.35 | 189 | | \-47.15 | \-126.72 | 1930-02-26 | sal | 0\.1 | 190 | | \-47.15 | \-126.72 | 1930-02-26 | temp | \-18.5 | 191 | | \-48.87 | \-123.4 | 1932-01-14 | rad | 1\.46 | 192 | | \-48.87 | \-123.4 | 1932-01-14 | sal | 0\.21 | 193 | | \-48.87 | \-123.4 | 1932-01-14 | sal | 22\.5 | 194 | | \-49.85 | \-128.57 | 1932-03-22 | rad | 11\.25 | 195 | 196 | We can tell which records from `Site`, `Visited`, and `Survey` 197 | correspond with each other 198 | because those tables contain 199 | [primary keys](../learners/reference.md#primary-key) 200 | and [foreign keys](../learners/reference.md#foreign-key). 201 | A primary key is a value, 202 | or combination of values, 203 | that uniquely identifies each record in a table. 204 | A foreign key is a value (or combination of values) from one table 205 | that identifies a unique record in another table. 206 | Another way of saying this is that 207 | a foreign key is the primary key of one table 208 | that appears in some other table. 209 | In our database, 210 | `Person.id` is the primary key in the `Person` table, 211 | while `Survey.person` is a foreign key 212 | relating the `Survey` table's entries 213 | to entries in `Person`. 214 | 215 | Most database designers believe that 216 | every table should have a well-defined primary key. 217 | They also believe that this key should be separate from the data itself, 218 | so that if we ever need to change the data, 219 | we only need to make one change in one place. 220 | One easy way to do this is 221 | to create an arbitrary, unique ID for each record 222 | as we add it to the database. 
223 | This is actually very common: 224 | those IDs have names like "student numbers" and "patient numbers", 225 | and they almost always turn out to have originally been 226 | a unique record identifier in some database system or other. 227 | As the query below demonstrates, 228 | SQLite [automatically numbers records][rowid] as they're added to tables, 229 | and we can use those record numbers in queries: 230 | 231 | ```sql 232 | SELECT rowid, * FROM Person; 233 | ``` 234 | 235 | | rowid | id | personal | family | 236 | | ------- | -------- | ---------- | --------- | 237 | | 1 | dyer | William | Dyer | 238 | | 2 | pb | Frank | Pabodie | 239 | | 3 | lake | Anderson | Lake | 240 | | 4 | roe | Valentina | Roerich | 241 | | 5 | danforth | Frank | Danforth | 242 | 243 | ::::::::::::::::::::::::::::::::::::::: challenge 244 | 245 | ## Listing Radiation Readings 246 | 247 | Write a query that lists all radiation readings from the DR-1 site. 248 | 249 | ::::::::::::::: solution 250 | 251 | ## Solution 252 | 253 | ```sql 254 | SELECT 255 | Survey.reading 256 | FROM 257 | Site 258 | JOIN 259 | Visited 260 | JOIN 261 | Survey 262 | ON Site.name = Visited.site 263 | AND Visited.id = Survey.taken 264 | WHERE 265 | Site.name = 'DR-1' 266 | AND Survey.quant = 'rad'; 267 | ``` 268 | 269 | | reading | 270 | | ------- | 271 | | 9\.82 | 272 | | 7\.8 | 273 | | 11\.25 | 274 | 275 | ::::::::::::::::::::::::: 276 | 277 | :::::::::::::::::::::::::::::::::::::::::::::::::: 278 | 279 | ::::::::::::::::::::::::::::::::::::::: challenge 280 | 281 | ## Where's Frank? 282 | 283 | Write a query that lists all sites visited by people named "Frank". 
284 | 285 | ::::::::::::::: solution 286 | 287 | ## Solution 288 | 289 | ```sql 290 | SELECT 291 | DISTINCT Site.name 292 | FROM 293 | Site 294 | JOIN Visited 295 | JOIN Survey 296 | JOIN Person ON Site.name = Visited.site 297 | AND Visited.id = Survey.taken 298 | AND Survey.person = Person.id 299 | WHERE 300 | Person.personal = 'Frank'; 301 | ``` 302 | 303 | | name | 304 | | ------- | 305 | | DR-3 | 306 | 307 | ::::::::::::::::::::::::: 308 | 309 | :::::::::::::::::::::::::::::::::::::::::::::::::: 310 | 311 | ::::::::::::::::::::::::::::::::::::::: challenge 312 | 313 | ## Reading Queries 314 | 315 | Describe in your own words what the following query produces: 316 | 317 | ```sql 318 | SELECT Site.name FROM Site JOIN Visited 319 | ON Site.lat < -49.0 AND Site.name = Visited.site AND Visited.dated >= '1932-01-01'; 320 | ``` 321 | 322 | :::::::::::::::::::::::::::::::::::::::::::::::::: 323 | 324 | ::::::::::::::::::::::::::::::::::::::: challenge 325 | 326 | ## Who Has Been Where? 327 | 328 | Write a query that shows each site with exact location (lat, long) ordered by visited date, 329 | followed by personal name and family name of the person who visited the site 330 | and the type of measurement taken and its reading. Please avoid all null values. 331 | Tip: you should get 15 records with 8 fields. 
332 | 333 | ::::::::::::::: solution 334 | 335 | ## Solution 336 | 337 | ```sql 338 | SELECT Site.name, Site.lat, Site.long, Person.personal, Person.family, Survey.quant, Survey.reading, Visited.dated 339 | FROM 340 | Site 341 | JOIN 342 | Visited 343 | JOIN 344 | Survey 345 | JOIN 346 | Person 347 | ON Site.name = Visited.site 348 | AND Visited.id = Survey.taken 349 | AND Survey.person = Person.id 350 | WHERE 351 | Survey.person IS NOT NULL 352 | AND Visited.dated IS NOT NULL 353 | ORDER BY 354 | Visited.dated; 355 | ``` 356 | 357 | | name | lat | long | personal | family | quant | reading | dated | 358 | | ------- | -------- | ---------- | --------- | ------- | ---------- | ------- | ---------- | 359 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | rad | 9\.82 | 1927-02-08 | 360 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | sal | 0\.13 | 1927-02-08 | 361 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | rad | 7\.8 | 1927-02-10 | 362 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | sal | 0\.09 | 1927-02-10 | 363 | | DR-3 | \-47.15 | \-126.72 | Anderson | Lake | sal | 0\.05 | 1930-01-07 | 364 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 8\.41 | 1930-01-07 | 365 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | temp | \-21.5 | 1930-01-07 | 366 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 7\.22 | 1930-01-12 | 367 | | DR-3 | \-47.15 | \-126.72 | Anderson | Lake | sal | 0\.1 | 1930-02-26 | 368 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 4\.35 | 1930-02-26 | 369 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | temp | \-18.5 | 1930-02-26 | 370 | | MSK-4 | \-48.87 | \-123.4 | Anderson | Lake | rad | 1\.46 | 1932-01-14 | 371 | | MSK-4 | \-48.87 | \-123.4 | Anderson | Lake | sal | 0\.21 | 1932-01-14 | 372 | | MSK-4 | \-48.87 | \-123.4 | Valentina | Roerich | sal | 22\.5 | 1932-01-14 | 373 | | DR-1 | \-49.85 | \-128.57 | Valentina | Roerich | rad | 11\.25 | 1932-03-22 | 374 | 375 | ::::::::::::::::::::::::: 376 | 377 | 
:::::::::::::::::::::::::::::::::::::::::::::::::: 378 | 379 | A good visual explanation of joins can be [found here][joinref] 380 | 381 | [outer]: https://en.wikipedia.org/wiki/Join_%28SQL%29#Outer_join 382 | [rowid]: https://www.sqlite.org/lang_createtable.html#rowid 383 | [joinref]: https://sql-joins.leopard.in.ua/ 384 | 385 | 386 | :::::::::::::::::::::::::::::::::::::::: keypoints 387 | 388 | - Use JOIN to combine data from two tables. 389 | - Use table.field notation to refer to fields when doing joins. 390 | - Every fact should be represented in a database exactly once. 391 | - A join produces all combinations of records from one table with records from another. 392 | - A primary key is a field (or set of fields) whose values uniquely identify the records in a table. 393 | - A foreign key is a field (or set of fields) in one table whose values are a primary key in another table. 394 | - We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables. 395 | - The most common join condition is matching keys. 396 | 397 | :::::::::::::::::::::::::::::::::::::::::::::::::: 398 | 399 | 400 | -------------------------------------------------------------------------------- /episodes/06-agg.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Aggregation 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Define aggregation and give examples of its use. 10 | - Write queries that compute aggregated values. 11 | - Trace the execution of a query that performs aggregation. 12 | - Explain how missing data is handled during aggregation. 13 | 14 | :::::::::::::::::::::::::::::::::::::::::::::::::: 15 | 16 | :::::::::::::::::::::::::::::::::::::::: questions 17 | 18 | - How can I calculate sums, averages, and other summary values? 
19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | We now want to calculate ranges and averages for our data. 23 | We know how to select all of the dates from the `Visited` table: 24 | 25 | ```sql 26 | SELECT dated FROM Visited; 27 | ``` 28 | 29 | | dated | 30 | | ---------------- | 31 | | 1927-02-08 | 32 | | 1927-02-10 | 33 | | 1930-01-07 | 34 | | 1930-01-12 | 35 | | 1930-02-26 | 36 | | \-null- | 37 | | 1932-01-14 | 38 | | 1932-03-22 | 39 | 40 | but to combine them, 41 | we must use an [aggregation function](../learners/reference.md#aggregation-function) 42 | such as `min` or `max`. 43 | Each of these functions takes a set of records as input, 44 | and produces a single record as output: 45 | 46 | ```sql 47 | SELECT min(dated) FROM Visited; 48 | ``` 49 | 50 | | min(dated) | 51 | | ---------------- | 52 | | 1927-02-08 | 53 | 54 | ![](fig/sql-aggregation.svg){alt='SQL Aggregation'} 55 | 56 | ```sql 57 | SELECT max(dated) FROM Visited; 58 | ``` 59 | 60 | | max(dated) | 61 | | ---------------- | 62 | | 1932-03-22 | 63 | 64 | `min` and `max` are just two of 65 | the aggregation functions built into SQL. 66 | Three others are `avg`, 67 | `count`, 68 | and `sum`: 69 | 70 | ```sql 71 | SELECT avg(reading) FROM Survey WHERE quant = 'sal'; 72 | ``` 73 | 74 | | avg(reading) | 75 | | ---------------- | 76 | | 7\.20333333333333 | 77 | 78 | ```sql 79 | SELECT count(reading) FROM Survey WHERE quant = 'sal'; 80 | ``` 81 | 82 | | count(reading) | 83 | | ---------------- | 84 | | 9 | 85 | 86 | ```sql 87 | SELECT sum(reading) FROM Survey WHERE quant = 'sal'; 88 | ``` 89 | 90 | | sum(reading) | 91 | | ---------------- | 92 | | 64\.83 | 93 | 94 | We used `count(reading)` here, 95 | but could have used `count(*)`, 96 | since the function doesn't care about the values themselves, 97 | just how many rows there are. 
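For instance, counting all rows and counting the `reading` column give the same answer for the salinity measurements, since none of those readings is missing (assuming the same `survey.db` used throughout this lesson):

```sql
SELECT count(*), count(reading) FROM Survey WHERE quant = 'sal';
```

| count(\*) | count(reading) |
| ---------------- | -------------- |
| 9 | 9 |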
98 | 99 | Even a column other than `reading` could be used, 100 | but note that any `NULL` value will not be counted 101 | (to see, try `count`ing the `person` column, which contains a 102 | row with a `NULL`). 103 | This perhaps non-obvious behavior 104 | of aggregation functions is covered later 105 | in this episode. 106 | 107 | SQL lets us do several aggregations at once. 108 | We can, 109 | for example, 110 | find the range of sensible salinity measurements: 111 | 112 | ```sql 113 | SELECT min(reading), max(reading) FROM Survey WHERE quant = 'sal' AND reading <= 1.0; 114 | ``` 115 | 116 | | min(reading) | max(reading) | 117 | | ---------------- | -------------- | 118 | | 0\.05 | 0\.21 | 119 | 120 | We can also combine aggregated results with raw results, 121 | although the output might surprise you: 122 | 123 | ```sql 124 | SELECT person, count(*) FROM Survey WHERE quant = 'sal' AND reading <= 1.0; 125 | ``` 126 | 127 | | person | count(\*) | 128 | | ---------------- | -------------- | 129 | | lake | 7 | 130 | 131 | Why does Lake's name appear rather than Roerich's or Dyer's? 132 | The answer is that when it has to aggregate a field, 133 | but isn't told how to, 134 | the database manager chooses an actual value from the input set. 135 | It might use the first one processed, 136 | the last one, 137 | or something else entirely. 
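The same arbitrary choice happens with any table. For example (again with this lesson's `survey.db`), the query below reports the count of 8 visits alongside one of the eight dates, but which date appears is up to the database manager:

```sql
SELECT dated, count(*) FROM Visited;
```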
138 | 139 | Another important fact is that when there are no values to aggregate --- 140 | for example, where there are no rows satisfying the `WHERE` clause --- 141 | aggregation's result is "don't know" 142 | rather than zero or some other arbitrary value: 143 | 144 | ```sql 145 | SELECT person, max(reading), sum(reading) FROM Survey WHERE quant = 'missing'; 146 | ``` 147 | 148 | | person | max(reading) | sum(reading) | 149 | | ---------------- | -------------- | ---------------------- | 150 | | \-null- | \-null- | \-null- | 151 | 152 | One final important feature of aggregation functions is that 153 | they are inconsistent with the rest of SQL in a very useful way. 154 | If we add two values, 155 | and one of them is null, 156 | the result is null. 157 | By extension, 158 | if we use `sum` to add all the values in a set, 159 | and any of those values are null, 160 | the result should also be null. 161 | It's much more useful, 162 | though, 163 | for aggregation functions to ignore null values 164 | and only combine those that are non-null. 165 | This behavior lets us write our queries as: 166 | 167 | ```sql 168 | SELECT min(dated) FROM Visited; 169 | ``` 170 | 171 | | min(dated) | 172 | | ---------------- | 173 | | 1927-02-08 | 174 | 175 | instead of always having to filter explicitly: 176 | 177 | ```sql 178 | SELECT min(dated) FROM Visited WHERE dated IS NOT NULL; 179 | ``` 180 | 181 | | min(dated) | 182 | | ---------------- | 183 | | 1927-02-08 | 184 | 185 | Aggregating all records at once doesn't always make sense. 186 | For example, 187 | suppose we suspect that there is a systematic bias in our data, 188 | and that some scientists' radiation readings are higher than others. 
We know that this doesn't work:
190 | 
191 | ```sql
192 | SELECT person, count(reading), round(avg(reading), 2)
193 | FROM Survey
194 | WHERE quant = 'rad';
195 | ```
196 | 
197 | | person | count(reading) | round(avg(reading), 2) |
198 | | ---------------- | -------------- | ---------------------- |
199 | | roe | 8 | 6\.56 |
200 | 
201 | because the database manager selects a single arbitrary scientist's name
202 | rather than aggregating separately for each scientist.
203 | Since there are only five scientists,
204 | we could write five queries of the form:
205 | 
206 | ```sql
207 | SELECT person, count(reading), round(avg(reading), 2)
208 | FROM Survey
209 | WHERE quant = 'rad'
210 | AND person = 'dyer';
211 | ```
212 | 
213 | | person | count(reading) | round(avg(reading), 2) |
214 | | ---------------- | -------------- | ---------------------- |
215 | | dyer | 2 | 8\.81 |
216 | 
217 | but this would be tedious,
218 | and if we ever had a data set with fifty or five hundred scientists,
219 | the chances of us getting all of those queries right are small.
220 | 
221 | What we need to do is
222 | tell the database manager to aggregate the readings for each scientist separately
223 | using a `GROUP BY` clause:
224 | 
225 | ```sql
226 | SELECT person, count(reading), round(avg(reading), 2)
227 | FROM Survey
228 | WHERE quant = 'rad'
229 | GROUP BY person;
230 | ```
231 | 
232 | | person | count(reading) | round(avg(reading), 2) |
233 | | ---------------- | -------------- | ---------------------- |
234 | | dyer | 2 | 8\.81 |
235 | | lake | 2 | 1\.82 |
236 | | pb | 3 | 6\.66 |
237 | | roe | 1 | 11\.25 |
238 | 
239 | `GROUP BY` does exactly what its name implies:
240 | groups all the records with the same value for the specified field together
241 | so that aggregation can process each batch separately.
242 | Since all the records in each batch have the same value for `person`, 243 | it no longer matters that the database manager 244 | is picking an arbitrary one to display 245 | alongside the aggregated `reading` values. 246 | 247 | Just as we can sort by multiple criteria at once, 248 | we can also group by multiple criteria. 249 | To get the average reading by scientist and quantity measured, 250 | for example, 251 | we just add another field to the `GROUP BY` clause: 252 | 253 | ```sql 254 | SELECT person, quant, count(reading), round(avg(reading), 2) 255 | FROM Survey 256 | GROUP BY person, quant; 257 | ``` 258 | 259 | | person | quant | count(reading) | round(avg(reading), 2) | 260 | | ---------------- | -------------- | ---------------------- | ---------------------- | 261 | | \-null- | sal | 1 | 0\.06 | 262 | | \-null- | temp | 1 | \-26.0 | 263 | | dyer | rad | 2 | 8\.81 | 264 | | dyer | sal | 2 | 0\.11 | 265 | | lake | rad | 2 | 1\.82 | 266 | | lake | sal | 4 | 0\.11 | 267 | | lake | temp | 1 | \-16.0 | 268 | | pb | rad | 3 | 6\.66 | 269 | | pb | temp | 2 | \-20.0 | 270 | | roe | rad | 1 | 11\.25 | 271 | | roe | sal | 2 | 32\.05 | 272 | 273 | Note that we have added `quant` to the list of fields displayed, 274 | since the results wouldn't make much sense otherwise. 
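Grouping works just as well with a single criterion. For example, to count how many readings of each quantity were taken (again assuming this lesson's `survey.db`), we can group by `quant` alone:

```sql
SELECT quant, count(reading) FROM Survey GROUP BY quant;
```

| quant | count(reading) |
| ---------------- | -------------- |
| rad | 8 |
| sal | 9 |
| temp | 4 |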
275 | 
276 | Let's go one step further and remove all the entries
277 | where we don't know who took the measurement:
278 | 
279 | ```sql
280 | SELECT person, quant, count(reading), round(avg(reading), 2)
281 | FROM Survey
282 | WHERE person IS NOT NULL
283 | GROUP BY person, quant
284 | ORDER BY person, quant;
285 | ```
286 | 
287 | | person | quant | count(reading) | round(avg(reading), 2) |
288 | | ---------------- | -------------- | ---------------------- | ---------------------- |
289 | | dyer | rad | 2 | 8\.81 |
290 | | dyer | sal | 2 | 0\.11 |
291 | | lake | rad | 2 | 1\.82 |
292 | | lake | sal | 4 | 0\.11 |
293 | | lake | temp | 1 | \-16.0 |
294 | | pb | rad | 3 | 6\.66 |
295 | | pb | temp | 2 | \-20.0 |
296 | | roe | rad | 1 | 11\.25 |
297 | | roe | sal | 2 | 32\.05 |
298 | 
299 | Looking more closely,
300 | this query:
301 | 
302 | 1. selected records from the `Survey` table
303 | where the `person` field was not null;
304 | 
305 | 2. grouped those records into subsets
306 | so that the `person` and `quant` values in each subset
307 | were the same;
308 | 
309 | 3. counted the number of records in each subset,
310 | calculated the average `reading` in each,
311 | and chose a `person` and `quant` value from each
312 | (it doesn't matter which ones,
313 | since they're all equal);
314 | and
315 | 
316 | 4. ordered those results first by `person`,
317 | and then within each sub-group by `quant`.
318 | 
319 | ::::::::::::::::::::::::::::::::::::::: challenge
320 | 
321 | ## Counting Temperature Readings
322 | 
323 | How many temperature readings did Frank Pabodie record,
324 | and what was their average value?
325 | 
326 | ::::::::::::::: solution
327 | 
328 | ## Solution
329 | 
330 | ```sql
331 | SELECT count(reading), avg(reading) FROM Survey WHERE quant = 'temp' AND person = 'pb';
332 | ```
333 | 
334 | | count(reading) | avg(reading) |
335 | | ---------------- | -------------- |
336 | | 2 | \-20.0 |
337 | 
338 | :::::::::::::::::::::::::
339 | 
340 | ::::::::::::::::::::::::::::::::::::::::::::::::::
341 | 
342 | ::::::::::::::::::::::::::::::::::::::: challenge
343 | 
344 | ## Averaging with NULL
345 | 
346 | The average of a set of values is the sum of the values
347 | divided by the number of values.
348 | Does this mean that the `avg` function returns 2.0 or 3.0
349 | when given the values 1.0, `null`, and 5.0?
350 | 
351 | ::::::::::::::: solution
352 | 
353 | ## Solution
354 | 
355 | The answer is 3.0.
356 | `NULL` is not a value; it is the absence of a value.
357 | As such it is not included in the calculation.
358 | 
359 | You can confirm this by executing the following code:
360 | 
361 | ```sql
362 | SELECT AVG(a) FROM (
363 | SELECT 1 AS a
364 | UNION ALL SELECT NULL
365 | UNION ALL SELECT 5);
366 | ```
367 | 
368 | :::::::::::::::::::::::::
369 | 
370 | ::::::::::::::::::::::::::::::::::::::::::::::::::
371 | 
372 | ::::::::::::::::::::::::::::::::::::::: challenge
373 | 
374 | ## What Does This Query Do?
375 | 
376 | We want to calculate the difference between
377 | each individual radiation reading
378 | and the average of all the radiation readings.
379 | We write the query:
380 | 
381 | ```sql
382 | SELECT reading - avg(reading) FROM Survey WHERE quant = 'rad';
383 | ```
384 | 
385 | What does this actually produce, and can you think of why?
386 | 
387 | ::::::::::::::: solution
388 | 
389 | ## Solution
390 | 
391 | The query produces only one row of results, when what we really want is a result for each of the readings.
392 | The `avg()` function produces only a single value, and because it is run first, the table is reduced to a single row.
The `reading` value is simply an arbitrary one.
394 | 
395 | To achieve what we wanted, we would have to run two queries:
396 | 
397 | ```sql
398 | SELECT avg(reading) FROM Survey WHERE quant='rad';
399 | ```
400 | 
401 | This produces the average value (6.5625), which we can then insert into a second query:
402 | 
403 | ```sql
404 | SELECT reading - 6.5625 FROM Survey WHERE quant = 'rad';
405 | ```
406 | 
407 | This produces what we want, but we can combine this into a single query using subqueries.
408 | 
409 | ```sql
410 | SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') FROM Survey WHERE quant = 'rad';
411 | ```
412 | 
413 | This way we don't have to execute two queries.
414 | 
415 | In summary, what we have done is replace `avg(reading)` with `(SELECT avg(reading) FROM Survey WHERE quant='rad')` in the original query.
416 | 
417 | :::::::::::::::::::::::::
418 | 
419 | ::::::::::::::::::::::::::::::::::::::::::::::::::
420 | 
421 | ::::::::::::::::::::::::::::::::::::::: challenge
422 | 
423 | ## Using the group\_concat function
424 | 
425 | The function `group_concat(field, separator)`
426 | concatenates all the values in a field
427 | using the specified separator character
428 | (or ',' if the separator isn't specified).
429 | Use this to produce a one-line list of scientists' names,
430 | such as:
431 | 
432 | ```output
433 | William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
434 | ```
435 | 
436 | Can you find a way to list all the scientists' family names separated by a comma?
437 | Can you find a way to list all the scientists' personal and family names separated by a comma?
438 | 439 | ::::::::::::::: solution 440 | 441 | List all the family names separated by a comma: 442 | 443 | ```sql 444 | SELECT group_concat(family, ',') FROM Person; 445 | ``` 446 | 447 | List all the full names separated by a comma: 448 | 449 | ```sql 450 | SELECT group_concat(personal || ' ' || family, ',') FROM Person; 451 | ``` 452 | 453 | ::::::::::::::::::::::::: 454 | 455 | :::::::::::::::::::::::::::::::::::::::::::::::::: 456 | 457 | :::::::::::::::::::::::::::::::::::::::: keypoints 458 | 459 | - Use aggregation functions to combine multiple values. 460 | - Aggregation functions ignore `null` values. 461 | - Aggregation happens after filtering. 462 | - Use GROUP BY to combine subsets separately. 463 | - If no aggregation function is specified for a field, the query may return an arbitrary value for that field. 464 | 465 | :::::::::::::::::::::::::::::::::::::::::::::::::: 466 | 467 | 468 | --------------------------------------------------------------------------------