├── requirements.txt
├── .github
│   └── workflows
│       ├── sandpaper-version.txt
│       ├── pr-close-signal.yaml
│       ├── pr-post-remove-branch.yaml
│       ├── pr-preflight.yaml
│       ├── sandpaper-main.yaml
│       ├── update-workflows.yaml
│       ├── pr-receive.yaml
│       ├── update-cache.yaml
│       ├── pr-comment.yaml
│       └── README.md
├── .update-copyright.conf
├── episodes
│   ├── data
│   │   ├── site.csv
│   │   ├── person.csv
│   │   ├── visited.csv
│   │   └── survey.csv
│   ├── files
│   │   ├── survey.db
│   │   └── sql-novice-survey-data.zip
│   ├── fig
│   │   ├── sql-filter.odg
│   │   ├── sql-aggregation.odg
│   │   └── sql-join-structure.svg
│   ├── 08-hygiene.md
│   ├── 04-calc.md
│   ├── 02-sort-dup.md
│   ├── 05-null.md
│   ├── 11-prog-R.md
│   ├── 09-create.md
│   ├── 03-filter.md
│   ├── 10-prog.md
│   ├── 01-select.md
│   ├── 07-join.md
│   └── 06-agg.md
├── site
│   └── README.md
├── profiles
│   └── learner-profiles.md
├── learners
│   ├── discuss.md
│   ├── setup.md
│   └── reference.md
├── CITATION
├── CODE_OF_CONDUCT.md
├── .editorconfig
├── AUTHORS
├── bin
│   └── create-db.sql
├── .gitignore
├── .zenodo.json
├── .mailmap
├── index.md
├── README.md
├── config.yaml
├── LICENSE.md
├── instructors
│   └── instructor-notes.md
└── CONTRIBUTING.md
/requirements.txt: -------------------------------------------------------------------------------- 1 | PyYAML 2 | update-copyright 3 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-version.txt: -------------------------------------------------------------------------------- 1 | 0.16.12 2 | -------------------------------------------------------------------------------- /.update-copyright.conf: -------------------------------------------------------------------------------- 1 | [project] 2 | vcs: Git 3 | 4 | [files] 5 | authors: yes 6 | files: no 7 | -------------------------------------------------------------------------------- /episodes/data/site.csv: -------------------------------------------------------------------------------- 1 | DR-1,-49.85,-128.57 2 | DR-3,-47.15,-126.72 3 | MSK-4,-48.87,-123.4 4 | 
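Aside: the CSV files under `episodes/data/` are imported into the `Site`, `Person`, `Visited`, and `Survey` tables by `bin/create-db.sql` (shown later in this dump). As a minimal sketch — not part of the lesson itself, and assuming the three `site.csv` columns map to the `name`, `lat`, and `long` fields declared in that script — the same import can be reproduced with Python's built-in `sqlite3` module:

```python
import csv
import io
import sqlite3

# Rows exactly as they appear in episodes/data/site.csv.
SITE_CSV = """\
DR-1,-49.85,-128.57
DR-3,-47.15,-126.72
MSK-4,-48.87,-123.4
"""

conn = sqlite3.connect(":memory:")
# Schema taken from bin/create-db.sql.
conn.execute("CREATE TABLE Site (name text, lat real, long real)")
rows = list(csv.reader(io.StringIO(SITE_CSV)))
conn.executemany("INSERT INTO Site VALUES (?, ?, ?)", rows)

# The lat/long columns are declared REAL, so SQLite's type affinity
# coerces the CSV strings to floats on insert.
for name, lat, long_ in conn.execute("SELECT * FROM Site ORDER BY name"):
    print(name, lat, long_)
```

This mirrors what `.mode csv` / `.import` do in the `sqlite3` shell, minus the NULL clean-up that `create-db.sql` performs afterwards.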
-------------------------------------------------------------------------------- /site/README.md: -------------------------------------------------------------------------------- 1 | This directory contains rendered lesson materials. Please do not edit files 2 | here. 3 | -------------------------------------------------------------------------------- /episodes/files/survey.db: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/files/survey.db -------------------------------------------------------------------------------- /episodes/fig/sql-filter.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/fig/sql-filter.odg -------------------------------------------------------------------------------- /profiles/learner-profiles.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: FIXME 3 | --- 4 | 5 | This is a placeholder file. Please add content here. 
6 | -------------------------------------------------------------------------------- /episodes/fig/sql-aggregation.odg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/fig/sql-aggregation.odg -------------------------------------------------------------------------------- /episodes/data/person.csv: -------------------------------------------------------------------------------- 1 | dyer,William,Dyer 2 | pb,Frank,Pabodie 3 | lake,Anderson,Lake 4 | roe,Valentina,Roerich 5 | danforth,Frank,Danforth 6 | -------------------------------------------------------------------------------- /episodes/files/sql-novice-survey-data.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/swcarpentry/sql-novice-survey/HEAD/episodes/files/sql-novice-survey-data.zip -------------------------------------------------------------------------------- /episodes/data/visited.csv: -------------------------------------------------------------------------------- 1 | 619,DR-1,1927-02-08 2 | 622,DR-1,1927-02-10 3 | 734,DR-3,1930-01-07 4 | 735,DR-3,1930-01-12 5 | 751,DR-3,1930-02-26 6 | 752,DR-3, 7 | 837,MSK-4,1932-01-14 8 | 844,DR-1,1932-03-22 9 | -------------------------------------------------------------------------------- /learners/discuss.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Discussion 3 | --- 4 | 5 | Relational databases are the most widely used by far, 6 | but other kinds also exist, 7 | such as the document-oriented database [MongoDB](https://www.mongodb.com/). 
8 | 9 | 10 | -------------------------------------------------------------------------------- /CITATION: -------------------------------------------------------------------------------- 1 | Please cite as: 2 | 3 | Abigail Cabunoc and Sheldon McKay (eds): "Software Carpentry: Using 4 | Databases and SQL." Version 2017.08, August 2017, 5 | https://github.com/swcarpentry/sql-novice-survey, 6 | https://doi.org/10.5281/zenodo.838776 7 | -------------------------------------------------------------------------------- /episodes/data/survey.csv: -------------------------------------------------------------------------------- 1 | 619,dyer,rad,9.82 2 | 619,dyer,sal,0.13 3 | 622,dyer,rad,7.8 4 | 622,dyer,sal,0.09 5 | 734,pb,rad,8.41 6 | 734,lake,sal,0.05 7 | 734,pb,temp,-21.5 8 | 735,pb,rad,7.22 9 | 735,,sal,0.06 10 | 735,,temp,-26.0 11 | 751,pb,rad,4.35 12 | 751,pb,temp,-18.5 13 | 751,lake,sal,0.1 14 | 752,lake,rad,2.19 15 | 752,lake,sal,0.09 16 | 752,lake,temp,-16.0 17 | 752,roe,sal,41.6 18 | 837,lake,rad,1.46 19 | 837,lake,sal,0.21 20 | 837,roe,sal,22.5 21 | 844,roe,rad,11.25 22 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Contributor Code of Conduct" 3 | --- 4 | 5 | As contributors and maintainers of this project, 6 | we pledge to follow the [The Carpentries Code of Conduct][coc]. 7 | 8 | Instances of abusive, harassing, or otherwise unacceptable behavior 9 | may be reported by following our [reporting guidelines][coc-reporting]. 
10 | 11 | 12 | [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html 13 | [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html 14 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | insert_final_newline = true 6 | trim_trailing_whitespace = true 7 | 8 | [*.md] 9 | indent_size = 2 10 | indent_style = space 11 | max_line_length = 100 # Please keep this in sync with bin/lesson_check.py! 12 | trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (<br/>) 13 | 14 | [*.r] 15 | max_line_length = 80 16 | 17 | [*.py] 18 | indent_size = 4 19 | indent_style = space 20 | max_line_length = 79 21 | 22 | [*.sh] 23 | end_of_line = lf 24 | 25 | [Makefile] 26 | indent_style = tab 27 | -------------------------------------------------------------------------------- /.github/workflows/pr-close-signal.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Send Close Pull Request Signal" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [closed] 7 | 8 | jobs: 9 | send-close-signal: 10 | name: "Send closing signal" 11 | runs-on: ubuntu-22.04 12 | if: ${{ github.event.action == 'closed' }} 13 | steps: 14 | - name: "Create PRtifact" 15 | run: | 16 | mkdir -p ./pr 17 | printf ${{ github.event.number }} > ./pr/NUM 18 | - name: Upload Diff 19 | uses: actions/upload-artifact@v4 20 | with: 21 | name: pr 22 | path: ./pr 23 | -------------------------------------------------------------------------------- /AUTHORS: -------------------------------------------------------------------------------- 1 | Paula Andrea 2 | Pauline Barmby 3 | Karl Broman 4 | Amy Brown 5 | Abigail Cabunoc Mayes 6 | Daniel Chen 7 | Liam Clark 8 | Peter Cock 9 | Matthew Collins 10 | Harrison Dekker 11 | Deborah Digges 12 | Stevan Earl 13 | Dirk Eddelbuettel 14 | Ivan Gonzalez 15 | John Gosset 16 | Thomas Guignard 17 | Jonathan Guyer 18 | Kate Hertweck 19 | Nick James 20 | Luke W. Johnston 21 | W. Trevor King 22 | Andrew Kubiak 23 | Avishek Kumar 24 | Peter Li 25 | Tobin Magle 26 | Sue McClatchy 27 | Sheldon McKay 28 | James Mickley 29 | John R.
Moreau 30 | Joshua Nahum 31 | Maneesha Sane 32 | Raniere Silva 33 | Luc Small 34 | Donald Speer 35 | Mark Stacy 36 | Jeff Stafford 37 | Scott Talafuse 38 | Morgan Taschuk 39 | Chris Tomlinson 40 | Ioan Vancea 41 | Ben Waugh 42 | Ethan White 43 | Greg Wilson 44 | Donny Winston 45 | -------------------------------------------------------------------------------- /bin/create-db.sql: -------------------------------------------------------------------------------- 1 | -- Create database to be used for learners. 2 | -- The data for the database are available as CSV files. 3 | -- For more information, see https://www.sqlite.org/cli.html#csv 4 | 5 | -- Generate tables. 6 | create table Person (id text, personal text, family text); 7 | create table Site (name text, lat real, long real); 8 | create table Visited (id text, site text, dated text); 9 | create table Survey (taken integer, person text, quant text, reading real); 10 | 11 | -- Import data. 12 | .mode csv 13 | .import data/person.csv Person 14 | .import data/site.csv Site 15 | .import data/survey.csv Survey 16 | .import data/visited.csv Visited 17 | 18 | -- Convert empty strings to NULLs. 
19 | UPDATE Visited SET dated = null WHERE dated = ''; 20 | UPDATE Survey SET person = null WHERE person = ''; 21 | -------------------------------------------------------------------------------- /.github/workflows/pr-post-remove-branch.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Remove Temporary PR Branch" 2 | 3 | on: 4 | workflow_run: 5 | workflows: ["Bot: Send Close Pull Request Signal"] 6 | types: 7 | - completed 8 | 9 | jobs: 10 | delete: 11 | name: "Delete branch from Pull Request" 12 | runs-on: ubuntu-22.04 13 | if: > 14 | github.event.workflow_run.event == 'pull_request' && 15 | github.event.workflow_run.conclusion == 'success' 16 | permissions: 17 | contents: write 18 | steps: 19 | - name: 'Download artifact' 20 | uses: carpentries/actions/download-workflow-artifact@main 21 | with: 22 | run: ${{ github.event.workflow_run.id }} 23 | name: pr 24 | - name: "Get PR Number" 25 | id: get-pr 26 | run: | 27 | unzip pr.zip 28 | echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT 29 | - name: 'Remove branch' 30 | uses: carpentries/actions/remove-branch@main 31 | with: 32 | pr: ${{ steps.get-pr.outputs.NUM }} 33 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # sandpaper files 2 | episodes/*html 3 | site/* 4 | !site/README.md 5 | 6 | # History files 7 | .Rhistory 8 | .Rapp.history 9 | # Session Data files 10 | .RData 11 | # User-specific files 12 | .Ruserdata 13 | # Example code in package build process 14 | *-Ex.R 15 | # Output files from R CMD build 16 | /*.tar.gz 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | # RStudio files 20 | .Rproj.user/ 21 | # produced vignettes 22 | vignettes/*.html 23 | vignettes/*.pdf 24 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 25 | .httr-oauth 26 | # knitr and R markdown default cache directories 27 | *_cache/ 28 | 
/cache/ 29 | # Temporary files created by R markdown 30 | *.utf8.md 31 | *.knit.md 32 | # R Environment Variables 33 | .Renviron 34 | # pkgdown site 35 | docs/ 36 | # translation temp files 37 | po/*~ 38 | # renv detritus 39 | renv/sandbox/ 40 | *.pyc 41 | *~ 42 | .DS_Store 43 | .ipynb_checkpoints 44 | .sass-cache 45 | .jekyll-cache/ 46 | .jekyll-metadata 47 | __pycache__ 48 | _site 49 | .Rproj.user 50 | .bundle/ 51 | .vendor/ 52 | vendor/ 53 | .docker-vendor/ 54 | Gemfile.lock 55 | .*history 56 | -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "contributors": [ 3 | { 4 | "type": "Editor", 5 | "name": "Henry Senyondo", 6 | "orcid": "0000-0001-7105-5808" 7 | } 8 | ], 9 | "creators": [ 10 | { 11 | "name": "Henry Senyondo", 12 | "orcid": "0000-0001-7105-5808" 13 | }, 14 | { 15 | "name": "James Scott-Brown" 16 | }, 17 | { 18 | "name": "Dan Michael Heggø" 19 | }, 20 | { 21 | "name": "Kyrre Traavik Låberg" 22 | }, 23 | { 24 | "name": "Colin Sauze" 25 | }, 26 | { 27 | "name": "Mengzhen Sun" 28 | }, 29 | { 30 | "name": "Simon Willison" 31 | }, 32 | { 33 | "name": "Jenna Jordan", 34 | "orcid": "0000-0001-9246-5355" 35 | }, 36 | { 37 | "name": "Mohammed Tanash", 38 | "orcid": "0000-0002-2877-5735" 39 | }, 40 | { 41 | "name": "Peter Aronoff" 42 | }, 43 | { 44 | "name": "Samuel Lelièvre", 45 | "orcid": "0000-0002-7275-0965" 46 | }, 47 | { 48 | "name": "Wilfred Tyler Gee" 49 | }, 50 | { 51 | "name": "Andrew Jerrison" 52 | } 53 | ], 54 | "license": { 55 | "id": "CC-BY-4.0" 56 | } 57 | } -------------------------------------------------------------------------------- /.mailmap: -------------------------------------------------------------------------------- 1 | Abigail Cabunoc Mayes 2 | Abigail Cabunoc Mayes 3 | Chris Tomlinson 4 | Deborah Digges 5 | Donald Speer 6 | François Michonneau 7 | Greg Wilson 8 | James Allen 9 | John R. 
Moreau 10 | Liam Clark 11 | Luke W. Johnston 12 | Maneesha Sane 13 | Mike Jackson 14 | Nick James 15 | Pauline Barmby 16 | Raniere Silva 17 | Raniere Silva 18 | Rémi Emonet 19 | Rémi Emonet 20 | Sheldon McKay 21 | Sue McClatchy 22 | Timothée Poisot 23 | Tobin Magle 24 | -------------------------------------------------------------------------------- /.github/workflows/pr-preflight.yaml: -------------------------------------------------------------------------------- 1 | name: "Pull Request Preflight Check" 2 | 3 | on: 4 | pull_request_target: 5 | branches: 6 | ["main"] 7 | types: 8 | ["opened", "synchronize", "reopened"] 9 | 10 | jobs: 11 | test-pr: 12 | name: "Test if pull request is valid" 13 | if: ${{ github.event.action != 'closed' }} 14 | runs-on: ubuntu-22.04 15 | outputs: 16 | is_valid: ${{ steps.check-pr.outputs.VALID }} 17 | permissions: 18 | pull-requests: write 19 | steps: 20 | - name: "Get Invalid Hashes File" 21 | id: hash 22 | run: | 23 | echo "json<> $GITHUB_OUTPUT 26 | - name: "Check PR" 27 | id: check-pr 28 | uses: carpentries/actions/check-valid-pr@main 29 | with: 30 | pr: ${{ github.event.number }} 31 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 32 | fail_on_error: true 33 | - name: "Comment result of validation" 34 | id: comment-diff 35 | if: ${{ always() }} 36 | uses: carpentries/actions/comment-diff@main 37 | with: 38 | pr: ${{ github.event.number }} 39 | body: ${{ steps.check-pr.outputs.MSG }} 40 | -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | site: sandpaper::sandpaper_site 3 | --- 4 | 5 | In the late 1920s and early 1930s, 6 | William Dyer, 7 | Frank Pabodie, 8 | and Valentina Roerich led expeditions to the 9 | [Pole of Inaccessibility](https://en.wikipedia.org/wiki/Pole_of_inaccessibility) 10 | in the South Pacific, 11 | and then onward to Antarctica. 
12 | Two years ago, 13 | the records of their expeditions were found in a storage locker at Miskatonic University. 14 | We have scanned and OCR'd the data they contain, 15 | and we now want to store that information 16 | in a way that will make search and analysis easy. 17 | 18 | Three common options for storage are 19 | text files, 20 | spreadsheets, 21 | and databases. 22 | Text files are easiest to create, 23 | and work well with version control, 24 | but then we would have to build search and analysis tools ourselves. 25 | Spreadsheets are good for doing simple analyses, 26 | but they don't handle large or complex data sets well. 27 | Databases, however, include powerful tools for search and analysis, 28 | and can handle large, complex data sets. 29 | These lessons will show how to use a database to explore the expeditions' data. 30 | 31 | :::::::::::::::::::::::::::::::::::::::::: prereq 32 | 33 | ## Prerequisites 34 | 35 | - This lesson requires the Unix shell, plus [SQLite3](https://www.sqlite.org/) or [DB Browser for SQLite](https://sqlitebrowser.org/). 36 | - Please download the database we will use: [survey.db](episodes/files/survey.db) 37 | 38 | 39 | :::::::::::::::::::::::::::::::::::::::::::::::::: 40 | 41 | 42 | -------------------------------------------------------------------------------- /learners/setup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Setup 3 | --- 4 | 5 | # Software 6 | 7 | For this course you will need the UNIX shell (described in the [UNIX Shell lesson](https://swcarpentry.github.io/shell-novice/#install-software)), plus [SQLite3](https://www.sqlite.org/) or 8 | [DB Browser for SQLite](https://sqlitebrowser.org/). 9 | 10 | If you are running **macOS**, you should already have SQLite installed. You can run `sqlite3 --version` 11 | in a terminal to confirm that it is available. You can also download DB Browser for SQLite from 12 | [their website](https://sqlitebrowser.org/dl/).
13 | 14 | If you are running **Linux**, you may already have SQLite3 installed; use the command 15 | `which sqlite3` to see the path of the program. Otherwise, you should be able to get it 16 | from your package manager (on Debian/Ubuntu, you can use the command `apt install sqlite3`). 17 | 18 | If you are running **Windows**, download the installers and run them as administrator. 19 | Make sure you select the right installer version for your system. 20 | If the installer asks whether to add the path to the environment variables, check yes; otherwise you will have to add the path of the executable to the `PATH` environment variable manually. 21 | This path tells the system where to find the executable program. 22 | 23 | If installing SQLite3 using Anaconda, refer to the [anaconda sqlite docs](https://anaconda.org/anaconda/sqlite). 24 | 25 | After installing and setting the paths, close the terminal and open a new one 26 | so that the updated paths and configurations are loaded. 27 | 28 | # Files 29 | 30 | Please download the database we'll be using: [survey.db](files/survey.db) 31 | 32 | 33 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-main.yaml: -------------------------------------------------------------------------------- 1 | name: "01 Build and Deploy Site" 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | schedule: 9 | - cron: '0 0 * * 2' 10 | workflow_dispatch: 11 | inputs: 12 | name: 13 | description: 'Who triggered this build?'
14 | required: true 15 | default: 'Maintainer (via GitHub)' 16 | reset: 17 | description: 'Reset cached markdown files' 18 | required: false 19 | default: false 20 | type: boolean 21 | jobs: 22 | full-build: 23 | name: "Build Full Site" 24 | 25 | # 2024-10-01: ubuntu-latest is now 24.04 and R is not installed by default in the runner image 26 | # pin to 22.04 for now 27 | runs-on: ubuntu-22.04 28 | permissions: 29 | checks: write 30 | contents: write 31 | pages: write 32 | env: 33 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 34 | RENV_PATHS_ROOT: ~/.local/share/renv/ 35 | steps: 36 | 37 | - name: "Checkout Lesson" 38 | uses: actions/checkout@v4 39 | 40 | - name: "Set up R" 41 | uses: r-lib/actions/setup-r@v2 42 | with: 43 | use-public-rspm: true 44 | install-r: false 45 | 46 | - name: "Set up Pandoc" 47 | uses: r-lib/actions/setup-pandoc@v2 48 | 49 | - name: "Setup Lesson Engine" 50 | uses: carpentries/actions/setup-sandpaper@main 51 | with: 52 | cache-version: ${{ secrets.CACHE_VERSION }} 53 | 54 | - name: "Setup Package Cache" 55 | uses: carpentries/actions/setup-lesson-deps@main 56 | with: 57 | cache-version: ${{ secrets.CACHE_VERSION }} 58 | 59 | - name: "Deploy Site" 60 | run: | 61 | reset <- "${{ github.event.inputs.reset }}" == "true" 62 | sandpaper::package_cache_trigger(TRUE) 63 | sandpaper:::ci_deploy(reset = reset) 64 | shell: Rscript {0} 65 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![GitHub release][shields_release]][swc_sql_novice_survey_releases] 2 | [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://slack-invite.carpentries.org/) 3 | [![Slack Status](https://img.shields.io/badge/Slack_Channel-swc--sql-E01563.svg)](https://carpentries.slack.com/messages/C9X3YNVNY) 4 | [![DOI][zenodo_badge_DOI]][all_releases_DOI] 5 | 6 | # sql-novice-survey 7 | 8 | 
An introduction to databases and SQL using Antarctic survey data. 9 | Please see [https://swcarpentry.github.io/sql-novice-survey/](https://swcarpentry.github.io/sql-novice-survey/) for a rendered version of this material, 10 | [the lesson template documentation][lesson-example] 11 | for instructions on formatting, building, and submitting material, 12 | or run `make` in this directory for a list of helpful commands. 13 | 14 | ## Authors 15 | 16 | A list of contributors to the lesson can be found in [AUTHORS](AUTHORS). 17 | 18 | ## License 19 | 20 | Instructional material from this lesson is made available under the Creative 21 | Commons Attribution (CC BY 4.0) license. Except where otherwise noted, example 22 | programs and software included as part of this lesson are made available under 23 | the MIT license. For more information, see [LICENSE.md](LICENSE.md). 24 | 25 | ## Citation 26 | 27 | To cite this lesson, please consult [CITATION](CITATION). 28 | To use a particular version's DOI, refer to [all Zenodo released versions][all_zenodo_versions]. 29 | 30 | Maintainer(s): 31 | 32 | - [Henry Senyondo](https://carpentries.org/instructors/#henrykironde) 33 | - [Kellie Templeman](https://github.com/kltempleman) 34 | - [Novica Nakov](https://github.com/novica) 35 | 36 | [swc_sql_novice_survey_releases]: https://github.com/swcarpentry/sql-novice-survey/releases 37 | [shields_release]: https://img.shields.io/github/v/release/swcarpentry/sql-novice-survey 38 | [all_releases_DOI]: https://doi.org/10.5281/zenodo.3265270 39 | [zenodo_badge_DOI]: https://zenodo.org/badge/DOI/10.5281/zenodo.3265271.svg 40 | [lesson-example]: https://carpentries.github.io/lesson-example/ 41 | [all_zenodo_versions]: https://zenodo.org/search?page=1&size=20&q=3265271&sort=-version&all_versions=True 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /.github/workflows/update-workflows.yaml:
-------------------------------------------------------------------------------- 1 | name: "02 Maintain: Update Workflow Files" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'weekly run' 10 | clean: 11 | description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)' 12 | required: false 13 | default: '.yaml' 14 | schedule: 15 | # Run every Tuesday 16 | - cron: '0 0 * * 2' 17 | 18 | jobs: 19 | check_token: 20 | name: "Check SANDPAPER_WORKFLOW token" 21 | runs-on: ubuntu-22.04 22 | outputs: 23 | workflow: ${{ steps.validate.outputs.wf }} 24 | repo: ${{ steps.validate.outputs.repo }} 25 | steps: 26 | - name: "validate token" 27 | id: validate 28 | uses: carpentries/actions/check-valid-credentials@main 29 | with: 30 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 31 | 32 | update_workflow: 33 | name: "Update Workflow" 34 | runs-on: ubuntu-22.04 35 | needs: check_token 36 | if: ${{ needs.check_token.outputs.workflow == 'true' }} 37 | steps: 38 | - name: "Checkout Repository" 39 | uses: actions/checkout@v4 40 | 41 | - name: Update Workflows 42 | id: update 43 | uses: carpentries/actions/update-workflows@main 44 | with: 45 | clean: ${{ github.event.inputs.clean }} 46 | 47 | - name: Create Pull Request 48 | id: cpr 49 | if: "${{ steps.update.outputs.new }}" 50 | uses: carpentries/create-pull-request@main 51 | with: 52 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 53 | delete-branch: true 54 | branch: "update/workflows" 55 | commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}" 56 | title: "Update Workflows to Version ${{ steps.update.outputs.new }}" 57 | body: | 58 | :robot: This is an automated build 59 | 60 | Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }} 61 | 62 | - Auto-generated by [create-pull-request][1] on ${{ 
steps.update.outputs.date }} 63 | 64 | [1]: https://github.com/carpentries/create-pull-request/tree/main 65 | labels: "type: template and tools" 66 | draft: false 67 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | #------------------------------------------------------------ 2 | # Values for this lesson. 3 | #------------------------------------------------------------ 4 | 5 | # Which carpentry is this (swc, dc, lc, or cp)? 6 | # swc: Software Carpentry 7 | # dc: Data Carpentry 8 | # lc: Library Carpentry 9 | # cp: Carpentries (to use for instructor training for instance) 10 | # incubator: The Carpentries Incubator 11 | carpentry: 'swc' 12 | 13 | # Overall title for pages. 14 | title: 'Databases and SQL' 15 | 16 | # Date the lesson was created (YYYY-MM-DD, this is empty by default) 17 | created: '2014-11-21' 18 | 19 | # Comma-separated list of keywords for the lesson 20 | keywords: 'software, data, lesson, The Carpentries' 21 | 22 | # Life cycle stage of the lesson 23 | # possible values: pre-alpha, alpha, beta, stable 24 | life_cycle: 'stable' 25 | 26 | # License of the lesson materials (recommended CC-BY 4.0) 27 | license: 'CC-BY 4.0' 28 | 29 | # Link to the source repository for this lesson 30 | source: 'https://github.com/swcarpentry/sql-novice-survey' 31 | 32 | # Default branch of your lesson 33 | branch: 'main' 34 | 35 | # Who to contact if there are any issues 36 | contact: 'team@carpentries.org' 37 | 38 | # Navigation ------------------------------------------------ 39 | # 40 | # Use the following menu items to specify the order of 41 | # individual pages in each dropdown section. Leave blank to 42 | # include all pages in the folder. 
43 | # 44 | # Example ------------- 45 | # 46 | # episodes: 47 | # - introduction.md 48 | # - first-steps.md 49 | # 50 | # learners: 51 | # - setup.md 52 | # 53 | # instructors: 54 | # - instructor-notes.md 55 | # 56 | # profiles: 57 | # - one-learner.md 58 | # - another-learner.md 59 | 60 | # Order of episodes in your lesson 61 | episodes: 62 | - 01-select.md 63 | - 02-sort-dup.md 64 | - 03-filter.md 65 | - 04-calc.md 66 | - 05-null.md 67 | - 06-agg.md 68 | - 07-join.md 69 | - 08-hygiene.md 70 | - 09-create.md 71 | - 10-prog.md 72 | - 11-prog-R.md 73 | 74 | # Information for Learners 75 | learners: 76 | 77 | # Information for Instructors 78 | instructors: 79 | 80 | # Learner Profiles 81 | profiles: 82 | 83 | # Customisation --------------------------------------------- 84 | # 85 | # This space below is where custom yaml items (e.g. pinning 86 | # sandpaper and varnish versions) should live 87 | 88 | 89 | url: 'https://swcarpentry.github.io/sql-novice-survey' 90 | analytics: carpentries 91 | lang: en 92 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Licenses" 3 | --- 4 | 5 | ## Instructional Material 6 | 7 | All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) 8 | instructional material is made available under the [Creative Commons 9 | Attribution license][cc-by-human]. The following is a human-readable summary of 10 | (and not a substitute for) the [full legal text of the CC BY 4.0 11 | license][cc-by-legal]. 12 | 13 | You are free: 14 | 15 | - to **Share**---copy and redistribute the material in any medium or format 16 | - to **Adapt**---remix, transform, and build upon the material 17 | 18 | for any purpose, even commercially. 19 | 20 | The licensor cannot revoke these freedoms as long as you follow the license 21 | terms. 
22 | 23 | Under the following terms: 24 | 25 | - **Attribution**---You must give appropriate credit (mentioning that your work 26 | is derived from work that is Copyright (c) The Carpentries and, where 27 | practical, linking to <https://carpentries.org/>), provide a [link to the 28 | license][cc-by-human], and indicate if changes were made. You may do so in 29 | any reasonable manner, but not in any way that suggests the licensor endorses 30 | you or your use. 31 | 32 | - **No additional restrictions**---You may not apply legal terms or 33 | technological measures that legally restrict others from doing anything the 34 | license permits. With the understanding that: 35 | 36 | Notices: 37 | 38 | * You do not have to comply with the license for elements of the material in 39 | the public domain or where your use is permitted by an applicable exception 40 | or limitation. 41 | * No warranties are given. The license may not give you all of the permissions 42 | necessary for your intended use. For example, other rights such as publicity, 43 | privacy, or moral rights may limit how you use the material. 44 | 45 | ## Software 46 | 47 | Except where otherwise noted, the example programs and other software provided 48 | by The Carpentries are made available under the [OSI][osi]-approved [MIT 49 | license][mit-license]. 50 | 51 | Permission is hereby granted, free of charge, to any person obtaining a copy of 52 | this software and associated documentation files (the "Software"), to deal in 53 | the Software without restriction, including without limitation the rights to 54 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 55 | of the Software, and to permit persons to whom the Software is furnished to do 56 | so, subject to the following conditions: 57 | 58 | The above copyright notice and this permission notice shall be included in all 59 | copies or substantial portions of the Software.
60 | 61 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 62 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 63 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 64 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 65 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 66 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 67 | SOFTWARE. 68 | 69 | ## Trademark 70 | 71 | "The Carpentries", "Software Carpentry", "Data Carpentry", and "Library 72 | Carpentry" and their respective logos are registered trademarks of 73 | [The Carpentries, Inc.][carpentries]. 74 | 75 | [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ 76 | [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode 77 | [mit-license]: https://opensource.org/licenses/mit-license.html 78 | [carpentries]: https://carpentries.org 79 | [osi]: https://opensource.org 80 | -------------------------------------------------------------------------------- /learners/reference.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Glossary' 3 | --- 4 | 5 | ## Glossary 6 | 7 | [aggregation function]{#aggregation-function} 8 | : A function that combines multiple values to produce a single new value (e.g. sum, mean, median). 9 | 10 | [atomic]{#atomic} 11 | : Describes a value *not* divisible into parts that one might want to 12 | work with separately. For example, if one wanted to work with 13 | first and last names separately, the values "Ada" and "Lovelace" 14 | would be atomic, but the value "Ada Lovelace" would not. 15 | 16 | [cascading delete]{#cascading-delete} 17 | : An [SQL](#sql) constraint requiring that if a given [record](#record) is deleted, 18 | all records referencing it (via [foreign key](#foreign-key)) in other [tables](#table) 19 | must also be deleted. 
20 | 21 | [case insensitive]{#case-insensitive} 22 | : Treating text as if upper and lower case characters were the same. 23 | See also: [case sensitive](#case-sensitive). 24 | 25 | [case sensitive]{#case-sensitive} 26 | : Treating upper and lower case characters as different. See also: [case insensitive](#case-insensitive). 27 | 28 | [comma-separated values (CSV)]{#comma-separated-values-csv} 29 | : A common textual representation for tables in which the values in each row are separated by commas. 30 | 31 | [cross product]{#cross-product} 32 | : A pairing of all elements of one set with all elements of another. 33 | 34 | [cursor]{#cursor} 35 | : A pointer into a database that keeps track of outstanding operations. 36 | 37 | [database manager]{#database-manager} 38 | : A program that manages a database, such as SQLite. 39 | 40 | [fields]{#fields} 41 | : A set of data values of a particular type, one for each [record](#record) in a [table](#table). 42 | 43 | [filter]{#filter} 44 | : To select only the records that meet certain conditions. 45 | 46 | [foreign key]{#foreign-key} 47 | : One or more values in a [database table](#table) that identify 48 | [records](#record) in another table. 49 | 50 | [prepared statement]{#prepared-statement} 51 | : A template for an [SQL](#sql) query in which some values can be filled in. 52 | 53 | [primary key]{#primary-key} 54 | : One or more [fields](#fields) in a [database table](#table) whose values are 55 | guaranteed to be unique for each [record](#record), i.e., whose values 56 | uniquely identify the entry. 57 | 58 | [query]{#query} 59 | : A textual description of a database operation. Queries are expressed in 60 | a special-purpose language called [SQL](#sql), and despite the name "query", 61 | they may modify or delete data as well as interrogate it. 62 | 63 | [record]{#record} 64 | : A set of related values making up a single entry in a [database table](#table), 65 | typically shown as a row. See also: [fields](#fields). 
66 | 67 | [referential integrity]{#referential-integrity} 68 | : The internal consistency of values in a database. If an entry in one table 69 | contains a [foreign key](#foreign-key), but the corresponding [records](#record) 70 | don't exist, referential integrity has been violated. 71 | 72 | [relational database]{#relational-database} 73 | : A collection of data organized into [tables](#table). 74 | 75 | [sentinel value]{#sentinel-value} 76 | : A value in a collection that has a special meaning, such as 999 to mean "age unknown". 77 | 78 | [SQL]{#sql} 79 | : A special-purpose language for describing operations on [relational databases](#relational-database). 80 | 81 | [SQL injection attack]{#sql-injection-attack} 82 | : An attack on a program in which the user's input contains malicious SQL statements. 83 | If this text is copied directly into an SQL statement, it will be executed in the database. 84 | 85 | [table]{#table} 86 | : A set of data in a [relational database](#relational-database) organized into a set 87 | of [records](#record), each having the same named [fields](#fields). 88 | 89 | [wildcard]{#wildcard} 90 | : A character used in pattern matching. In SQL's `like` operator, the wildcard "%" 91 | matches zero or more characters, so that `%able%` matches "fixable" and "tablets". 
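Several of these glossary terms (prepared statement, SQL injection, wildcard, primary key) can be illustrated together with a short sketch using Python's standard-library `sqlite3` module, which the lesson's programming episode also uses. The in-memory table mirrors the lesson's `Person` table; the snippet is an illustration, not part of the lesson's database setup.

```python
import sqlite3

# In-memory scratch database mirroring the lesson's Person table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id TEXT PRIMARY KEY, personal TEXT, family TEXT)")
conn.execute("INSERT INTO Person VALUES ('dyer', 'William', 'Dyer')")

# Prepared statement: the "?" placeholder is filled in by the database
# driver, so user input is treated as data, never as SQL (no injection).
family = "Dyer"
rows = conn.execute("SELECT id FROM Person WHERE family = ?", (family,)).fetchall()
print(rows)  # [('dyer',)]

# Wildcard: in LIKE patterns, "%" matches zero or more characters.
rows = conn.execute("SELECT family FROM Person WHERE family LIKE 'Dy%'").fetchall()
print(rows)  # [('Dyer',)]
```

Passing values through placeholders rather than pasting them into the query string is exactly what distinguishes a prepared statement from the injection-prone alternative.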
92 | 93 | 94 | -------------------------------------------------------------------------------- /.github/workflows/pr-receive.yaml: -------------------------------------------------------------------------------- 1 | name: "Receive Pull Request" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [opened, synchronize, reopened] 7 | 8 | concurrency: 9 | group: ${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | test-pr: 14 | name: "Record PR number" 15 | if: ${{ github.event.action != 'closed' }} 16 | runs-on: ubuntu-22.04 17 | outputs: 18 | is_valid: ${{ steps.check-pr.outputs.VALID }} 19 | steps: 20 | - name: "Record PR number" 21 | id: record 22 | if: ${{ always() }} 23 | run: | 24 | echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR 25 | - name: "Upload PR number" 26 | id: upload 27 | if: ${{ always() }} 28 | uses: actions/upload-artifact@v4 29 | with: 30 | name: pr 31 | path: ${{ github.workspace }}/NR 32 | - name: "Get Invalid Hashes File" 33 | id: hash 34 | run: | 35 | echo "json<<EOF" >> $GITHUB_OUTPUT 36 | curl "https://files.carpentries.org/invalid-hashes.json" >> $GITHUB_OUTPUT 37 | echo "EOF" >> $GITHUB_OUTPUT 38 | - name: "echo output" 39 | run: | 40 | echo "${{ steps.hash.outputs.json }}" 41 | - name: "Check PR" 42 | id: check-pr 43 | uses: carpentries/actions/check-valid-pr@main 44 | with: 45 | pr: ${{ github.event.number }} 46 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 47 | 48 | build-md-source: 49 | name: "Build markdown source files if valid" 50 | needs: test-pr 51 | runs-on: ubuntu-22.04 52 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 53 | env: 54 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 55 | RENV_PATHS_ROOT: ~/.local/share/renv/ 56 | CHIVE: ${{ github.workspace }}/site/chive 57 | PR: ${{ github.workspace }}/site/pr 58 | MD: ${{ github.workspace }}/site/built 59 | steps: 60 | - name: "Check Out Main Branch" 61 | uses: actions/checkout@v4 62 | 63 | - name: "Check Out Staging Branch" 64 | uses: actions/checkout@v4 65 | with: 66 | ref: md-outputs 67 | path: ${{ env.MD
}} 68 | 69 | - name: "Set up R" 70 | uses: r-lib/actions/setup-r@v2 71 | with: 72 | use-public-rspm: true 73 | install-r: false 74 | 75 | - name: "Set up Pandoc" 76 | uses: r-lib/actions/setup-pandoc@v2 77 | 78 | - name: "Setup Lesson Engine" 79 | uses: carpentries/actions/setup-sandpaper@main 80 | with: 81 | cache-version: ${{ secrets.CACHE_VERSION }} 82 | 83 | - name: "Setup Package Cache" 84 | uses: carpentries/actions/setup-lesson-deps@main 85 | with: 86 | cache-version: ${{ secrets.CACHE_VERSION }} 87 | 88 | - name: "Validate and Build Markdown" 89 | id: build-site 90 | run: | 91 | sandpaper::package_cache_trigger(TRUE) 92 | sandpaper::validate_lesson(path = '${{ github.workspace }}') 93 | sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE) 94 | shell: Rscript {0} 95 | 96 | - name: "Generate Artifacts" 97 | id: generate-artifacts 98 | run: | 99 | sandpaper:::ci_bundle_pr_artifacts( 100 | repo = '${{ github.repository }}', 101 | pr_number = '${{ github.event.number }}', 102 | path_md = '${{ env.MD }}', 103 | path_pr = '${{ env.PR }}', 104 | path_archive = '${{ env.CHIVE }}', 105 | branch = 'md-outputs' 106 | ) 107 | shell: Rscript {0} 108 | 109 | - name: "Upload PR" 110 | uses: actions/upload-artifact@v4 111 | with: 112 | name: pr 113 | path: ${{ env.PR }} 114 | overwrite: true 115 | 116 | - name: "Upload Diff" 117 | uses: actions/upload-artifact@v4 118 | with: 119 | name: diff 120 | path: ${{ env.CHIVE }} 121 | retention-days: 1 122 | 123 | - name: "Upload Build" 124 | uses: actions/upload-artifact@v4 125 | with: 126 | name: built 127 | path: ${{ env.MD }} 128 | retention-days: 1 129 | 130 | - name: "Teardown" 131 | run: sandpaper::reset_site() 132 | shell: Rscript {0} 133 | -------------------------------------------------------------------------------- /.github/workflows/update-cache.yaml: -------------------------------------------------------------------------------- 1 | name: "03 Maintain: Update Package Cache" 2 | 3 | on: 4 | 
workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'monthly run' 10 | schedule: 11 | # Run every tuesday 12 | - cron: '0 0 * * 2' 13 | 14 | jobs: 15 | preflight: 16 | name: "Preflight Check" 17 | runs-on: ubuntu-22.04 18 | outputs: 19 | ok: ${{ steps.check.outputs.ok }} 20 | steps: 21 | - id: check 22 | run: | 23 | if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then 24 | echo "ok=true" >> $GITHUB_OUTPUT 25 | echo "Running on request" 26 | # using single brackets here to avoid 08 being interpreted as octal 27 | # https://github.com/carpentries/sandpaper/issues/250 28 | elif [ `date +%d` -le 7 ]; then 29 | # If the Tuesday lands in the first week of the month, run it 30 | echo "ok=true" >> $GITHUB_OUTPUT 31 | echo "Running on schedule" 32 | else 33 | echo "ok=false" >> $GITHUB_OUTPUT 34 | echo "Not Running Today" 35 | fi 36 | 37 | check_renv: 38 | name: "Check if We Need {renv}" 39 | runs-on: ubuntu-22.04 40 | needs: preflight 41 | if: ${{ needs.preflight.outputs.ok == 'true'}} 42 | outputs: 43 | needed: ${{ steps.renv.outputs.exists }} 44 | steps: 45 | - name: "Checkout Lesson" 46 | uses: actions/checkout@v4 47 | - id: renv 48 | run: | 49 | if [[ -d renv ]]; then 50 | echo "exists=true" >> $GITHUB_OUTPUT 51 | fi 52 | 53 | check_token: 54 | name: "Check SANDPAPER_WORKFLOW token" 55 | runs-on: ubuntu-22.04 56 | needs: check_renv 57 | if: ${{ needs.check_renv.outputs.needed == 'true' }} 58 | outputs: 59 | workflow: ${{ steps.validate.outputs.wf }} 60 | repo: ${{ steps.validate.outputs.repo }} 61 | steps: 62 | - name: "validate token" 63 | id: validate 64 | uses: carpentries/actions/check-valid-credentials@main 65 | with: 66 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 67 | 68 | update_cache: 69 | name: "Update Package Cache" 70 | needs: check_token 71 | if: ${{ needs.check_token.outputs.repo== 'true' }} 72 | runs-on: ubuntu-22.04 73 | env: 74 | 
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 75 | RENV_PATHS_ROOT: ~/.local/share/renv/ 76 | steps: 77 | 78 | - name: "Checkout Lesson" 79 | uses: actions/checkout@v4 80 | 81 | - name: "Set up R" 82 | uses: r-lib/actions/setup-r@v2 83 | with: 84 | use-public-rspm: true 85 | install-r: false 86 | 87 | - name: "Update {renv} deps and determine if a PR is needed" 88 | id: update 89 | uses: carpentries/actions/update-lockfile@main 90 | with: 91 | cache-version: ${{ secrets.CACHE_VERSION }} 92 | 93 | - name: Create Pull Request 94 | id: cpr 95 | if: ${{ steps.update.outputs.n > 0 }} 96 | uses: carpentries/create-pull-request@main 97 | with: 98 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 99 | delete-branch: true 100 | branch: "update/packages" 101 | commit-message: "[actions] update ${{ steps.update.outputs.n }} packages" 102 | title: "Update ${{ steps.update.outputs.n }} packages" 103 | body: | 104 | :robot: This is an automated build 105 | 106 | This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions: 107 | 108 | ``` 109 | ${{ steps.update.outputs.report }} 110 | ``` 111 | 112 | :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates. 
113 | 114 | If you want to inspect these changes locally, you can use the following code to check out a new branch: 115 | 116 | ```bash 117 | git fetch origin update/packages 118 | git checkout update/packages 119 | ``` 120 | 121 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 122 | 123 | [1]: https://github.com/carpentries/create-pull-request/tree/main 124 | labels: "type: package cache" 125 | draft: false 126 | -------------------------------------------------------------------------------- /instructors/instructor-notes.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Instructor Notes 3 | --- 4 | 5 | > **database** (dā'tə-bās') noun: 6 | > "A collection of data arranged for ease and speed of search and retrieval by a computer" 7 | > 8 | > — The American Heritage® Science Dictionary 9 | > {: .quotation} 10 | 11 | - Three common options for storing data 12 | - Text 13 | - Easy to create, work well with version control 14 | - But then we have to build search and analysis tools ourselves 15 | - Spreadsheets 16 | - Good for simple analyses 17 | - But don't handle large or complex data sets well 18 | - Databases 19 | - Include powerful tools for search and analysis 20 | - Can handle large, complex data sets. 21 | 22 | ## Overall 23 | 24 | Relational databases are not as widely used in science as in business, 25 | but they are still a common way to store large data sets with complex structure. 26 | Even when the data itself isn't in a database, 27 | the metadata could be: 28 | for example, 29 | meteorological data might be stored in files on disk, 30 | but data about when and where observations were made, 31 | data ranges, 32 | and so on could be in a database 33 | to make it easier for scientists to find what they want to. 34 | 35 | - The first few sections (up to "Missing Data") usually go very quickly. 
36 | The pace usually slows down a bit when null values are discussed 37 | mostly because learners have a lot of details to keep straight by this point. 38 | Things *really* slow down during the discussion of joins, 39 | but this is the key idea in the whole lesson: 40 | important ideas like primary keys and referential integrity 41 | only make sense once learners have seen how they're used in joins. 42 | It's worth going over things a couple of times if necessary (with lots of examples). 43 | 44 | - The sections on creating and modifying data, 45 | and programming with databases, 46 | can be dropped if time is short. 47 | Of the two, 48 | people seem to care most about how to add data (which only takes a few minutes to demonstrate). 49 | 50 | - Simple calculations are actually easier to do in a spreadsheet; the 51 | advantages of using a database become clear as soon as filtering 52 | and joins are needed. Instructors may therefore want to show a 53 | spreadsheet with the information from the four database tables 54 | consolidated into a single sheet, and demonstrate what's needed in 55 | both systems to answer questions like, "What was the average 56 | radiation reading in 1931?" 57 | 58 | - Some advanced learners may have heard that NoSQL databases 59 | (i.e., ones that don't use the relational model) 60 | are the next big thing, 61 | and ask why we're not teaching those. 62 | The answers are: 63 | 64 | 1. Relational databases are far more widely used than NoSQL databases. 65 | 2. We have far more experience with relational databases than with any other kind, 66 | so we have a better idea of what to teach and how to teach it. 67 | 3. NoSQL databases are as different from each other as they are from relational databases. 68 | Until a leader emerges, it isn't clear *which* NoSQL database we should teach. 69 | 70 | ## Resources 71 | 72 | - `data/*.csv`: CSV versions of data in sample survey database. 
73 | - `bin/create-db.sql`: generates the survey database used in the examples from the CSV files. 74 | 75 | ## SQLite Setup 76 | 77 | In order to execute the following lessons interactively, 78 | please install SQLite as mentioned in the setup instructions for your workshop. 79 | Then: 80 | 81 | ```bash 82 | $ git clone https://github.com/swcarpentry/sql-novice-survey.git 83 | $ cd sql-novice-survey 84 | ``` 85 | 86 | Next, 87 | create the database that will be used: 88 | 89 | ```bash 90 | $ sqlite3 survey.sqlite '.read bin/create-db.sql' 91 | ``` 92 | 93 | This reads commands from `bin/create-db.sql`, 94 | which sets up the tables and loads data from the CSV files in the `data` directory. 95 | 96 | To run commands interactively, 97 | run SQLite on `survey.sqlite`: 98 | 99 | ```bash 100 | $ sqlite3 survey.sqlite 101 | SQLite version 3.8.5 2014-08-15 22:37:57 102 | Enter ".help" for usage hints. 103 | sqlite> 104 | ``` 105 | 106 | ## Troubleshooting 107 | 108 | The command history and line editing features provided by `readline` are 109 | invaluable with a command-line tool like `sqlite3`. Participants should be 110 | strongly encouraged to start with a simple SQL statement and then use the 111 | up-arrow key to go back and add clauses one at a time, or fix problems, rather 112 | than typing each command from scratch. Unfortunately, on some Linux and Mac OS X 113 | systems participants have found that the arrow keys do not scroll through the 114 | command history as expected.
115 | 116 | A workaround for this is to use the [rlwrap](https://github.com/hanslub42/rlwrap) 117 | (readline wrapper) command when starting SQLite: 118 | 119 | ```bash 120 | $ rlwrap sqlite3 survey.sqlite 121 | ``` 122 | 123 | The `rlwrap` package is available in the standard Fedora repository 124 | (but wasn't needed when I [@benwaugh] taught this) and appears to be 125 | available in [Ubuntu](https://packages.ubuntu.com/precise/rlwrap) too, 126 | and in [OS X using Homebrew](https://news.ycombinator.com/item?id=5087790). 127 | 128 | 129 | -------------------------------------------------------------------------------- /episodes/08-hygiene.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Data Hygiene 3 | teaching: 15 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain what an atomic value is. 10 | - Distinguish between atomic and non-atomic values. 11 | - Explain why every value in a database should be atomic. 12 | - Explain what a primary key is and why every record should have one. 13 | - Identify primary keys in database tables. 14 | - Explain why database entries should not contain redundant information. 15 | - Identify redundant information in databases. 16 | 17 | :::::::::::::::::::::::::::::::::::::::::::::::::: 18 | 19 | :::::::::::::::::::::::::::::::::::::::: questions 20 | 21 | - How should I format data in a database, and why? 22 | 23 | :::::::::::::::::::::::::::::::::::::::::::::::::: 24 | 25 | Now that we have seen how joins work, we can see why the relational 26 | model is so useful and how best to use it. The first rule is that 27 | every value should be [atomic](../learners/reference.md#atomic), i.e., not 28 | contain parts that we might want to work with separately.
We store 29 | personal and family names in separate columns instead of putting the 30 | entire name in one column so that we don't have to use substring 31 | operations to get the name's components. More importantly, we store 32 | the two parts of the name separately because splitting on spaces is 33 | unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel 34 | Steubart". 35 | 36 | The second rule is that every record should have a unique primary key. 37 | This can be a serial number that has no intrinsic meaning, 38 | one of the values in the record (like the `id` field in the `Person` table), 39 | or even a combination of values: 40 | the triple `(taken, person, quant)` from the `Survey` table uniquely identifies every measurement. 41 | 42 | The third rule is that there should be no redundant information. 43 | For example, 44 | we could get rid of the `Site` table and rewrite the `Visited` table like this: 45 | 46 | | id | lat | long | dated | 47 | | -------- | --------- | ---------- | ----------- | 48 | | 619 | \-49.85 | \-128.57 | 1927-02-08 | 49 | | 622 | \-49.85 | \-128.57 | 1927-02-10 | 50 | | 734 | \-47.15 | \-126.72 | 1930-01-07 | 51 | | 735 | \-47.15 | \-126.72 | 1930-01-12 | 52 | | 751 | \-47.15 | \-126.72 | 1930-02-26 | 53 | | 752 | \-47.15 | \-126.72 | \-null- | 54 | | 837 | \-48.87 | \-123.40 | 1932-01-14 | 55 | | 844 | \-49.85 | \-128.57 | 1932-03-22 | 56 | 57 | In fact, 58 | we could use a single table that recorded all the information about each reading in each row, 59 | just as a spreadsheet would. 60 | The problem is that it's very hard to keep data that is organized this way consistent: 61 | if we realize that the date of a particular visit to a particular site is wrong, 62 | we have to change multiple records in the database. 63 | What's worse, 64 | we may have to guess which records to change, 65 | since other sites may also have been visited on that date.
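The cost of that redundancy can be made concrete with a short sketch using Python's built-in `sqlite3` module (the same module used in the later episode on programming with databases). It fills a flattened table like the one shown, then counts how many rows a single coordinate correction has to touch; in the normalized layout the same correction changes one row. The `VisitedFlat` name is invented for this illustration, though the data values come from the lesson's tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Flattened layout: site coordinates are repeated in every visit record.
conn.execute("CREATE TABLE VisitedFlat (id INTEGER PRIMARY KEY, lat REAL, long REAL, dated TEXT)")
conn.executemany("INSERT INTO VisitedFlat VALUES (?, ?, ?, ?)",
                 [(619, -49.85, -128.57, "1927-02-08"),
                  (622, -49.85, -128.57, "1927-02-10"),
                  (844, -49.85, -128.57, "1932-03-22")])
# Correcting one site's latitude must touch every duplicated row.
flat = conn.execute("UPDATE VisitedFlat SET lat = -49.84 WHERE lat = -49.85")
print(flat.rowcount)  # 3

# Normalized layout: coordinates are stored once in Site,
# and Visited refers to the site by its primary key.
conn.execute("CREATE TABLE Site (name TEXT PRIMARY KEY, lat REAL, long REAL)")
conn.execute("CREATE TABLE Visited (id INTEGER PRIMARY KEY, site TEXT REFERENCES Site(name), dated TEXT)")
conn.execute("INSERT INTO Site VALUES ('DR-1', -49.85, -128.57)")
conn.executemany("INSERT INTO Visited VALUES (?, ?, ?)",
                 [(619, "DR-1", "1927-02-08"),
                  (622, "DR-1", "1927-02-10"),
                  (844, "DR-1", "1932-03-22")])
# The same correction is now a single-row change.
norm = conn.execute("UPDATE Site SET lat = -49.84 WHERE name = 'DR-1'")
print(norm.rowcount)  # 1
```

Note that the flattened update also relied on matching the old (wrong) value everywhere, which is exactly the guessing problem described above.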
66 | 67 | The fourth rule is that the units for every value should be stored explicitly. 68 | Our database doesn't do this, 69 | and that's a problem: 70 | Roerich's salinity measurements are several orders of magnitude larger than anyone else's, 71 | but we don't know if that means she was using parts per million instead of parts per thousand, 72 | or whether there actually was a saline anomaly at that site in 1932. 73 | 74 | Stepping back, 75 | data and the tools used to store it have a symbiotic relationship: 76 | we use tables and joins because it's efficient, 77 | provided our data is organized a certain way, 78 | but organize our data that way because we have tools to manipulate it efficiently. 79 | As anthropologists say, 80 | the tool shapes the hand that shapes the tool. 81 | 82 | ::::::::::::::::::::::::::::::::::::::: challenge 83 | 84 | ## Identifying Atomic Values 85 | 86 | Which of the following are atomic values? Which are not? Why? 87 | 88 | - New Zealand 89 | - 87 Turing Avenue 90 | - January 25, 1971 91 | - the XY coordinate (0.5, 3.3) 92 | 93 | ::::::::::::::: solution 94 | 95 | ## Solution 96 | 97 | New Zealand is the only clear-cut atomic value. 98 | 99 | The address and the XY coordinate contain more than one piece of information 100 | which should be stored separately: 101 | 102 | - House number, street name 103 | - X coordinate, Y coordinate 104 | 105 | The date entry is less clear cut, because it contains month, day, and year elements. 106 | However, there is a `DATE` datatype in SQL, and dates should be stored using this format. 107 | If we need to work with the month, day, or year separately, we can use the SQL functions available for our database software 108 | (for example [`EXTRACT`](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm) or [`STRFTIME`](https://www.sqlite.org/lang_datefunc.html) for SQLite). 
109 | 110 | 111 | 112 | ::::::::::::::::::::::::: 113 | 114 | :::::::::::::::::::::::::::::::::::::::::::::::::: 115 | 116 | ::::::::::::::::::::::::::::::::::::::: challenge 117 | 118 | ## Identifying a Primary Key 119 | 120 | What is the primary key in this table? 121 | I.e., what value or combination of values uniquely identifies a record? 122 | 123 | | latitude | longitude | date | temperature | 124 | | -------- | --------- | ---------- | ----------- | 125 | | 57\.3 | \-22.5 | 2015-01-09 | \-14.2 | 126 | 127 | ::::::::::::::: solution 128 | 129 | ## Solution 130 | 131 | Latitude, longitude, and date are all required to uniquely identify the temperature record. 132 | 133 | 134 | 135 | ::::::::::::::::::::::::: 136 | 137 | :::::::::::::::::::::::::::::::::::::::::::::::::: 138 | 139 | :::::::::::::::::::::::::::::::::::::::: keypoints 140 | 141 | - Every value in a database should be atomic. 142 | - Every record should have a unique primary key. 143 | - A database should not contain redundant information. 144 | - Units and similar metadata should be stored with the data. 
145 | 146 | :::::::::::::::::::::::::::::::::::::::::::::::::: 147 | 148 | 149 | -------------------------------------------------------------------------------- /episodes/fig/sql-join-structure.svg: -------------------------------------------------------------------------------- 1 | 2 | 5 | ]> 6 | 8 | 9 | 11 | 12 | G 13 | 14 | 15 | Person 16 | 17 | Person 18 | 19 | id 20 | 21 | text 22 | 23 | personal 24 | 25 | text 26 | 27 | family 28 | 29 | text 30 | 31 | 32 | Survey 33 | 34 | Survey 35 | 36 | taken 37 | 38 | integer 39 | 40 | person 41 | 42 | text 43 | 44 | quant 45 | 46 | text 47 | 48 | reading 49 | 50 | real 51 | 52 | 53 | Person->Survey 54 | 55 | 56 | 57 | 58 | Site 59 | 60 | Site 61 | 62 | name 63 | 64 | text 65 | 66 | lat 67 | 68 | real 69 | 70 | long 71 | 72 | real 73 | 74 | 75 | Visited 76 | 77 | Visited 78 | 79 | id 80 | 81 | integer 82 | 83 | site 84 | 85 | text 86 | 87 | dated 88 | 89 | text 90 | 91 | 92 | Site->Visited 93 | 94 | 95 | 96 | 97 | Visited->Survey 98 | 99 | 100 | 101 | 102 | 103 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing 2 | 3 | [The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data 4 | Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source 5 | projects, and we welcome contributions of all kinds: new lessons, fixes to 6 | existing material, bug reports, and reviews of proposed changes are all 7 | welcome. 8 | 9 | ### Contributor Agreement 10 | 11 | By contributing, you agree that we may redistribute your work under [our 12 | license](LICENSE.md). In exchange, we will address your issues and/or assess 13 | your change proposal as promptly as we can, and help you become a member of our 14 | community. Everyone involved in [The Carpentries][cp-site] agrees to abide by 15 | our [code of conduct](CODE_OF_CONDUCT.md). 
16 | 17 | ### How to Contribute 18 | 19 | The easiest way to get started is to file an issue to tell us about a spelling 20 | mistake, some awkward wording, or a factual error. This is a good way to 21 | introduce yourself and to meet some of our community members. 22 | 23 | 1. If you do not have a [GitHub][github] account, you can [send us comments by 24 | email][contact]. However, we will be able to respond more quickly if you use 25 | one of the other methods described below. 26 | 27 | 2. If you have a [GitHub][github] account, or are willing to [create 28 | one][github-join], but do not know how to use Git, you can report problems 29 | or suggest improvements by [creating an issue][repo-issues]. This allows us 30 | to assign the item to someone and to respond to it in a threaded discussion. 31 | 32 | 3. If you are comfortable with Git, and would like to add or change material, 33 | you can submit a pull request (PR). Instructions for doing this are 34 | [included below](#using-github). For inspiration about changes that need to 35 | be made, check out the [list of open issues][issues] across the Carpentries. 36 | 37 | Note: if you want to build the website locally, please refer to [The Workbench 38 | documentation][template-doc]. 39 | 40 | ### Where to Contribute 41 | 42 | 1. If you wish to change this lesson, add issues and pull requests here. 43 | 2. If you wish to change the template used for workshop websites, please refer 44 | to [The Workbench documentation][template-doc]. 45 | 46 | 47 | ### What to Contribute 48 | 49 | There are many ways to contribute, from writing new exercises and improving 50 | existing ones to updating or filling in the documentation and submitting [bug 51 | reports][issues] about things that do not work, are not clear, or are missing. 
52 | If you are looking for ideas, please see [the list of issues for this 53 | repository][repo-issues], or the issues for [Data Carpentry][dc-issues], 54 | [Library Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. 55 | 56 | Comments on issues and reviews of pull requests are just as welcome: we are 57 | smarter together than we are on our own. **Reviews from novices and newcomers 58 | are particularly valuable**: it's easy for people who have been using these 59 | lessons for a while to forget how impenetrable some of this material can be, so 60 | fresh eyes are always welcome. 61 | 62 | ### What *Not* to Contribute 63 | 64 | Our lessons already contain more material than we can cover in a typical 65 | workshop, so we are usually *not* looking for more concepts or tools to add to 66 | them. As a rule, if you want to introduce a new idea, you must (a) estimate how 67 | long it will take to teach and (b) explain what you would take out to make room 68 | for it. The first encourages contributors to be honest about requirements; the 69 | second, to think hard about priorities. 70 | 71 | We are also not looking for exercises or other material that only run on one 72 | platform. Our workshops typically contain a mixture of Windows, macOS, and 73 | Linux users; in order to be usable, our lessons must run equally well on all 74 | three. 75 | 76 | ### Using GitHub 77 | 78 | If you choose to contribute via GitHub, you may want to look at [How to 79 | Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we 80 | use [GitHub flow][github-flow] to manage changes: 81 | 82 | 1. Create a new branch in your desktop copy of this repository for each 83 | significant change. 84 | 2. Commit the change in that branch. 85 | 3. Push that branch to your fork of this repository on GitHub. 86 | 4. Submit a pull request from that branch to the [upstream repository][repo]. 87 | 5. 
If you receive feedback, make changes on your desktop and push to your 88 | branch on GitHub: the pull request will update automatically. 89 | 90 | NB: The published copy of the lesson is usually in the `main` branch. 91 | 92 | Each lesson has a team of maintainers who review issues and pull requests or 93 | encourage others to do so. The maintainers are community volunteers, and have 94 | final say over what gets merged into the lesson. 95 | 96 | ### Other Resources 97 | 98 | The Carpentries is a global organisation with volunteers and learners all over 99 | the world. We share values of inclusivity and a passion for sharing knowledge, 100 | teaching and learning. There are several ways to connect with The Carpentries 101 | community listed at including via social 102 | media, slack, newsletters, and email lists. You can also [reach us by 103 | email][contact]. 104 | 105 | [repo]: https://github.com/swcarpentry/sql-novice-survey 106 | [repo-issues]: https://github.com/swcarpentry/sql-novice-survey/issues 107 | [contact]: mailto:team@carpentries.org 108 | [cp-site]: https://carpentries.org/ 109 | [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry 110 | [dc-lessons]: https://datacarpentry.org/lessons/ 111 | [dc-site]: https://datacarpentry.org/ 112 | [discuss-list]: https://lists.software-carpentry.org/listinfo/discuss 113 | [github]: https://github.com 114 | [github-flow]: https://guides.github.com/introduction/flow/ 115 | [github-join]: https://github.com/join 116 | [how-contribute]: https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github 117 | [issues]: https://carpentries.org/help-wanted-issues/ 118 | [lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry 119 | [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry 120 | [swc-lessons]: https://software-carpentry.org/lessons/ 121 | [swc-site]: https://software-carpentry.org/ 122 | [lc-site]: https://librarycarpentry.org/ 123 | [template-doc]: 
https://carpentries.github.io/workbench/ 124 | -------------------------------------------------------------------------------- /.github/workflows/pr-comment.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Comment on the Pull Request" 2 | 3 | # read-write repo token 4 | # access to secrets 5 | on: 6 | workflow_run: 7 | workflows: ["Receive Pull Request"] 8 | types: 9 | - completed 10 | 11 | concurrency: 12 | group: pr-${{ github.event.workflow_run.pull_requests[0].number }} 13 | cancel-in-progress: true 14 | 15 | 16 | jobs: 17 | # Pull requests are valid if: 18 | # - they match the sha of the workflow run head commit 19 | # - they are open 20 | # - no .github files were committed 21 | test-pr: 22 | name: "Test if pull request is valid" 23 | runs-on: ubuntu-22.04 24 | if: > 25 | github.event.workflow_run.event == 'pull_request' && 26 | github.event.workflow_run.conclusion == 'success' 27 | outputs: 28 | is_valid: ${{ steps.check-pr.outputs.VALID }} 29 | payload: ${{ steps.check-pr.outputs.payload }} 30 | number: ${{ steps.get-pr.outputs.NUM }} 31 | msg: ${{ steps.check-pr.outputs.MSG }} 32 | steps: 33 | - name: 'Download PR artifact' 34 | id: dl 35 | uses: carpentries/actions/download-workflow-artifact@main 36 | with: 37 | run: ${{ github.event.workflow_run.id }} 38 | name: 'pr' 39 | 40 | - name: "Get PR Number" 41 | if: ${{ steps.dl.outputs.success == 'true' }} 42 | id: get-pr 43 | run: | 44 | unzip pr.zip 45 | echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT 46 | 47 | - name: "Fail if PR number was not present" 48 | id: bad-pr 49 | if: ${{ steps.dl.outputs.success != 'true' }} 50 | run: | 51 | echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.' 
52 | exit 1 53 | - name: "Get Invalid Hashes File" 54 | id: hash 55 | run: | 56 | echo "json<<EOF" >> $GITHUB_OUTPUT 57 | echo "$(curl -sL https://files.carpentries.org/invalid-hashes.json)" >> $GITHUB_OUTPUT 58 | echo "EOF" >> $GITHUB_OUTPUT 59 | - name: "Check PR" 60 | id: check-pr 61 | if: ${{ steps.dl.outputs.success == 'true' }} 62 | uses: carpentries/actions/check-valid-pr@main 63 | with: 64 | pr: ${{ steps.get-pr.outputs.NUM }} 65 | sha: ${{ github.event.workflow_run.head_sha }} 66 | headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire 67 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 68 | fail_on_error: true 69 | 70 | # Create an orphan branch on this repository with two commits 71 | # - the current HEAD of the md-outputs branch 72 | # - the output from running the current HEAD of the pull request through 73 | # the md generator 74 | create-branch: 75 | name: "Create Git Branch" 76 | needs: test-pr 77 | runs-on: ubuntu-22.04 78 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 79 | env: 80 | NR: ${{ needs.test-pr.outputs.number }} 81 | permissions: 82 | contents: write 83 | steps: 84 | - name: 'Checkout md outputs' 85 | uses: actions/checkout@v4 86 | with: 87 | ref: md-outputs 88 | path: built 89 | fetch-depth: 1 90 | 91 | - name: 'Download built markdown' 92 | id: dl 93 | uses: carpentries/actions/download-workflow-artifact@main 94 | with: 95 | run: ${{ github.event.workflow_run.id }} 96 | name: 'built' 97 | 98 | - if: ${{ steps.dl.outputs.success == 'true' }} 99 | run: unzip built.zip 100 | 101 | - name: "Create orphan and push" 102 | if: ${{ steps.dl.outputs.success == 'true' }} 103 | run: | 104 | cd built/ 105 | git config --local user.email "actions@github.com" 106 | git config --local user.name "GitHub Actions" 107 | CURR_HEAD=$(git rev-parse HEAD) 108 | git checkout --orphan md-outputs-PR-${NR} 109 | git add -A 110 | git commit -m "source commit: ${CURR_HEAD}" 111 | ls -A | grep -v '^.git$' | xargs -I _ rm -r '_' 112 | cd .. 
113 | unzip -o -d built built.zip 114 | cd built 115 | git add -A 116 | git commit --allow-empty -m "differences for PR #${NR}" 117 | git push -u --force --set-upstream origin md-outputs-PR-${NR} 118 | 119 | # Comment on the Pull Request with a link to the branch and the diff 120 | comment-pr: 121 | name: "Comment on Pull Request" 122 | needs: [test-pr, create-branch] 123 | runs-on: ubuntu-22.04 124 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 125 | env: 126 | NR: ${{ needs.test-pr.outputs.number }} 127 | permissions: 128 | pull-requests: write 129 | steps: 130 | - name: 'Download comment artifact' 131 | id: dl 132 | uses: carpentries/actions/download-workflow-artifact@main 133 | with: 134 | run: ${{ github.event.workflow_run.id }} 135 | name: 'diff' 136 | 137 | - if: ${{ steps.dl.outputs.success == 'true' }} 138 | run: unzip ${{ github.workspace }}/diff.zip 139 | 140 | - name: "Comment on PR" 141 | id: comment-diff 142 | if: ${{ steps.dl.outputs.success == 'true' }} 143 | uses: carpentries/actions/comment-diff@main 144 | with: 145 | pr: ${{ env.NR }} 146 | path: ${{ github.workspace }}/diff.md 147 | 148 | # Comment if the PR is open and matches the SHA, but the workflow files have 149 | # changed 150 | comment-changed-workflow: 151 | name: "Comment if workflow files have changed" 152 | needs: test-pr 153 | runs-on: ubuntu-22.04 154 | if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }} 155 | env: 156 | NR: ${{ github.event.workflow_run.pull_requests[0].number }} 157 | body: ${{ needs.test-pr.outputs.msg }} 158 | permissions: 159 | pull-requests: write 160 | steps: 161 | - name: 'Check for spoofing' 162 | id: dl 163 | uses: carpentries/actions/download-workflow-artifact@main 164 | with: 165 | run: ${{ github.event.workflow_run.id }} 166 | name: 'built' 167 | 168 | - name: 'Alert if spoofed' 169 | id: spoof 170 | if: ${{ steps.dl.outputs.success == 'true' }} 171 | run: | 172 | echo 'body<<EOF' >> $GITHUB_ENV 173 | echo '' >> $GITHUB_ENV 174 | echo '## 
:x: DANGER :x:' >> $GITHUB_ENV 175 | echo 'This pull request has modified workflows that created output. Close this now.' >> $GITHUB_ENV 176 | echo '' >> $GITHUB_ENV 177 | echo 'EOF' >> $GITHUB_ENV 178 | 179 | - name: "Comment on PR" 180 | id: comment-diff 181 | uses: carpentries/actions/comment-diff@main 182 | with: 183 | pr: ${{ env.NR }} 184 | body: ${{ env.body }} 185 | -------------------------------------------------------------------------------- /episodes/04-calc.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Calculating New Values 3 | teaching: 5 4 | exercises: 5 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that calculate new values for each selected record. 10 | 11 | :::::::::::::::::::::::::::::::::::::::::::::::::: 12 | 13 | :::::::::::::::::::::::::::::::::::::::: questions 14 | 15 | - How can I calculate new values on the fly? 16 | 17 | :::::::::::::::::::::::::::::::::::::::::::::::::: 18 | 19 | After carefully re-reading the expedition logs, 20 | we realize that the radiation measurements they report 21 | may need to be corrected upward by 5%. 22 | Rather than modifying the stored data, 23 | we can do this calculation on the fly 24 | as part of our query: 25 | 26 | ```sql 27 | SELECT 1.05 * reading FROM Survey WHERE quant = 'rad'; 28 | ``` 29 | 30 | | 1\.05 \* reading | 31 | | ------------------------- | 32 | | 10\.311 | 33 | | 8\.19 | 34 | | 8\.8305 | 35 | | 7\.581 | 36 | | 4\.5675 | 37 | | 2\.2995 | 38 | | 1\.533 | 39 | | 11\.8125 | 40 | 41 | When we run the query, 42 | the expression `1.05 * reading` is evaluated for each row. 43 | Expressions can use any of the fields, 44 | all of the usual arithmetic operators, 45 | and a variety of common functions. 46 | (Exactly which ones depends on which database manager is being used.)
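As an aside (not part of the lesson's `survey.db`), you can check that such calculated values are computed on the fly and never written back to the table, for example with Python's built-in `sqlite3` module and a throwaway in-memory table seeded with the first two radiation readings:

```python
import sqlite3

# Throwaway in-memory database with a miniature Survey table,
# standing in for the lesson's survey.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Survey (taken INTEGER, quant TEXT, reading REAL)")
conn.executemany("INSERT INTO Survey VALUES (?, ?, ?)",
                 [(619, "rad", 9.82), (622, "rad", 7.80)])

# The correction is part of the query, not the data ...
corrected = conn.execute(
    "SELECT 1.05 * reading FROM Survey WHERE quant = 'rad'").fetchall()

# ... so the stored readings are unchanged afterwards.
stored = conn.execute("SELECT reading FROM Survey").fetchall()
print(corrected)
print(stored)  # still the original readings
```

The same holds in the `sqlite3` shell: a calculated column exists only in the query's output.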
47 | For example, 48 | we can convert temperature readings from Fahrenheit to Celsius 49 | and round to two decimal places: 50 | 51 | ```sql 52 | SELECT taken, round(5 * (reading - 32) / 9, 2) FROM Survey WHERE quant = 'temp'; 53 | ``` 54 | 55 | | taken | round(5\*(reading-32)/9, 2) | 56 | | ------------------------- | -------------------------- | 57 | | 734 | \-29.72 | 58 | | 735 | \-32.22 | 59 | | 751 | \-28.06 | 60 | | 752 | \-26.67 | 61 | 62 | As you can see from this example, though, the string describing our 63 | new field (generated from the equation) can become quite unwieldy. SQL 64 | allows us to rename our fields, any field for that matter, whether it 65 | was calculated or one of the existing fields in our database, for 66 | succinctness and clarity. For example, we could write the previous 67 | query as: 68 | 69 | ```sql 70 | SELECT taken, round(5 * (reading - 32) / 9, 2) as Celsius FROM Survey WHERE quant = 'temp'; 71 | ``` 72 | 73 | | taken | Celsius | 74 | | ------------------------- | -------------------------- | 75 | | 734 | \-29.72 | 76 | | 735 | \-32.22 | 77 | | 751 | \-28.06 | 78 | | 752 | \-26.67 | 79 | 80 | We can also combine values from different fields, 81 | for example by using the string concatenation operator `||`: 82 | 83 | ```sql 84 | SELECT personal || ' ' || family FROM Person; 85 | ``` 86 | 87 | | personal || ' ' || family | 88 | | ------------------------- | 89 | | William Dyer | 90 | | Frank Pabodie | 91 | | Anderson Lake | 92 | | Valentina Roerich | 93 | | Frank Danforth | 94 | 95 | ::::::::::::::::::::::::::::::::::::::: challenge 96 | 97 | ## Fixing Salinity Readings 98 | 99 | After further reading, 100 | we realize that Valentina Roerich 101 | was reporting salinity as percentages. 102 | Write a query that returns all of her salinity measurements 103 | from the `Survey` table 104 | with the values divided by 100. 
105 | 106 | ::::::::::::::: solution 107 | 108 | ## Solution 109 | 110 | ```sql 111 | SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal'; 112 | ``` 113 | 114 | | taken | reading / 100 | 115 | | ------------------------- | -------------------------- | 116 | | 752 | 0\.416 | 117 | | 837 | 0\.225 | 118 | 119 | ::::::::::::::::::::::::: 120 | 121 | :::::::::::::::::::::::::::::::::::::::::::::::::: 122 | 123 | ::::::::::::::::::::::::::::::::::::::: challenge 124 | 125 | ## Unions 126 | 127 | The `UNION` operator combines the results of two queries: 128 | 129 | ```sql 130 | SELECT * FROM Person WHERE id = 'dyer' UNION SELECT * FROM Person WHERE id = 'roe'; 131 | ``` 132 | 133 | | id | personal | family | 134 | | ------------------------- | -------------------------- | ------- | 135 | | dyer | William | Dyer | 136 | | roe | Valentina | Roerich | 137 | 138 | The `UNION ALL` operator is equivalent to `UNION`, 139 | except that it does not eliminate duplicate rows. 140 | Instead, `UNION ALL` simply appends all rows from both queries 141 | and combines them into a single result table. 142 | `UNION`, by contrast, performs a `SELECT DISTINCT` on the result set. 143 | If all the records returned by your union are already unique, 144 | use `UNION ALL` instead; it gives faster results since it skips the `DISTINCT` step. 145 | 146 | For this section, we shall use `UNION`. 147 | 148 | Use `UNION` to create a consolidated list of salinity measurements 149 | in which Valentina Roerich's, and only Valentina's, 150 | have been corrected as described in the previous challenge. 
151 | The output should be something like: 152 | 153 | | taken | reading | 154 | | ------------------------- | -------------------------- | 155 | | 619 | 0\.13 | 156 | | 622 | 0\.09 | 157 | | 734 | 0\.05 | 158 | | 751 | 0\.1 | 159 | | 752 | 0\.09 | 160 | | 752 | 0\.416 | 161 | | 837 | 0\.21 | 162 | | 837 | 0\.225 | 163 | 164 | ::::::::::::::: solution 165 | 166 | ## Solution 167 | 168 | ```sql 169 | SELECT taken, reading FROM Survey WHERE person != 'roe' AND quant = 'sal' UNION SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal' ORDER BY taken ASC; 170 | ``` 171 | 172 | ::::::::::::::::::::::::: 173 | 174 | :::::::::::::::::::::::::::::::::::::::::::::::::: 175 | 176 | ::::::::::::::::::::::::::::::::::::::: challenge 177 | 178 | ## Selecting Major Site Identifiers 179 | 180 | The site identifiers in the `Visited` table have two parts 181 | separated by a '-': 182 | 183 | ```sql 184 | SELECT DISTINCT site FROM Visited; 185 | ``` 186 | 187 | | site | 188 | | ------------------------- | 189 | | DR-1 | 190 | | DR-3 | 191 | | MSK-4 | 192 | 193 | Some major site identifiers (i.e. the letter codes) are two letters long and some are three. 194 | The "in string" function `instr(X, Y)` 195 | returns the 1-based index of the first occurrence of string Y in string X, 196 | or 0 if Y does not exist in X. 197 | The substring function `substr(X, I, [L])` 198 | returns the substring of X starting at index I, with an optional length L. 199 | Use these two functions to produce a list of unique major site identifiers. 200 | (For this data, 201 | the list should contain only "DR" and "MSK"). 
202 | 203 | ::::::::::::::: solution 204 | 205 | ## Solution 206 | 207 | ```sql 208 | SELECT DISTINCT substr(site, 1, instr(site, '-') - 1) AS MajorSite FROM Visited; 209 | ``` 210 | 211 | ::::::::::::::::::::::::: 212 | 213 | :::::::::::::::::::::::::::::::::::::::::::::::::: 214 | 215 | :::::::::::::::::::::::::::::::::::::::: keypoints 216 | 217 | - Queries can do the usual arithmetic operations on values. 218 | - Use UNION to combine the results of two or more queries. 219 | 220 | :::::::::::::::::::::::::::::::::::::::::::::::::: 221 | 222 | 223 | -------------------------------------------------------------------------------- /episodes/02-sort-dup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Sorting and Removing Duplicates 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that display results in a particular order. 10 | - Write queries that eliminate duplicate values from data. 11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I sort a query's results? 17 | - How can I remove duplicate values from a query's results? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | In beginning our examination of the Antarctic data, we want to know: 22 | 23 | - what kind of quantity measurements were taken at each site; 24 | - which scientists took measurements on the expedition; 25 | 26 | To determine which measurements were taken at each site, 27 | we can examine the `Survey` table. 28 | Data is often redundant, 29 | so queries often return redundant information. 
30 | For example, 31 | if we select the quantities that have been measured 32 | from the `Survey` table, 33 | we get this: 34 | 35 | ```sql 36 | SELECT quant FROM Survey; 37 | ``` 38 | 39 | | quant | 40 | | ---------- | 41 | | rad | 42 | | sal | 43 | | rad | 44 | | sal | 45 | | rad | 46 | | sal | 47 | | temp | 48 | | rad | 49 | | sal | 50 | | temp | 51 | | rad | 52 | | temp | 53 | | sal | 54 | | rad | 55 | | sal | 56 | | temp | 57 | | sal | 58 | | rad | 59 | | sal | 60 | | sal | 61 | | rad | 62 | 63 | This result makes it difficult to see all of the different types of 64 | `quant` in the Survey table. We can eliminate the redundant output to 65 | make the result more readable by adding the `DISTINCT` keyword to our 66 | query: 67 | 68 | ```sql 69 | SELECT DISTINCT quant FROM Survey; 70 | ``` 71 | 72 | | quant | 73 | | ---------- | 74 | | rad | 75 | | sal | 76 | | temp | 77 | 78 | If we want to determine which visits (stored in the `taken` column) 79 | have which `quant` measurements, 80 | we can use the `DISTINCT` keyword on multiple columns. 81 | If we select more than one column, 82 | distinct *sets* of values are returned 83 | (in this case *pairs*, because we are selecting two columns): 84 | 85 | ```sql 86 | SELECT DISTINCT taken, quant FROM Survey; 87 | ``` 88 | 89 | | taken | quant | 90 | | ---------- | --------- | 91 | | 619 | rad | 92 | | 619 | sal | 93 | | 622 | rad | 94 | | 622 | sal | 95 | | 734 | rad | 96 | | 734 | sal | 97 | | 734 | temp | 98 | | 735 | rad | 99 | | 735 | sal | 100 | | 735 | temp | 101 | | 751 | rad | 102 | | 751 | temp | 103 | | 751 | sal | 104 | | 752 | rad | 105 | | 752 | sal | 106 | | 752 | temp | 107 | | 837 | rad | 108 | | 837 | sal | 109 | | 844 | rad | 110 | 111 | Notice in both cases that duplicates are removed 112 | even if the rows they come from didn't appear to be adjacent in the database table. 113 | 114 | Our next task is to identify the scientists on the expedition by looking at the `Person` table. 
115 | As we mentioned earlier, 116 | database records are not stored in any particular order. 117 | This means that query results aren't necessarily sorted, 118 | and even if they are, 119 | we often want to sort them in a different way, 120 | e.g., by their identifier instead of by their personal name. 121 | We can do this in SQL by adding an `ORDER BY` clause to our query: 122 | 123 | ```sql 124 | SELECT * FROM Person ORDER BY id; 125 | ``` 126 | 127 | | id | personal | family | 128 | | ---------- | --------- | -------- | 129 | | danfort | Frank | Danforth | 130 | | dyer | William | Dyer | 131 | | lake | Anderson | Lake | 132 | | pb | Frank | Pabodie | 133 | | roe | Valentina | Roerich | 134 | 135 | By default, when we use `ORDER BY`, 136 | results are sorted in ascending order of the column we specify 137 | (i.e., 138 | from least to greatest). 139 | 140 | We can sort in the opposite order using `DESC` (for "descending"): 141 | 142 | ::::::::::::::::::::::::::::::::::::::::: callout 143 | 144 | ## A note on ordering 145 | 146 | While it may look as though the records are consistent every time we ask for them in this lesson, that is because no one has changed or modified any of the data so far. Remember to use `ORDER BY` if you want the rows returned to have any sort of consistent or predictable order. 147 | 148 | 149 | :::::::::::::::::::::::::::::::::::::::::::::::::: 150 | 151 | ```sql 152 | SELECT * FROM person ORDER BY id DESC; 153 | ``` 154 | 155 | | id | personal | family | 156 | | ---------- | --------- | -------- | 157 | | roe | Valentina | Roerich | 158 | | pb | Frank | Pabodie | 159 | | lake | Anderson | Lake | 160 | | dyer | William | Dyer | 161 | | danfort | Frank | Danforth | 162 | 163 | (And if we want to make it clear that we're sorting in ascending order, 164 | we can use `ASC` instead of `DESC`.) 165 | 166 | In order to look at which scientist measured quantities during each visit, 167 | we can look again at the `Survey` table. 
168 | We can also sort on several fields at once. 169 | For example, 170 | this query sorts results first in ascending order by `taken`, 171 | and then in descending order by `person` 172 | within each group of equal `taken` values: 173 | 174 | ```sql 175 | SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC; 176 | ``` 177 | 178 | | taken | person | quant | 179 | | ---------- | --------- | -------- | 180 | | 619 | dyer | rad | 181 | | 619 | dyer | sal | 182 | | 622 | dyer | rad | 183 | | 622 | dyer | sal | 184 | | 734 | pb | rad | 185 | | 734 | pb | temp | 186 | | 734 | lake | sal | 187 | | 735 | pb | rad | 188 | | 735 | \-null- | sal | 189 | | 735 | \-null- | temp | 190 | | 751 | pb | rad | 191 | | 751 | pb | temp | 192 | | 751 | lake | sal | 193 | | 752 | roe | sal | 194 | | 752 | lake | rad | 195 | | 752 | lake | sal | 196 | | 752 | lake | temp | 197 | | 837 | roe | sal | 198 | | 837 | lake | rad | 199 | | 837 | lake | sal | 200 | | 844 | roe | rad | 201 | 202 | This query gives us a good idea of which scientist was involved in which visit, 203 | and what measurements they performed during the visit. 204 | 205 | Looking at the table, it seems like some scientists specialized in 206 | certain kinds of measurements. We can examine which scientists 207 | performed which measurements by selecting the appropriate columns and 208 | removing duplicates. 209 | 210 | ```sql 211 | SELECT DISTINCT quant, person FROM Survey ORDER BY quant ASC; 212 | ``` 213 | 214 | | quant | person | 215 | | ---------- | --------- | 216 | | rad | dyer | 217 | | rad | pb | 218 | | rad | lake | 219 | | rad | roe | 220 | | sal | dyer | 221 | | sal | lake | 222 | | sal | \-null- | 223 | | sal | roe | 224 | | temp | pb | 225 | | temp | \-null- | 226 | | temp | lake | 227 | 228 | ::::::::::::::::::::::::::::::::::::::: challenge 229 | 230 | ## Finding Distinct Dates 231 | 232 | Write a query that selects distinct dates from the `Visited` table. 
233 | 234 | ::::::::::::::: solution 235 | 236 | ## Solution 237 | 238 | ```sql 239 | SELECT DISTINCT dated FROM Visited; 240 | ``` 241 | 242 | | dated | 243 | | ---------- | 244 | | 1927-02-08 | 245 | | 1927-02-10 | 246 | | 1930-01-07 | 247 | | 1930-01-12 | 248 | | 1930-02-26 | 249 | |   | 250 | | 1932-01-14 | 251 | | 1932-03-22 | 252 | 253 | ::::::::::::::::::::::::: 254 | 255 | :::::::::::::::::::::::::::::::::::::::::::::::::: 256 | 257 | ::::::::::::::::::::::::::::::::::::::: challenge 258 | 259 | ## Displaying Full Names 260 | 261 | Write a query that displays the full names of the scientists in the `Person` table, 262 | ordered by family name. 263 | 264 | ::::::::::::::: solution 265 | 266 | ## Solution 267 | 268 | ```sql 269 | SELECT personal, family FROM Person ORDER BY family ASC; 270 | ``` 271 | 272 | | personal | family | 273 | | ---------- | --------- | 274 | | Frank | Danforth | 275 | | William | Dyer | 276 | | Anderson | Lake | 277 | | Frank | Pabodie | 278 | | Valentina | Roerich | 279 | 280 | ::::::::::::::::::::::::: 281 | 282 | :::::::::::::::::::::::::::::::::::::::::::::::::: 283 | 284 | :::::::::::::::::::::::::::::::::::::::: keypoints 285 | 286 | - The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY. 287 | - The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT. 288 | 289 | :::::::::::::::::::::::::::::::::::::::::::::::::: 290 | 291 | 292 | -------------------------------------------------------------------------------- /episodes/05-null.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Missing Data 3 | teaching: 15 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain how databases represent missing information. 
10 | - Explain the three-valued logic databases use when manipulating missing information. 11 | - Write queries that handle missing information correctly. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How do databases represent missing information? 18 | - What special handling does missing information require? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | Real-world data is never complete --- there are always holes. 23 | Databases represent these holes using a special value called `null`. 24 | `null` is not zero, `False`, or the empty string; 25 | it is a one-of-a-kind value that means "nothing here". 26 | Dealing with `null` requires a few special tricks 27 | and some careful thinking. 28 | 29 | By default, SQLite does not display NULL values in its output. The `.nullvalue` 30 | command causes SQLite to display the value you specify for NULLs. We will use 31 | the value `-null-` to make the NULLs easier to see: 32 | 33 | ```sql 34 | .nullvalue -null- 35 | ``` 36 | 37 | To start, 38 | let's have a look at the `Visited` table. 39 | There are eight records, 40 | but #752 doesn't have a date --- or rather, 41 | its date is null: 42 | 43 | ```sql 44 | SELECT * FROM Visited; 45 | ``` 46 | 47 | | id | site | dated | 48 | | ----- | ------ | ---------- | 49 | | 619 | DR-1 | 1927-02-08 | 50 | | 622 | DR-1 | 1927-02-10 | 51 | | 734 | DR-3 | 1930-01-07 | 52 | | 735 | DR-3 | 1930-01-12 | 53 | | 751 | DR-3 | 1930-02-26 | 54 | | 752 | DR-3 | \-null- | 55 | | 837 | MSK-4 | 1932-01-14 | 56 | | 844 | DR-1 | 1932-03-22 | 57 | 58 | Null doesn't behave like other values. 
59 | If we select the records that come before 1930: 60 | 61 | ```sql 62 | SELECT * FROM Visited WHERE dated < '1930-01-01'; 63 | ``` 64 | 65 | | id | site | dated | 66 | | ----- | ------ | ---------- | 67 | | 619 | DR-1 | 1927-02-08 | 68 | | 622 | DR-1 | 1927-02-10 | 69 | 70 | we get two results, 71 | and if we select the ones that come during or after 1930: 72 | 73 | ```sql 74 | SELECT * FROM Visited WHERE dated >= '1930-01-01'; 75 | ``` 76 | 77 | | id | site | dated | 78 | | ----- | ------ | ---------- | 79 | | 734 | DR-3 | 1930-01-07 | 80 | | 735 | DR-3 | 1930-01-12 | 81 | | 751 | DR-3 | 1930-02-26 | 82 | | 837 | MSK-4 | 1932-01-14 | 83 | | 844 | DR-1 | 1932-03-22 | 84 | 85 | we get five, 86 | but record #752 isn't in either set of results. 87 | The reason is that 88 | `null<'1930-01-01'` 89 | is neither true nor false: 90 | null means, "We don't know," 91 | and if we don't know the value on the left side of a comparison, 92 | we don't know whether the comparison is true or false. 93 | Since databases represent "don't know" as null, 94 | the value of `null<'1930-01-01'` 95 | is actually `null`. 96 | `null>='1930-01-01'` is also null 97 | because we can't answer that question either. 98 | And since the only records kept by a `WHERE` 99 | are those for which the test is true, 100 | record #752 isn't included in either set of results. 101 | 102 | Comparisons aren't the only operations that behave this way with nulls. 103 | `1+null` is `null`, 104 | `5*null` is `null`, 105 | `log(null)` is `null`, 106 | and so on. 
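If you want to convince yourself of this propagation, it is quick to check from any SQL prompt; here is one way, as an aside, using Python's built-in `sqlite3` module (SQL `NULL` comes back to Python as `None`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Arithmetic with NULL yields NULL, which Python reports as None.
print(conn.execute("SELECT 1 + NULL, 5 * NULL").fetchone())

# Comparisons with NULL are NULL too -- neither true nor false --
# which is why a WHERE clause drops such rows.
print(conn.execute(
    "SELECT NULL < '1930-01-01', NULL >= '1930-01-01'").fetchone())
```

Both queries print `(None, None)`: the null swallows every expression it touches.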
107 | In particular, 108 | comparing things to null with = and != produces null: 109 | 110 | ```sql 111 | SELECT * FROM Visited WHERE dated = NULL; 112 | ``` 113 | 114 | produces no output, and neither does: 115 | 116 | ```sql 117 | SELECT * FROM Visited WHERE dated != NULL; 118 | ``` 119 | 120 | To check whether a value is `null` or not, 121 | we must use a special test `IS NULL`: 122 | 123 | ```sql 124 | SELECT * FROM Visited WHERE dated IS NULL; 125 | ``` 126 | 127 | | id | site | dated | 128 | | ----- | ------ | ---------- | 129 | | 752 | DR-3 | \-null- | 130 | 131 | or its inverse `IS NOT NULL`: 132 | 133 | ```sql 134 | SELECT * FROM Visited WHERE dated IS NOT NULL; 135 | ``` 136 | 137 | | id | site | dated | 138 | | ----- | ------ | ---------- | 139 | | 619 | DR-1 | 1927-02-08 | 140 | | 622 | DR-1 | 1927-02-10 | 141 | | 734 | DR-3 | 1930-01-07 | 142 | | 735 | DR-3 | 1930-01-12 | 143 | | 751 | DR-3 | 1930-02-26 | 144 | | 837 | MSK-4 | 1932-01-14 | 145 | | 844 | DR-1 | 1932-03-22 | 146 | 147 | Null values can cause headaches wherever they appear. 148 | For example, 149 | suppose we want to find all the salinity measurements 150 | that weren't taken by Lake. 151 | It's natural to write the query like this: 152 | 153 | ```sql 154 | SELECT * FROM Survey WHERE quant = 'sal' AND person != 'lake'; 155 | ``` 156 | 157 | | taken | person | quant | reading | 158 | | ----- | ------ | ---------- | ------- | 159 | | 619 | dyer | sal | 0\.13 | 160 | | 622 | dyer | sal | 0\.09 | 161 | | 752 | roe | sal | 41\.6 | 162 | | 837 | roe | sal | 22\.5 | 163 | 164 | but this query omits the records 165 | where we don't know who took the measurement. 166 | Once again, 167 | the reason is that when `person` is `null`, 168 | the `!=` comparison produces `null`, 169 | so the record isn't kept in our results. 
170 | If we want to keep these records 171 | we need to add an explicit check: 172 | 173 | ```sql 174 | SELECT * FROM Survey WHERE quant = 'sal' AND (person != 'lake' OR person IS NULL); 175 | ``` 176 | 177 | | taken | person | quant | reading | 178 | | ----- | ------ | ---------- | ------- | 179 | | 619 | dyer | sal | 0\.13 | 180 | | 622 | dyer | sal | 0\.09 | 181 | | 735 | \-null- | sal | 0\.06 | 182 | | 752 | roe | sal | 41\.6 | 183 | | 837 | roe | sal | 22\.5 | 184 | 185 | We still have to decide whether this is the right thing to do or not. 186 | If we want to be absolutely sure that 187 | we aren't including any measurements by Lake in our results, 188 | we need to exclude all the records for which we don't know who did the work. 189 | 190 | In contrast to arithmetic or Boolean operators, aggregation functions 191 | that combine multiple values, such as `min`, `max` or `avg`, *ignore* 192 | `null` values. In the majority of cases, this is desirable: 193 | for example, unknown values then do not affect the result when we 194 | are averaging our data. Aggregation functions will be addressed in more 195 | detail in [the next section](06-agg.md). 196 | 197 | ::::::::::::::::::::::::::::::::::::::: challenge 198 | 199 | ## Sorting by Known Date 200 | 201 | Write a query that sorts the records in `Visited` by date, 202 | omitting entries for which the date is not known 203 | (i.e., is null). 
204 | 205 | ::::::::::::::: solution 206 | 207 | ## Solution 208 | 209 | ```sql 210 | SELECT * FROM Visited WHERE dated IS NOT NULL ORDER BY dated ASC; 211 | ``` 212 | 213 | | id | site | dated | 214 | | ----- | ------ | ---------- | 215 | | 619 | DR-1 | 1927-02-08 | 216 | | 622 | DR-1 | 1927-02-10 | 217 | | 734 | DR-3 | 1930-01-07 | 218 | | 735 | DR-3 | 1930-01-12 | 219 | | 751 | DR-3 | 1930-02-26 | 220 | | 837 | MSK-4 | 1932-01-14 | 221 | | 844 | DR-1 | 1932-03-22 | 222 | 223 | ::::::::::::::::::::::::: 224 | 225 | :::::::::::::::::::::::::::::::::::::::::::::::::: 226 | 227 | ::::::::::::::::::::::::::::::::::::::: challenge 228 | 229 | ## NULL in a Set 230 | 231 | What do you expect the following query to produce? 232 | 233 | ```sql 234 | SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL); 235 | ``` 236 | 237 | What does it actually produce? 238 | 239 | ::::::::::::::: solution 240 | 241 | ## Solution 242 | 243 | You might expect the above query to return rows where dated is either '1927-02-08' or NULL. 244 | Instead it only returns rows where dated is '1927-02-08', the same as you would get from this 245 | simpler query: 246 | 247 | ```sql 248 | SELECT * FROM Visited WHERE dated IN ('1927-02-08'); 249 | ``` 250 | 251 | The reason is that the `IN` operator works with a set of *values*, but NULL is by definition 252 | not a value and is therefore simply ignored. 253 | 254 | If we wanted to actually include NULL, we would have to rewrite the query to use the IS NULL condition: 255 | 256 | ```sql 257 | SELECT * FROM Visited WHERE dated = '1927-02-08' OR dated IS NULL; 258 | ``` 259 | 260 | ::::::::::::::::::::::::: 261 | 262 | :::::::::::::::::::::::::::::::::::::::::::::::::: 263 | 264 | ::::::::::::::::::::::::::::::::::::::: challenge 265 | 266 | ## Pros and Cons of Sentinels 267 | 268 | Some database designers prefer to use 269 | a [sentinel value](../learners/reference.md#sentinel-value) 270 | to mark missing data rather than `null`. 
271 | For example, 272 | they will use the date "0000-00-00" to mark a missing date, 273 | or -1.0 to mark a missing salinity or radiation reading 274 | (since actual readings cannot be negative). 275 | What does this simplify? 276 | What burdens or risks does it introduce? 277 | 278 | 279 | :::::::::::::::::::::::::::::::::::::::::::::::::: 280 | 281 | :::::::::::::::::::::::::::::::::::::::: keypoints 282 | 283 | - Databases use a special value called NULL to represent missing information. 284 | - Almost all operations on NULL produce NULL. 285 | - Queries can test for NULLs using IS NULL and IS NOT NULL. 286 | 287 | :::::::::::::::::::::::::::::::::::::::::::::::::: 288 | 289 | 290 | -------------------------------------------------------------------------------- /.github/workflows/README.md: -------------------------------------------------------------------------------- 1 | # Carpentries Workflows 2 | 3 | This directory contains workflows to be used for Lessons using the {sandpaper} 4 | lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml` 5 | and `pr-receive.yaml`) and the rest are bots to handle pull request management. 6 | 7 | These workflows will likely change as {sandpaper} evolves, so it is important to 8 | keep them up-to-date. To do this in your lesson you can do the following in your 9 | R console: 10 | 11 | ```r 12 | # Install/Update sandpaper 13 | options(repos = c(carpentries = "https://carpentries.r-universe.dev/", 14 | CRAN = "https://cloud.r-project.org")) 15 | install.packages("sandpaper") 16 | 17 | # update the workflows in your lesson 18 | library("sandpaper") 19 | update_github_workflows() 20 | ``` 21 | 22 | Inside this folder, you will find a file called `sandpaper-version.txt`, which 23 | will contain a version number for sandpaper. This will be used in the future to 24 | alert you if a workflow update is needed. 
25 | 26 | What follows are the descriptions of the workflow files: 27 | 28 | ## Deployment 29 | 30 | ### 01 Build and Deploy (sandpaper-main.yaml) 31 | 32 | This is the main driver that will only act on the main branch of the repository. 33 | This workflow does the following: 34 | 35 | 1. checks out the lesson 36 | 2. provisions the following resources 37 | - R 38 | - pandoc 39 | - lesson infrastructure (stored in a cache) 40 | - lesson dependencies if needed (stored in a cache) 41 | 3. builds the lesson via `sandpaper:::ci_deploy()` 42 | 43 | #### Caching 44 | 45 | This workflow has two caches; one cache is for the lesson infrastructure and 46 | the other is for the lesson dependencies if the lesson contains rendered 47 | content. These caches are invalidated by new versions of the infrastructure and 48 | the `renv.lock` file, respectively. If there is a problem with the cache, 49 | manual invalidation is necessary. You will need maintainer access to the repository, 50 | and you can either go to the actions tab and [click on the caches button to find 51 | and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/) 52 | or set the `CACHE_VERSION` secret to the current date (which will 53 | invalidate all of the caches). 54 | 55 | ## Updates 56 | 57 | ### Setup Information 58 | 59 | These workflows run on a schedule and at the maintainer's request. Because they 60 | create pull requests that update workflows/require the downstream actions to run, 61 | they need a special repository/organization secret token called 62 | `SANDPAPER_WORKFLOW` and it must have the `public_repo` and `workflow` scope. 63 | 64 | This can be an individual user token, OR it can be a trusted bot account. 
If you 65 | have a repository in one of the official Carpentries accounts, then you do not 66 | need to worry about this token being present because the Carpentries Core Team 67 | will take care of supplying this token. 68 | 69 | If you want to use your personal account: you can go to 70 | 71 | to create a token. Once you have created your token, you should copy it to your 72 | clipboard and then go to your repository's settings > secrets > actions and 73 | create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token. 74 | 75 | If you do not specify your token correctly, the runs will not fail; instead, they will 76 | give you instructions for providing the token for your repository. 77 | 78 | ### 02 Maintain: Update Workflow Files (update-workflows.yaml) 79 | 80 | The {sandpaper} repository was designed to do as much as possible to separate 81 | the tools from the content. For local builds, this is absolutely true, but 82 | there is a minor issue when it comes to workflow files: they must live inside 83 | the repository. 84 | 85 | This workflow ensures that the workflow files are up-to-date. The way it works is 86 | to download the update-workflows.sh script from GitHub and run it. The script 87 | will do the following: 88 | 89 | 1. check the recorded version of sandpaper against the current version on GitHub 90 | 2. update the files if there is a difference in versions 91 | 92 | After the files are updated, if there are any changes, they are pushed to a 93 | branch called `update/workflows` and a pull request is created. Maintainers are 94 | encouraged to review the changes and accept the pull request if the outputs 95 | are okay. 96 | 97 | This update is run weekly or on demand. 98 | 99 | ### 03 Maintain: Update Package Cache (update-cache.yaml) 100 | 101 | For lessons that have generated content, we use {renv} to ensure that the output 102 | is stable.
This is controlled by a single lockfile which documents the packages 103 | needed for the lesson and the version numbers. This workflow is skipped in 104 | lessons that do not have generated content. 105 | 106 | Because the lessons need to remain current with the package ecosystem, it's a 107 | good idea to make sure these packages can be updated periodically. The 108 | update cache workflow will do this by checking for updates, applying them in a 109 | branch called `updates/packages` and creating a pull request with _only the 110 | lockfile changed_. 111 | 112 | From here, the markdown documents will be rebuilt and you can inspect what has 113 | changed based on how the packages have updated. 114 | 115 | ## Pull Request and Review Management 116 | 117 | Because our lessons execute code, pull requests are a security risk for any 118 | lesson and thus have security measures associated with them. **Do not merge any 119 | pull requests that do not pass checks or that do not have bot comments on them.** 120 | 121 | These workflows all go together and are described in the following 122 | diagram and the sections below: 123 | 124 | ![Graph representation of a pull request](https://carpentries.github.io/sandpaper/articles/img/pr-flow.dot.svg) 125 | 126 | ### Pre Flight Pull Request Validation (pr-preflight.yaml) 127 | 128 | This workflow runs every time a pull request is created and its purpose is to 129 | validate that the pull request is okay to run. This means the following things: 130 | 131 | 1. The pull request does not contain modified workflow files 132 | 2. If the pull request contains modified workflow files, it does not contain 133 | modified content files (such as a situation where @carpentries-bot will 134 | make an automated pull request) 135 | 3. The pull request does not contain an invalid commit hash (e.g. from a fork 136 | that was made before a lesson was transitioned from styles to use the 137 | workbench).
138 | 139 | Once the checks are finished, a comment is issued to the pull request, which 140 | will allow maintainers to determine if it is safe to run the 141 | "Receive Pull Request" workflow from new contributors. 142 | 143 | ### Receive Pull Request (pr-receive.yaml) 144 | 145 | **Note of caution:** This workflow runs arbitrary code submitted by anyone who creates a 146 | pull request. GitHub has safeguarded the token used in this workflow to have no 147 | privileges in the repository, but we have taken precautions to protect against 148 | spoofing. 149 | 150 | This workflow is triggered with every push to a pull request. If this workflow 151 | is already running and a new push is sent to the pull request, the workflow 152 | running from the previous push will be cancelled and a new workflow run will be 153 | started. 154 | 155 | The first step of this workflow is to check if it is valid (e.g. that no 156 | workflow files have been modified). If there are workflow files that have been 157 | modified, a comment is made that indicates that the workflow is not run. If 158 | both a workflow file and lesson content are modified, an error will occur. 159 | 160 | The second step (if valid) is to build the generated content from the pull 161 | request. This builds the content and uploads three artifacts: 162 | 163 | 1. The pull request number (pr) 164 | 2. A summary of changes after the rendering process (diff) 165 | 3. The rendered files (build) 166 | 167 | Because this workflow builds generated content, it follows the same general 168 | process as the `sandpaper-main` workflow with the same caching mechanisms. 169 | 170 | The artifacts produced are used by the next workflow. 171 | 172 | ### Comment on Pull Request (pr-comment.yaml) 173 | 174 | This workflow is triggered if the `pr-receive.yaml` workflow is successful. 175 | The steps in this workflow are: 176 | 177 | 1. Test if the workflow is valid and comment the validity of the workflow to the 178 | pull request. 179 | 2.
If it is valid: create an orphan branch with two commits: the current state 180 | of the repository and the proposed changes. 181 | 3. If it is valid: update the pull request comment with the summary of changes. 182 | 183 | Importantly: if the pull request is invalid, the branch is not created so any 184 | malicious code is not published. 185 | 186 | From here, the maintainer can request changes from the author and eventually 187 | either merge or reject the PR. When this happens, if the PR was valid, the 188 | preview branch needs to be deleted. 189 | 190 | ### Send Close PR Signal (pr-close-signal.yaml) 191 | 192 | Triggered any time a pull request is closed. This emits an artifact that is the 193 | pull request number for the next action. 194 | 195 | ### Remove Pull Request Branch (pr-post-remove-branch.yaml) 196 | 197 | Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with 198 | the pull request (if it was created). 199 | -------------------------------------------------------------------------------- /episodes/11-prog-R.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Programming with Databases - R 3 | teaching: 30 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write short programs that execute SQL queries. 10 | - Trace the execution of a program that contains an SQL query. 11 | - Explain why most database applications are written in a general-purpose language rather than in SQL. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I access databases from programs written in R? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | To close, 22 | let's have a look at how to access a database from 23 | a data analysis language like R.
24 | Other languages use almost exactly the same model: 25 | library and function names may differ, 26 | but the concepts are the same. 27 | 28 | Here's a short R program that selects latitudes and longitudes 29 | from an SQLite database stored in a file called `survey.db`: 30 | 31 | ```r 32 | library(RSQLite) 33 | connection <- dbConnect(SQLite(), "survey.db") 34 | results <- dbGetQuery(connection, "SELECT Site.lat, Site.long FROM Site;") 35 | print(results) 36 | dbDisconnect(connection) 37 | ``` 38 | 39 | ```output 40 | lat long 41 | 1 -49.85 -128.57 42 | 2 -47.15 -126.72 43 | 3 -48.87 -123.40 44 | ``` 45 | 46 | The program starts by importing the `RSQLite` library. 47 | If we were connecting to MySQL, DB2, or some other database, 48 | we would import a different library, 49 | but all of them provide the same functions, 50 | so that the rest of our program does not have to change 51 | (at least, not much) 52 | if we switch from one database to another. 53 | 54 | Line 2 establishes a connection to the database. 55 | Since we're using SQLite, 56 | all we need to specify is the name of the database file. 57 | Other systems may require us to provide a username and password as well. 58 | 59 | On line 3, we retrieve the results from an SQL query. 60 | It's our job to make sure that the SQL is properly formatted; 61 | if it isn't, 62 | or if something goes wrong when it is being executed, 63 | the database will report an error. 64 | The result is a dataframe with one row for each record returned and one column for each column selected by the query. 65 | 66 | Finally, the last line closes our connection, 67 | since the database can only keep a limited number of these open at one time. 68 | Since establishing a connection takes time, 69 | though, 70 | we shouldn't open a connection, 71 | do one operation, 72 | then close the connection, 73 | only to reopen it a few microseconds later to do another operation.
74 | Instead, 75 | it's normal to create one connection that stays open for the lifetime of the program. 76 | 77 | Queries in real applications will often depend on values provided by users. 78 | For example, 79 | this function takes a user's ID as a parameter and returns their name: 80 | 81 | ```r 82 | library(RSQLite) 83 | 84 | connection <- dbConnect(SQLite(), "survey.db") 85 | 86 | getName <- function(personID) { 87 | query <- paste0("SELECT personal || ' ' || family FROM Person WHERE id = '", 88 | personID, "';") 89 | return(dbGetQuery(connection, query)) 90 | } 91 | 92 | print(paste("full name for dyer:", getName('dyer'))) 93 | 94 | dbDisconnect(connection) 95 | ``` 96 | 97 | ```output 98 | full name for dyer: William Dyer 99 | ``` 100 | 101 | We use string concatenation on the first line of this function 102 | to construct a query containing the user ID we have been given. 103 | This seems simple enough, 104 | but what happens if someone gives us this string as input? 105 | 106 | ```sql 107 | dyer'; DROP TABLE Survey; SELECT ' 108 | ``` 109 | 110 | It looks like there's garbage after the user's ID, 111 | but it is very carefully chosen garbage. 112 | If we insert this string into our query, 113 | the result is: 114 | 115 | ```sql 116 | SELECT personal || ' ' || family FROM Person WHERE id = 'dyer'; DROP TABLE Survey; SELECT ''; 117 | ``` 118 | 119 | If we execute this, 120 | it will erase one of the tables in our database. 121 | 122 | This is called an [SQL injection attack](../learners/reference.md#sql-injection-attack), 123 | and it has been used to attack thousands of programs over the years. 124 | In particular, 125 | many web sites that take data from users insert values directly into queries 126 | without checking them carefully first.
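The attack is easy to reproduce safely against a throwaway database. The sketch below uses Python's built-in `sqlite3` module rather than R (the idea is identical in any language), and recreates the lesson's tables in memory so nothing of value is harmed. Note that `executescript()` is needed here because `execute()` refuses to run more than one statement at a time, which is itself one of the protections a standard database API gives you.

```python
import sqlite3

# Build a throwaway in-memory database with the lesson's table names.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Person(id text, personal text, family text)")
connection.execute("CREATE TABLE Survey(taken integer, person text, quant text, reading real)")
connection.execute("INSERT INTO Person VALUES ('dyer', 'William', 'Dyer')")

malicious = "dyer'; DROP TABLE Survey; SELECT '"

# String concatenation: the attacker's input becomes part of the SQL itself.
query = "SELECT personal || ' ' || family FROM Person WHERE id = '" + malicious + "';"
connection.executescript(query)  # runs *all* three statements, including the DROP

# List the tables that survived.
tables = [row[0] for row in
          connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['Person'] -- Survey is gone
```

Running the concatenated query with a parameterized call instead (as shown later in this episode) would have searched for the literal, harmless ID `"dyer'; DROP TABLE Survey; SELECT '"` and found nothing.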
127 | A very [relevant XKCD](https://xkcd.com/327/) explains the 128 | dangers of using raw input in queries a little more succinctly: 129 | 130 | ![](https://imgs.xkcd.com/comics/exploits_of_a_mom.png){alt='relevant XKCD'} 131 | 132 | Since an unscrupulous parent might try to smuggle commands into our queries in many different ways, 133 | the safest way to deal with this threat is 134 | to replace characters like quotes with their escaped equivalents, 135 | so that we can safely put whatever the user gives us inside a string. 136 | We can do this by using a [prepared statement](../learners/reference.md#prepared-statement) 137 | instead of formatting our statements as strings. 138 | Here's what our example program looks like if we do this: 139 | 140 | ```r 141 | library(RSQLite) 142 | connection <- dbConnect(SQLite(), "survey.db") 143 | 144 | getName <- function(personID) { 145 | query <- "SELECT personal || ' ' || family FROM Person WHERE id = ?" 146 | return(dbGetQuery(connection, query, params = list(personID))) 147 | } 148 | 149 | print(paste("full name for dyer:", getName('dyer'))) 150 | 151 | dbDisconnect(connection) 152 | ``` 153 | 154 | ```output 155 | full name for dyer: William Dyer 156 | ``` 157 | 158 | The key changes are in the query string and the `dbGetQuery` call, which now receives a `params` argument. 159 | Instead of formatting the query ourselves, 160 | we put question marks in the query template where we want to insert values. 161 | When we call `dbGetQuery` with `params`, 162 | we provide a list 163 | that contains as many values as there are question marks in the query. 164 | The library matches values to question marks in order, 165 | and translates any special characters in the values 166 | into their escaped equivalents 167 | so that they are safe to use. 168 | (Older versions of RSQLite provided a separate `dbGetPreparedQuery` function for this, but current versions use the `params` argument instead.) 169 | ::::::::::::::::::::::::::::::::::::::: challenge 170 | 171 | ## Filling a Table vs.
Printing Values 172 | 173 | Write an R program that creates a new database in a file called 174 | `original.db` containing a single table called `Pressure`, with a 175 | single field called `reading`, and inserts 100,000 random numbers 176 | between 10.0 and 25.0. How long does it take this program to run? 177 | How long does it take to run a program that simply writes those 178 | random numbers to a file? 179 | 180 | 181 | :::::::::::::::::::::::::::::::::::::::::::::::::: 182 | 183 | ::::::::::::::::::::::::::::::::::::::: challenge 184 | 185 | ## Filtering in SQL vs. Filtering in R 186 | 187 | Write an R program that creates a new database called 188 | `backup.db` with the same structure as `original.db` and copies all 189 | the values greater than 20.0 from `original.db` to `backup.db`. 190 | Which is faster: filtering values in the query, or reading 191 | everything into memory and filtering in R? 192 | 193 | 194 | :::::::::::::::::::::::::::::::::::::::::::::::::: 195 | 196 | ## Database helper functions in R 197 | 198 | R's database interface packages (like `RSQLite`) all share 199 | a common set of helper functions useful for exploring databases and 200 | reading/writing entire tables at once. 
201 | 202 | To view all tables in a database, we can use `dbListTables()`: 203 | 204 | ```r 205 | connection <- dbConnect(SQLite(), "survey.db") 206 | dbListTables(connection) 207 | ``` 208 | 209 | ```output 210 | "Person" "Site" "Survey" "Visited" 211 | ``` 212 | 213 | To view all column names of a table, use `dbListFields()`: 214 | 215 | ```r 216 | dbListFields(connection, "Survey") 217 | ``` 218 | 219 | ```output 220 | "taken" "person" "quant" "reading" 221 | ``` 222 | 223 | To read an entire table as a dataframe, use `dbReadTable()`: 224 | 225 | ```r 226 | dbReadTable(connection, "Person") 227 | ``` 228 | 229 | ```output 230 | id personal family 231 | 1 dyer William Dyer 232 | 2 pb Frank Pabodie 233 | 3 lake Anderson Lake 234 | 4 roe Valentina Roerich 235 | 5 danforth Frank Danforth 236 | ``` 237 | 238 | Finally to write an entire table to a database, you can use `dbWriteTable()`. 239 | Note that we will always want to use the `row.names = FALSE` argument or R 240 | will write the row names as a separate column. 241 | In this example we will write R's built-in `iris` dataset as a table in `survey.db`. 242 | 243 | ```r 244 | dbWriteTable(connection, "iris", iris, row.names = FALSE) 245 | head(dbReadTable(connection, "iris")) 246 | ``` 247 | 248 | ```output 249 | Sepal.Length Sepal.Width Petal.Length Petal.Width Species 250 | 1 5.1 3.5 1.4 0.2 setosa 251 | 2 4.9 3.0 1.4 0.2 setosa 252 | 3 4.7 3.2 1.3 0.2 setosa 253 | 4 4.6 3.1 1.5 0.2 setosa 254 | 5 5.0 3.6 1.4 0.2 setosa 255 | 6 5.4 3.9 1.7 0.4 setosa 256 | ``` 257 | 258 | And as always, remember to close the database connection when done! 259 | 260 | ```r 261 | dbDisconnect(connection) 262 | ``` 263 | 264 | :::::::::::::::::::::::::::::::::::::::: keypoints 265 | 266 | - Data analysis languages have libraries for accessing databases. 267 | - To connect to a database, a program must use a library specific to that database manager. 268 | - R's libraries can be used to directly query or read from a database. 
269 | - Programs can read query results in batches or all at once. 270 | - Queries should be written using parameter substitution, not string formatting. 271 | - R has multiple helper functions to make working with databases easier. 272 | 273 | :::::::::::::::::::::::::::::::::::::::::::::::::: 274 | 275 | 276 | -------------------------------------------------------------------------------- /episodes/09-create.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Creating and Modifying Data 3 | teaching: 15 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write statements that create tables. 10 | - Write statements to insert, modify, and delete records. 11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I create, modify, and delete tables and data? 17 | 18 | :::::::::::::::::::::::::::::::::::::::::::::::::: 19 | 20 | So far we have only looked at how to get information out of a database, 21 | both because that is more frequent than adding information, 22 | and because most other operations only make sense 23 | once queries are understood. 24 | 25 | The `Person`, `Survey`, `Site`, and `Visited` tables from the `survey.db` database were 26 | used during the earlier episodes. We're going to build a new database over the course 27 | of the upcoming episodes. Exit the `SQLite` interactive session if you're still in it. 28 | 29 | ```sql 30 | .exit 31 | ``` 32 | 33 | Launch `SQLite3` and create a new database, let's call it `newsurvey.db`. 34 | We use a different name to avoid confusion with the currently existing `survey.db` database. 35 | 36 | ``` 37 | $ sqlite3 newsurvey.db 38 | ``` 39 | 40 | Run the `.mode column` and `.header on` commands again if you aren't using the `.sqliterc` file.
41 | (Note: if you exited and restarted SQLite3, your settings will change back to the default.) 42 | 43 | ```sql 44 | .mode column 45 | .header on 46 | ``` 47 | 48 | If we want to create and modify data, 49 | we need to know two other sets of commands. 50 | 51 | The first pair are [`CREATE TABLE`][create-table] and [`DROP TABLE`][drop-table]. 52 | While they are written as two words, 53 | they are actually single commands. 54 | The first one creates a new table; 55 | its arguments are the names and types of the table's columns. 56 | For example, 57 | the following statements create the four tables in our survey database: 58 | 59 | ```sql 60 | CREATE TABLE Person(id text, personal text, family text); 61 | CREATE TABLE Site(name text, lat real, long real); 62 | CREATE TABLE Visited(id integer, site text, dated text); 63 | CREATE TABLE Survey(taken integer, person text, quant text, reading real); 64 | ``` 65 | 66 | We can get rid of one of our tables using: 67 | 68 | ```sql 69 | DROP TABLE Survey; 70 | ``` 71 | 72 | Be very careful when doing this: 73 | if you drop the wrong table, hope that the person maintaining the database has a backup, 74 | but it's better not to have to rely on it. 75 | 76 | Different database systems support different data types for table columns, 77 | but most provide the following: 78 | 79 | | data type | use | 80 | | --------- | ----------------------------------------- | 81 | | INTEGER | a signed integer | 82 | | REAL | a floating point number | 83 | | TEXT | a character string | 84 | | BLOB | a "binary large object", such as an image | 85 | 86 | Most databases also support Booleans and date/time values; 87 | SQLite uses the integers 0 and 1 for the former, 88 | and represents the latter as discussed [earlier](03-filter.md). 89 | An increasing number of databases also support geographic data types, 90 | such as latitude and longitude.
91 | Keeping track of what particular systems do or do not offer, 92 | and what names they give different data types, 93 | is an unending portability headache. 94 | 95 | When we create a table, 96 | we can specify several kinds of constraints on its columns. 97 | For example, 98 | a better definition for the `Survey` table would be: 99 | 100 | ```sql 101 | CREATE TABLE Survey( 102 | taken integer not null, -- where reading taken 103 | person text, -- may not know who took it 104 | quant text not null, -- the quantity measured 105 | reading real not null, -- the actual reading 106 | primary key(taken, person, quant), -- key is taken + person + quant 107 | foreign key(taken) references Visited(id), 108 | foreign key(person) references Person(id) 109 | ); 110 | ``` 111 | 112 | Once again, 113 | exactly what constraints are available 114 | and what they're called 115 | depends on which database manager we are using. 116 | 117 | Once tables have been created, 118 | we can add, change, and remove records using our other set of commands, 119 | `INSERT`, `UPDATE`, and `DELETE`. 120 | 121 | Here is an example of inserting rows into the `Site` table: 122 | 123 | ```sql 124 | INSERT INTO Site (name, lat, long) VALUES ('DR-1', -49.85, -128.57); 125 | INSERT INTO Site (name, lat, long) VALUES ('DR-3', -47.15, -126.72); 126 | INSERT INTO Site (name, lat, long) VALUES ('MSK-4', -48.87, -123.40); 127 | ``` 128 | 129 | We can also insert values into one table directly from another: 130 | 131 | ```sql 132 | CREATE TABLE JustLatLong(lat real, long real); 133 | INSERT INTO JustLatLong SELECT lat, long FROM Site; 134 | ``` 135 | 136 | Modifying existing records is done using the `UPDATE` statement. 137 | To do this we tell the database which table we want to update, 138 | what we want to change the values to for any or all of the fields, 139 | and under what conditions we should update the values. 
140 | 141 | For example, if we made a mistake when entering the lat and long values 142 | of the last `INSERT` statement above, we can correct it with an update: 143 | 144 | ```sql 145 | UPDATE Site SET lat = -47.87, long = -122.40 WHERE name = 'MSK-4'; 146 | ``` 147 | 148 | Be careful to not forget the `WHERE` clause or the update statement will 149 | modify *all* of the records in the `Site` table. 150 | 151 | Deleting records can be a bit trickier, 152 | because we have to ensure that the database remains internally consistent. 153 | If all we care about is a single table, 154 | we can use the `DELETE` command with a `WHERE` clause 155 | that matches the records we want to discard. 156 | For example, 157 | once we realize that Frank Danforth didn't take any measurements, 158 | we can remove him from the `Person` table like this: 159 | 160 | ```sql 161 | DELETE FROM Person WHERE id = 'danforth'; 162 | ``` 163 | 164 | But what if we removed Anderson Lake instead? 165 | Our `Survey` table would still contain seven records 166 | of measurements he'd taken, 167 | but that's never supposed to happen: 168 | `Survey.person` is a foreign key into the `Person` table, 169 | and all our queries assume there will be a row in the latter 170 | matching every value in the former. 171 | 172 | This problem is called [referential integrity](../learners/reference.md#referential-integrity): 173 | we need to ensure that all references between tables can always be resolved correctly. 174 | One way to do this is to delete all the records 175 | that use `'lake'` as a foreign key 176 | before deleting the record that uses it as a primary key. 177 | If our database manager supports it, 178 | we can automate this 179 | using [cascading delete](../learners/reference.md#cascading-delete). 180 | However, 181 | this technique is outside the scope of this chapter. 
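Although cascading delete is beyond the scope of this chapter, it is worth seeing once. The sketch below is in Python (its built-in `sqlite3` module, used here purely for illustration) against a throwaway in-memory database. The important parts are the `PRAGMA foreign_keys = ON` line, since SQLite does not enforce foreign keys by default, and the `ON DELETE CASCADE` clause in the table definition:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys by default

connection.execute("CREATE TABLE Person(id text primary key)")
connection.execute("""CREATE TABLE Survey(
    taken integer,
    person text REFERENCES Person(id) ON DELETE CASCADE,
    reading real)""")

connection.execute("INSERT INTO Person VALUES ('lake')")
connection.execute("INSERT INTO Survey VALUES (734, 'lake', 0.05)")
connection.execute("INSERT INTO Survey VALUES (751, 'lake', 0.1)")

# Deleting the Person row automatically removes the Survey rows referring to it,
# so no dangling references are left behind.
connection.execute("DELETE FROM Person WHERE id = 'lake'")
remaining = connection.execute("SELECT COUNT(*) FROM Survey").fetchone()[0]
print(remaining)  # 0
```

Without the `PRAGMA`, the same `DELETE` would succeed but leave two orphaned `Survey` rows behind, which is exactly the referential integrity problem described above.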
182 | 183 | ::::::::::::::::::::::::::::::::::::::::: callout 184 | 185 | ## Hybrid Storage Models 186 | 187 | Many applications use a hybrid storage model 188 | instead of putting everything into a database: 189 | the actual data (such as astronomical images) is stored in files, 190 | while the database stores the files' names, 191 | their modification dates, 192 | the region of the sky they cover, 193 | their spectral characteristics, 194 | and so on. 195 | This is also how most music player software is built: 196 | the database inside the application keeps track of the MP3 files, 197 | but the files themselves live on disk. 198 | 199 | 200 | :::::::::::::::::::::::::::::::::::::::::::::::::: 201 | 202 | ::::::::::::::::::::::::::::::::::::::: challenge 203 | 204 | ## Replacing NULL 205 | 206 | Write an SQL statement to replace all uses of `null` in 207 | `Survey.person` with the string `'unknown'`. 208 | 209 | ::::::::::::::: solution 210 | 211 | ## Solution 212 | 213 | ```sql 214 | UPDATE Survey SET person = 'unknown' WHERE person IS NULL; 215 | ``` 216 | 217 | ::::::::::::::::::::::::: 218 | 219 | :::::::::::::::::::::::::::::::::::::::::::::::::: 220 | 221 | ::::::::::::::::::::::::::::::::::::::: challenge 222 | 223 | ## Backing Up with SQL 224 | 225 | SQLite has several administrative commands that aren't part of the 226 | SQL standard. One of them is `.dump`, which prints the SQL commands 227 | needed to re-create the database. Another is `.read`, which reads a 228 | file created by `.dump` and restores the database. A colleague of 229 | yours thinks that storing dump files (which are text) in version 230 | control is a good way to track and manage changes to the database. 231 | What are the pros and cons of this approach? (Hint: records aren't 232 | stored in any particular order.) 
233 | 234 | ::::::::::::::: solution 235 | 236 | ## Solution 237 | 238 | #### Advantages 239 | 240 | - A version control system will be able to show differences between versions 241 | of the dump file, something it can't do for binary files like databases 242 | - A VCS only saves changes between versions, rather than a complete copy of 243 | each version (saving disk space) 244 | - The version control log will explain the reason for the changes in each version 245 | of the database 246 | 247 | #### Disadvantages 248 | 249 | - Artificial differences between commits because records don't have a fixed order 250 | 251 | 252 | 253 | ::::::::::::::::::::::::: 254 | 255 | :::::::::::::::::::::::::::::::::::::::::::::::::: 256 | 257 | [create-table]: https://www.sqlite.org/lang_createtable.html 258 | [drop-table]: https://www.sqlite.org/lang_droptable.html 259 | 260 | 261 | :::::::::::::::::::::::::::::::::::::::: keypoints 262 | 263 | - Use CREATE and DROP to create and delete tables. 264 | - Use INSERT to add data. 265 | - Use UPDATE to modify existing data. 266 | - Use DELETE to remove data. 267 | - It is simpler and safer to modify data when every record has a unique primary key. 268 | - Do not create dangling references by deleting records that other records refer to. 269 | 270 | :::::::::::::::::::::::::::::::::::::::::::::::::: 271 | 272 | 273 | -------------------------------------------------------------------------------- /episodes/03-filter.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Filtering 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write queries that select records that satisfy user-specified conditions. 10 | - Explain the order in which the clauses in a query are executed.
11 | 12 | :::::::::::::::::::::::::::::::::::::::::::::::::: 13 | 14 | :::::::::::::::::::::::::::::::::::::::: questions 15 | 16 | - How can I select subsets of data? 17 | 18 | :::::::::::::::::::::::::::::::::::::::::::::::::: 19 | 20 | One of the most powerful features of a database is 21 | the ability to [filter](../learners/reference.md#filter) data, 22 | i.e., 23 | to select only those records that match certain criteria. 24 | For example, 25 | suppose we want to see when a particular site was visited. 26 | We can select these records from the `Visited` table 27 | by using a `WHERE` clause in our query: 28 | 29 | ```sql 30 | SELECT * FROM Visited WHERE site = 'DR-1'; 31 | ``` 32 | 33 | | id | site | dated | 34 | | ------ | ------ | ---------- | 35 | | 619 | DR-1 | 1927-02-08 | 36 | | 622 | DR-1 | 1927-02-10 | 37 | | 844 | DR-1 | 1932-03-22 | 38 | 39 | The database manager executes this query in two stages. 40 | First, 41 | it checks each row in the `Visited` table 42 | to see which ones satisfy the `WHERE`. 43 | It then uses the column names following the `SELECT` keyword 44 | to determine which columns to display. 45 | 46 | This processing order means that 47 | we can filter records using `WHERE` 48 | based on values in columns that aren't then displayed: 49 | 50 | ```sql 51 | SELECT id FROM Visited WHERE site = 'DR-1'; 52 | ``` 53 | 54 | | id | 55 | | ------ | 56 | | 619 | 57 | | 622 | 58 | | 844 | 59 | 60 | ![](fig/sql-filter.svg){alt='SQL Filtering in Action'} 61 | 62 | We can use many other Boolean operators to filter our data.
63 | For example, 64 | we can ask for all information from the DR-1 site collected before 1930: 65 | 66 | ```sql 67 | SELECT * FROM Visited WHERE site = 'DR-1' AND dated < '1930-01-01'; 68 | ``` 69 | 70 | | id | site | dated | 71 | | ------ | ------ | ---------- | 72 | | 619 | DR-1 | 1927-02-08 | 73 | | 622 | DR-1 | 1927-02-10 | 74 | 75 | ::::::::::::::::::::::::::::::::::::::::: callout 76 | 77 | ## Date Types 78 | 79 | Most database managers have a special data type for dates. 80 | In fact, many have two: 81 | one for dates, 82 | such as "May 31, 1971", 83 | and one for durations, 84 | such as "31 days". 85 | SQLite doesn't: 86 | instead, 87 | it stores dates as either text 88 | (in the ISO-8601 standard format "YYYY-MM-DD HH:MM:SS.SSSS"), 89 | real numbers 90 | ([Julian days](https://en.wikipedia.org/wiki/Julian_day), the number of days since November 24, 4714 BCE), 91 | or integers 92 | ([Unix time](https://en.wikipedia.org/wiki/Unix_time), the number of seconds since midnight, January 1, 1970). 93 | If this sounds complicated, 94 | it is, 95 | but not nearly as complicated as figuring out 96 | [historical dates in Sweden](https://en.wikipedia.org/wiki/Swedish_calendar). 
97 | 98 | 99 | :::::::::::::::::::::::::::::::::::::::::::::::::: 100 | 101 | If we want to find out what measurements were taken by either Lake or Roerich, 102 | we can combine the tests on their names using `OR`: 103 | 104 | ```sql 105 | SELECT * FROM Survey WHERE person = 'lake' OR person = 'roe'; 106 | ``` 107 | 108 | | taken | person | quant | reading | 109 | | ------ | ------ | ---------- | ------- | 110 | | 734 | lake | sal | 0\.05 | 111 | | 751 | lake | sal | 0\.1 | 112 | | 752 | lake | rad | 2\.19 | 113 | | 752 | lake | sal | 0\.09 | 114 | | 752 | lake | temp | \-16.0 | 115 | | 752 | roe | sal | 41\.6 | 116 | | 837 | lake | rad | 1\.46 | 117 | | 837 | lake | sal | 0\.21 | 118 | | 837 | roe | sal | 22\.5 | 119 | | 844 | roe | rad | 11\.25 | 120 | 121 | Alternatively, 122 | we can use `IN` to see if a value is in a specific set: 123 | 124 | ```sql 125 | SELECT * FROM Survey WHERE person IN ('lake', 'roe'); 126 | ``` 127 | 128 | | taken | person | quant | reading | 129 | | ------ | ------ | ---------- | ------- | 130 | | 734 | lake | sal | 0\.05 | 131 | | 751 | lake | sal | 0\.1 | 132 | | 752 | lake | rad | 2\.19 | 133 | | 752 | lake | sal | 0\.09 | 134 | | 752 | lake | temp | \-16.0 | 135 | | 752 | roe | sal | 41\.6 | 136 | | 837 | lake | rad | 1\.46 | 137 | | 837 | lake | sal | 0\.21 | 138 | | 837 | roe | sal | 22\.5 | 139 | | 844 | roe | rad | 11\.25 | 140 | 141 | We can combine `AND` with `OR`, 142 | but we need to be careful about which operator is executed first. 
143 | If we *don't* use parentheses, 144 | we get this: 145 | 146 | ```sql 147 | SELECT * FROM Survey WHERE quant = 'sal' AND person = 'lake' OR person = 'roe'; 148 | ``` 149 | 150 | | taken | person | quant | reading | 151 | | ------ | ------ | ---------- | ------- | 152 | | 734 | lake | sal | 0\.05 | 153 | | 751 | lake | sal | 0\.1 | 154 | | 752 | lake | sal | 0\.09 | 155 | | 752 | roe | sal | 41\.6 | 156 | | 837 | lake | sal | 0\.21 | 157 | | 837 | roe | sal | 22\.5 | 158 | | 844 | roe | rad | 11\.25 | 159 | 160 | which is salinity measurements by Lake, 161 | and *any* measurement by Roerich. 162 | We probably want this instead: 163 | 164 | ```sql 165 | SELECT * FROM Survey WHERE quant = 'sal' AND (person = 'lake' OR person = 'roe'); 166 | ``` 167 | 168 | | taken | person | quant | reading | 169 | | ------ | ------ | ---------- | ------- | 170 | | 734 | lake | sal | 0\.05 | 171 | | 751 | lake | sal | 0\.1 | 172 | | 752 | lake | sal | 0\.09 | 173 | | 752 | roe | sal | 41\.6 | 174 | | 837 | lake | sal | 0\.21 | 175 | | 837 | roe | sal | 22\.5 | 176 | 177 | We can also filter by partial matches. For example, if we want to 178 | know something just about the site names beginning with "DR" we can 179 | use the `LIKE` keyword. The percent symbol acts as a 180 | [wildcard](../learners/reference.md#wildcard), matching any characters in that 181 | place. 
It can be used at the beginning, middle, or end of the string: 182 | 183 | ```sql 184 | SELECT * FROM Visited WHERE site LIKE 'DR%'; 185 | ``` 186 | 187 | | id | site | dated | 188 | | ------ | ------ | ---------- | 189 | | 619 | DR-1 | 1927-02-08 | 190 | | 622 | DR-1 | 1927-02-10 | 191 | | 734 | DR-3 | 1930-01-07 | 192 | | 735 | DR-3 | 1930-01-12 | 193 | | 751 | DR-3 | 1930-02-26 | 194 | | 752 | DR-3 | | 195 | | 844 | DR-1 | 1932-03-22 | 196 | 197 | Finally, 198 | we can use `DISTINCT` with `WHERE` 199 | to give a second level of filtering: 200 | 201 | ```sql 202 | SELECT DISTINCT person, quant FROM Survey WHERE person = 'lake' OR person = 'roe'; 203 | ``` 204 | 205 | | person | quant | 206 | | ------ | ------ | 207 | | lake | sal | 208 | | lake | rad | 209 | | lake | temp | 210 | | roe | sal | 211 | | roe | rad | 212 | 213 | But remember: 214 | `DISTINCT` is applied to the values displayed in the chosen columns, 215 | not to the entire rows as they are being processed. 216 | 217 | ::::::::::::::::::::::::::::::::::::::::: callout 218 | 219 | ## Growing Queries 220 | 221 | What we have just done is how most people "grow" their SQL queries. 222 | We started with something simple that did part of what we wanted, 223 | then added more clauses one by one, 224 | testing their effects as we went. 225 | This is a good strategy --- in fact, 226 | for complex queries it's often the *only* strategy --- but 227 | it depends on quick turnaround, 228 | and on us recognizing the right answer when we get it. 229 | 230 | The best way to achieve a quick turnaround is often 231 | to put a subset of data in a temporary database 232 | and run our queries against that, 233 | or to fill a small database with synthesized records. 
234 | For example,
235 | instead of trying our queries against an actual database of 20 million Australians,
236 | we could run them against a sample of ten thousand,
237 | or write a small program to generate ten thousand random (but plausible) records
238 | and use that.
239 | 
240 | 
241 | ::::::::::::::::::::::::::::::::::::::::::::::::::
242 | 
243 | ::::::::::::::::::::::::::::::::::::::: challenge
244 | 
245 | ## Fix This Query
246 | 
247 | Suppose we want to select all sites that lie within 48 degrees of the equator.
248 | Our first query is:
249 | 
250 | ```sql
251 | SELECT * FROM Site WHERE (lat > -48) OR (lat < 48);
252 | ```
253 | 
254 | Explain why this is wrong,
255 | and rewrite the query so that it is correct.
256 | 
257 | ::::::::::::::: solution
258 | 
259 | ## Solution
260 | 
261 | Because we used `OR`, a site at the South Pole, for example, would still meet
262 | the second criterion and thus be included. Instead, we want to restrict this
263 | to sites that meet *both* criteria:
264 | 
265 | ```sql
266 | SELECT * FROM Site WHERE (lat > -48) AND (lat < 48);
267 | ```
268 | 
269 | :::::::::::::::::::::::::
270 | 
271 | ::::::::::::::::::::::::::::::::::::::::::::::::::
272 | 
273 | ::::::::::::::::::::::::::::::::::::::: challenge
274 | 
275 | ## Finding Outliers
276 | 
277 | Normalized salinity readings are supposed to be between 0.0 and 1.0.
278 | Write a query that selects all records from `Survey`
279 | with salinity values outside this range.
280 | 281 | ::::::::::::::: solution 282 | 283 | ## Solution 284 | 285 | ```sql 286 | SELECT * FROM Survey WHERE quant = 'sal' AND ((reading > 1.0) OR (reading < 0.0)); 287 | ``` 288 | 289 | | taken | person | quant | reading | 290 | | ------ | ------ | ---------- | ------- | 291 | | 752 | roe | sal | 41\.6 | 292 | | 837 | roe | sal | 22\.5 | 293 | 294 | ::::::::::::::::::::::::: 295 | 296 | :::::::::::::::::::::::::::::::::::::::::::::::::: 297 | 298 | ::::::::::::::::::::::::::::::::::::::: challenge 299 | 300 | ## Matching Patterns 301 | 302 | Which of these expressions are true? 303 | 304 | 1. `'a' LIKE 'a'` 305 | 2. `'a' LIKE '%a'` 306 | 3. `'beta' LIKE '%a'` 307 | 4. `'alpha' LIKE 'a%%'` 308 | 5. `'alpha' LIKE 'a%p%'` 309 | 310 | ::::::::::::::: solution 311 | 312 | ## Solution 313 | 314 | 1. True because these are the same character. 315 | 2. True because the wildcard can match *zero* or more characters. 316 | 3. True because the `%` matches `bet` and the `a` matches the `a`. 317 | 4. True because the first wildcard matches `lpha` and the second wildcard matches zero characters (or vice versa). 318 | 5. True because the first wildcard matches `l` and the second wildcard matches `ha`. 319 | 320 | 321 | 322 | ::::::::::::::::::::::::: 323 | 324 | :::::::::::::::::::::::::::::::::::::::::::::::::: 325 | 326 | :::::::::::::::::::::::::::::::::::::::: keypoints 327 | 328 | - Use WHERE to specify conditions that records must meet in order to be included in a query's results. 329 | - Use AND, OR, and NOT to combine tests. 330 | - Filtering is done on whole records, so conditions can use fields that are not actually displayed. 331 | - Write queries incrementally. 
332 | 333 | :::::::::::::::::::::::::::::::::::::::::::::::::: 334 | 335 | 336 | -------------------------------------------------------------------------------- /episodes/10-prog.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Programming with Databases - Python 3 | teaching: 20 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Write short programs that execute SQL queries. 10 | - Trace the execution of a program that contains an SQL query. 11 | - Explain why most database applications are written in a general-purpose language rather than in SQL. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I access databases from programs written in Python? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | To close, 22 | let's have a look at how to access a database from 23 | a general-purpose programming language like Python. 24 | Other languages use almost exactly the same model: 25 | library and function names may differ, 26 | but the concepts are the same. 27 | 28 | Here's a short Python program that selects latitudes and longitudes 29 | from an SQLite database stored in a file called `survey.db`: 30 | 31 | ```python 32 | import sqlite3 33 | 34 | connection = sqlite3.connect("survey.db") 35 | cursor = connection.cursor() 36 | cursor.execute("SELECT Site.lat, Site.long FROM Site;") 37 | results = cursor.fetchall() 38 | for r in results: 39 | print(r) 40 | cursor.close() 41 | connection.close() 42 | ``` 43 | 44 | ```output 45 | (-49.85, -128.57) 46 | (-47.15, -126.72) 47 | (-48.87, -123.4) 48 | ``` 49 | 50 | The program starts by importing the `sqlite3` library. 
51 | If we were connecting to MySQL, DB2, or some other database, 52 | we would import a different library, 53 | but all of them provide the same functions, 54 | so that the rest of our program does not have to change 55 | (at least, not much) 56 | if we switch from one database to another. 57 | 58 | Line 2 establishes a connection to the database. 59 | Since we're using SQLite, 60 | all we need to specify is the name of the database file. 61 | Other systems may require us to provide a username and password as well. 62 | Line 3 then uses this connection to create a [cursor](../learners/reference.md#cursor). 63 | Just like the cursor in an editor, 64 | its role is to keep track of where we are in the database. 65 | 66 | On line 4, we use that cursor to ask the database to execute a query for us. 67 | The query is written in SQL, 68 | and passed to `cursor.execute` as a string. 69 | It's our job to make sure that SQL is properly formatted; 70 | if it isn't, 71 | or if something goes wrong when it is being executed, 72 | the database will report an error. 73 | 74 | The database returns the results of the query to us 75 | in response to the `cursor.fetchall` call on line 5. 76 | This result is a list with one entry for each record in the result set; 77 | if we loop over that list (line 6) and print those list entries (line 7), 78 | we can see that each one is a tuple 79 | with one element for each field we asked for. 80 | 81 | Finally, lines 8 and 9 close our cursor and our connection, 82 | since the database can only keep a limited number of these open at one time. 83 | Since establishing a connection takes time, 84 | though, 85 | we shouldn't open a connection, 86 | do one operation, 87 | then close the connection, 88 | only to reopen it a few microseconds later to do another operation. 89 | Instead, 90 | it's normal to create one connection that stays open for the lifetime of the program. 
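The single-connection pattern described above can be sketched like this, using a throwaway in-memory copy of the `Site` table rather than `survey.db`, so it is safe to experiment with:

```python
import sqlite3

# One connection for the whole program, opened once and closed once.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Set up a small stand-in for the Site table from the lesson.
cursor.execute("CREATE TABLE Site (name text, lat real, long real)")
cursor.executemany("INSERT INTO Site VALUES (?, ?, ?)",
                   [("DR-1", -49.85, -128.57),
                    ("DR-3", -47.15, -126.72),
                    ("MSK-4", -48.87, -123.4)])

# The same cursor can then be reused for as many queries as we need.
cursor.execute("SELECT count(*) FROM Site;")
print(cursor.fetchone()[0])    # 3
cursor.execute("SELECT name FROM Site WHERE lat < -48;")
print(cursor.fetchall())       # [('DR-1',), ('MSK-4',)]

cursor.close()
connection.close()
```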
91 | 92 | Queries in real applications will often depend on values provided by users. 93 | For example, 94 | this function takes a user's ID as a parameter and returns their name: 95 | 96 | ```python 97 | import sqlite3 98 | 99 | def get_name(database_file, person_id): 100 | query = "SELECT personal || ' ' || family FROM Person WHERE id='" + person_id + "';" 101 | 102 | connection = sqlite3.connect(database_file) 103 | cursor = connection.cursor() 104 | cursor.execute(query) 105 | results = cursor.fetchall() 106 | cursor.close() 107 | connection.close() 108 | 109 | return results[0][0] 110 | 111 | print("Full name for dyer:", get_name('survey.db', 'dyer')) 112 | ``` 113 | 114 | ```output 115 | Full name for dyer: William Dyer 116 | ``` 117 | 118 | We use string concatenation on the first line of this function 119 | to construct a query containing the user ID we have been given. 120 | This seems simple enough, 121 | but what happens if someone gives us this string as input? 122 | 123 | ```source 124 | dyer'; DROP TABLE Survey; SELECT ' 125 | ``` 126 | 127 | It looks like there's garbage after the user's ID, 128 | but it is very carefully chosen garbage. 129 | If we insert this string into our query, 130 | the result is: 131 | 132 | ```sql 133 | SELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT ''; 134 | ``` 135 | 136 | If we execute this, 137 | it will erase one of the tables in our database. 138 | 139 | This is called an [SQL injection attack](../learners/reference.md#sql-injection-attack), 140 | and it has been used to attack thousands of programs over the years. 141 | In particular, 142 | many web sites that take data from users insert values directly into queries 143 | without checking them carefully first. 
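One nuance worth knowing: Python's `sqlite3` module happens to refuse more than one statement per `execute()` call, so this particular payload is rejected rather than run. That is a property of this one library (its `executescript()` method has no such guard, and many other database libraries will happily run the whole string), so it is no substitute for handling user input safely. A sketch with a throwaway in-memory database:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE Person (id text, personal text, family text)")

# The carefully chosen "garbage" from above, spliced in by string concatenation.
user_input = "dyer'; DROP TABLE Person; SELECT '"
query = "SELECT personal FROM Person WHERE id='" + user_input + "';"

try:
    cursor.execute(query)
except (sqlite3.Warning, sqlite3.ProgrammingError) as err:
    # Older Pythons raise sqlite3.Warning here, newer ones ProgrammingError;
    # either way the multi-statement string is rejected.
    print("rejected:", err)
```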
144 | 145 | Since a villain might try to smuggle commands into our queries in many different ways, 146 | the safest way to deal with this threat is 147 | to replace characters like quotes with their escaped equivalents, 148 | so that we can safely put whatever the user gives us inside a string. 149 | We can do this by using a [prepared statement](../learners/reference.md#prepared-statement) 150 | instead of formatting our statements as strings. 151 | Here's what our example program looks like if we do this: 152 | 153 | ```python 154 | import sqlite3 155 | 156 | def get_name(database_file, person_id): 157 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 158 | 159 | connection = sqlite3.connect(database_file) 160 | cursor = connection.cursor() 161 | cursor.execute(query, [person_id]) 162 | results = cursor.fetchall() 163 | cursor.close() 164 | connection.close() 165 | 166 | return results[0][0] 167 | 168 | print("Full name for dyer:", get_name('survey.db', 'dyer')) 169 | ``` 170 | 171 | ```output 172 | Full name for dyer: William Dyer 173 | ``` 174 | 175 | The key changes are in the query string and the `execute` call. 176 | Instead of formatting the query ourselves, 177 | we put question marks in the query template where we want to insert values. 178 | When we call `execute`, 179 | we provide a list 180 | that contains as many values as there are question marks in the query. 181 | The library matches values to question marks in order, 182 | and translates any special characters in the values 183 | into their escaped equivalents 184 | so that they are safe to use. 185 | 186 | We can also use `sqlite3`'s cursor to make changes to our database, 187 | such as inserting a new name. 
188 | For instance, we can define a new function called `add_name` like so: 189 | 190 | ```python 191 | import sqlite3 192 | 193 | def add_name(database_file, new_person): 194 | query = "INSERT INTO Person (id, personal, family) VALUES (?, ?, ?);" 195 | 196 | connection = sqlite3.connect(database_file) 197 | cursor = connection.cursor() 198 | cursor.execute(query, list(new_person)) 199 | cursor.close() 200 | connection.close() 201 | 202 | 203 | def get_name(database_file, person_id): 204 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 205 | 206 | connection = sqlite3.connect(database_file) 207 | cursor = connection.cursor() 208 | cursor.execute(query, [person_id]) 209 | results = cursor.fetchall() 210 | cursor.close() 211 | connection.close() 212 | 213 | return results[0][0] 214 | 215 | # Insert a new name 216 | add_name('survey.db', ('barrett', 'Mary', 'Barrett')) 217 | # Check it exists 218 | print("Full name for barrett:", get_name('survey.db', 'barrett')) 219 | ``` 220 | 221 | ```output 222 | IndexError: list index out of range 223 | ``` 224 | 225 | Note that in versions of sqlite3 >= 2.5, the `get_name` function described 226 | above will fail with an `IndexError: list index out of range`, 227 | even though we added Mary's 228 | entry into the table using `add_name`. 229 | This is because we must perform a `connection.commit()` before closing 230 | the connection, in order to save our changes to the database. 
231 | 232 | ```python 233 | import sqlite3 234 | 235 | def add_name(database_file, new_person): 236 | query = "INSERT INTO Person (id, personal, family) VALUES (?, ?, ?);" 237 | 238 | connection = sqlite3.connect(database_file) 239 | cursor = connection.cursor() 240 | cursor.execute(query, list(new_person)) 241 | cursor.close() 242 | connection.commit() 243 | connection.close() 244 | 245 | 246 | def get_name(database_file, person_id): 247 | query = "SELECT personal || ' ' || family FROM Person WHERE id=?;" 248 | 249 | connection = sqlite3.connect(database_file) 250 | cursor = connection.cursor() 251 | cursor.execute(query, [person_id]) 252 | results = cursor.fetchall() 253 | cursor.close() 254 | connection.close() 255 | 256 | return results[0][0] 257 | 258 | # Insert a new name 259 | add_name('survey.db', ('barrett', 'Mary', 'Barrett')) 260 | # Check it exists 261 | print("Full name for barrett:", get_name('survey.db', 'barrett')) 262 | ``` 263 | 264 | ```output 265 | Full name for barrett: Mary Barrett 266 | ``` 267 | 268 | ::::::::::::::::::::::::::::::::::::::: challenge 269 | 270 | ## Filling a Table vs. Printing Values 271 | 272 | Write a Python program that creates a new database in a file called 273 | `original.db` containing a single table called `Pressure`, with a 274 | single field called `reading`, and inserts 100,000 random numbers 275 | between 10.0 and 25.0. How long does it take this program to run? 276 | How long does it take to run a program that simply writes those 277 | random numbers to a file? 
278 | 279 | ::::::::::::::: solution 280 | 281 | ## Solution 282 | 283 | ```python 284 | import sqlite3 285 | # import random number generator 286 | from numpy.random import uniform 287 | 288 | random_numbers = uniform(low=10.0, high=25.0, size=100000) 289 | 290 | connection = sqlite3.connect("original.db") 291 | cursor = connection.cursor() 292 | cursor.execute("CREATE TABLE Pressure (reading float not null)") 293 | query = "INSERT INTO Pressure (reading) VALUES (?);" 294 | 295 | for number in random_numbers: 296 | cursor.execute(query, [number]) 297 | 298 | cursor.close() 299 | # save changes to file for next exercise 300 | connection.commit() 301 | connection.close() 302 | ``` 303 | 304 | For comparison, the following program writes the random numbers 305 | into the file `random_numbers.txt`: 306 | 307 | ```python 308 | from numpy.random import uniform 309 | 310 | random_numbers = uniform(low=10.0, high=25.0, size=100000) 311 | with open('random_numbers.txt', 'w') as outfile: 312 | for number in random_numbers: 313 | # need to add linebreak \n 314 | outfile.write("{}\n".format(number)) 315 | ``` 316 | 317 | ::::::::::::::::::::::::: 318 | 319 | :::::::::::::::::::::::::::::::::::::::::::::::::: 320 | 321 | ::::::::::::::::::::::::::::::::::::::: challenge 322 | 323 | ## Filtering in SQL vs. Filtering in Python 324 | 325 | Write a Python program that creates a new database called 326 | `backup.db` with the same structure as `original.db` and copies all 327 | the values greater than 20.0 from `original.db` to `backup.db`. 328 | Which is faster: filtering values in the query, or reading 329 | everything into memory and filtering in Python? 330 | 331 | ::::::::::::::: solution 332 | 333 | ## Solution 334 | 335 | The first example reads all the data into memory and filters the 336 | numbers using the if statement in Python. 
337 | 
338 | ```python
339 | import sqlite3
340 | 
341 | connection_original = sqlite3.connect("original.db")
342 | cursor_original = connection_original.cursor()
343 | cursor_original.execute("SELECT * FROM Pressure;")
344 | results = cursor_original.fetchall()
345 | cursor_original.close()
346 | connection_original.close()
347 | 
348 | connection_backup = sqlite3.connect("backup.db")
349 | cursor_backup = connection_backup.cursor()
350 | cursor_backup.execute("CREATE TABLE Pressure (reading float not null)")
351 | query = "INSERT INTO Pressure (reading) VALUES (?);"
352 | 
353 | for entry in results:
354 |     # number is saved in first column of the table
355 |     if entry[0] > 20.0:
356 |         cursor_backup.execute(query, entry)
357 | 
358 | cursor_backup.close()
359 | connection_backup.commit()
360 | connection_backup.close()
361 | ```
362 | 
363 | In contrast, the following example uses a `WHERE` clause in the `SELECT` statement
364 | to filter the numbers in SQL.
365 | The only lines that changed are line 5, where only the matching values are fetched
366 | from `original.db`, and the `for` loop starting on line 15, which inserts
367 | the numbers into `backup.db`.
368 | Note how this version does not require the use of Python's `if` statement.
369 | 370 | ```python 371 | import sqlite3 372 | 373 | connection_original = sqlite3.connect("original.db") 374 | cursor_original = connection_original.cursor() 375 | cursor_original.execute("SELECT * FROM Pressure WHERE reading > 20.0;") 376 | results = cursor_original.fetchall() 377 | cursor_original.close() 378 | connection_original.close() 379 | 380 | connection_backup = sqlite3.connect("backup.db") 381 | cursor_backup = connection_backup.cursor() 382 | cursor_backup.execute("CREATE TABLE Pressure (reading float not null)") 383 | query = "INSERT INTO Pressure (reading) VALUES (?);" 384 | 385 | for entry in results: 386 | cursor_backup.execute(query, entry) 387 | 388 | cursor_backup.close() 389 | connection_backup.commit() 390 | connection_backup.close() 391 | ``` 392 | 393 | ::::::::::::::::::::::::: 394 | 395 | :::::::::::::::::::::::::::::::::::::::::::::::::: 396 | 397 | ::::::::::::::::::::::::::::::::::::::: challenge 398 | 399 | ## Generating Insert Statements 400 | 401 | One of our colleagues has sent us a 402 | [CSV](../learners/reference.md#comma-separated-values-csv) 403 | file containing 404 | temperature readings by Robert Olmstead, which is formatted like 405 | this: 406 | 407 | ```output 408 | Taken,Temp 409 | 619,-21.5 410 | 622,-15.5 411 | ``` 412 | 413 | Write a small Python program that reads this file in and INSERTs these 414 | records into the survey database. 415 | Note: you will need to add an entry for Olmstead 416 | to the `Person` table. If you are testing your program repeatedly, 417 | you may want to investigate SQL's `INSERT or REPLACE` command. 418 | 419 | 420 | :::::::::::::::::::::::::::::::::::::::::::::::::: 421 | 422 | :::::::::::::::::::::::::::::::::::::::: keypoints 423 | 424 | - General-purpose languages have libraries for accessing databases. 425 | - To connect to a database, a program must use a library specific to that database manager. 426 | - These libraries use a connection-and-cursor model. 
427 | - Programs can read query results in batches or all at once. 428 | - Queries should be written using parameter substitution, not string formatting. 429 | 430 | :::::::::::::::::::::::::::::::::::::::::::::::::: 431 | 432 | 433 | -------------------------------------------------------------------------------- /episodes/01-select.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Selecting Data 3 | teaching: 10 4 | exercises: 5 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain the difference between a table, a record, and a field. 10 | - Explain the difference between a database and a database manager. 11 | - Write a query to select all values for specific fields from a single table. 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - How can I get data from a database? 18 | 19 | :::::::::::::::::::::::::::::::::::::::::::::::::: 20 | 21 | A [relational database](../learners/reference.md#relational-database) 22 | is a way to store and manipulate information. 23 | Databases are arranged as [tables](../learners/reference.md#table). 24 | Each table has columns (also known as [fields](../learners/reference.md#fields)) that describe the data, 25 | and rows (also known as [records](../learners/reference.md#record)) which contain the data. 26 | 27 | When we are using a spreadsheet, 28 | we put formulas into cells to calculate new values based on old ones. 29 | When we are using a database, 30 | we send commands 31 | (usually called [queries](../learners/reference.md#query\)) 32 | to a [database manager](../learners/reference.md#database-manager): 33 | a program that manipulates the database for us. 34 | The database manager does whatever lookups and calculations the query specifies, 35 | returning the results in a tabular form 36 | that we can then use as a starting point for further queries. 
37 | 38 | Queries are written in a language called [SQL](../learners/reference.md#sql), 39 | which stands for "Structured Query Language". 40 | SQL provides hundreds of different ways to analyze and recombine data. 41 | We will only look at a handful of queries, 42 | but that handful accounts for most of what scientists do. 43 | 44 | ::::::::::::::::::::::::::::::::::::::::: callout 45 | 46 | ## Changing Database Managers 47 | 48 | Many database managers --- Oracle, 49 | IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite --- understand 50 | SQL but each stores data in a different way, 51 | so a database created with one cannot be used directly by another. 52 | However, every database manager 53 | can import and export data in a variety of formats like .csv, SQL, 54 | so it *is* possible to move information from one to another. 55 | 56 | 57 | :::::::::::::::::::::::::::::::::::::::::::::::::: 58 | 59 | ::::::::::::::::::::::::::::::::::::::::: callout 60 | 61 | ## Getting Into and Out Of SQLite 62 | 63 | In order to use the SQLite commands *interactively*, we need to 64 | enter into the SQLite console. So, open up a terminal, and run 65 | 66 | ```bash 67 | $ cd /path/to/survey/data/ 68 | $ sqlite3 survey.db 69 | ``` 70 | 71 | The SQLite command is `sqlite3` and you are telling SQLite to open up 72 | the `survey.db`. You need to specify the `.db` file, otherwise SQLite 73 | will open up a temporary, empty database. 74 | 75 | To get out of SQLite, type out `.exit` or `.quit`. For some 76 | terminals, `Ctrl-D` can also work. If you forget any SQLite `.` (dot) 77 | command, type `.help`. 78 | 79 | 80 | :::::::::::::::::::::::::::::::::::::::::::::::::: 81 | 82 | Before we get into using SQLite to select the data, let's take a look at the tables of the database we will use in our examples: 83 | 84 |
85 | 86 |
87 | 88 | **Person**: People who took readings, `id` being the unique identifier for that person. 89 | 90 | | id | personal | family | 91 | | --------- | --------- | ---------- | 92 | | dyer | William | Dyer | 93 | | pb | Frank | Pabodie | 94 | | lake | Anderson | Lake | 95 | | roe | Valentina | Roerich | 96 | | danforth | Frank | Danforth | 97 | 98 | **Site**: Locations of the `sites` where readings were taken. 99 | 100 | | name | lat | long | 101 | | --------- | --------- | ---------- | 102 | | DR-1 | \-49.85 | \-128.57 | 103 | | DR-3 | \-47.15 | \-126.72 | 104 | | MSK-4 | \-48.87 | \-123.4 | 105 | 106 | **Visited**: Specific identification `id` of the precise locations where readings were taken at the sites and dates. 107 | 108 | | id | site | dated | 109 | | --------- | --------- | ---------- | 110 | | 619 | DR-1 | 1927-02-08 | 111 | | 622 | DR-1 | 1927-02-10 | 112 | | 734 | DR-3 | 1930-01-07 | 113 | | 735 | DR-3 | 1930-01-12 | 114 | | 751 | DR-3 | 1930-02-26 | 115 | | 752 | DR-3 | \-null- | 116 | | 837 | MSK-4 | 1932-01-14 | 117 | | 844 | DR-1 | 1932-03-22 | 118 | 119 |
120 | 121 |
122 | 123 | **Survey**: The measurements taken at each precise location on these sites. They are identified as `taken`. The field `quant` is short for quantity and indicates what is being measured.  The values are `rad`, `sal`, and `temp` referring to 'radiation', 'salinity' and 'temperature', respectively. 124 | 125 | | taken | person | quant | reading | 126 | | --------- | --------- | ---------- | ------- | 127 | | 619 | dyer | rad | 9\.82 | 128 | | 619 | dyer | sal | 0\.13 | 129 | | 622 | dyer | rad | 7\.8 | 130 | | 622 | dyer | sal | 0\.09 | 131 | | 734 | pb | rad | 8\.41 | 132 | | 734 | lake | sal | 0\.05 | 133 | | 734 | pb | temp | \-21.5 | 134 | | 735 | pb | rad | 7\.22 | 135 | | 735 | \-null- | sal | 0\.06 | 136 | | 735 | \-null- | temp | \-26.0 | 137 | | 751 | pb | rad | 4\.35 | 138 | | 751 | pb | temp | \-18.5 | 139 | | 751 | lake | sal | 0\.1 | 140 | | 752 | lake | rad | 2\.19 | 141 | | 752 | lake | sal | 0\.09 | 142 | | 752 | lake | temp | \-16.0 | 143 | | 752 | roe | sal | 41\.6 | 144 | | 837 | lake | rad | 1\.46 | 145 | | 837 | lake | sal | 0\.21 | 146 | | 837 | roe | sal | 22\.5 | 147 | | 844 | roe | rad | 11\.25 | 148 | 149 |
150 | 151 |
152 | 
153 | Notice that three entries --- one in the `Visited` table,
154 | and two in the `Survey` table --- don't contain any actual
155 | data, but instead have a special `-null-` entry:
156 | we'll return to these missing values
157 | [later](05-null.md).
158 | 
159 | ::::::::::::::::::::::::::::::::::::::::: callout
160 | 
161 | ## Checking If Data is Available
162 | 
163 | On the shell command line,
164 | change the working directory to the one where you saved `survey.db`.
165 | If you saved it on your Desktop, you should use
166 | 
167 | ```bash
168 | $ cd Desktop
169 | $ ls | grep survey.db
170 | ```
171 | 
172 | ```output
173 | survey.db
174 | ```
175 | 
176 | If you get the same output, you can run
177 | 
178 | ```bash
179 | $ sqlite3 survey.db
180 | ```
181 | 
182 | ```output
183 | SQLite version 3.8.8 2015-01-16 12:08:06
184 | Enter ".help" for usage hints.
185 | sqlite>
186 | ```
187 | 
188 | which instructs SQLite to load the database in the `survey.db` file.
189 | 
190 | For a list of useful system commands, enter `.help`.
191 | 
192 | All SQLite-specific commands are prefixed with a `.` to distinguish them from SQL commands.
193 | 
194 | Type `.tables` to list the tables in the database.
195 | 
196 | ```sql
197 | .tables
198 | ```
199 | 
200 | ```output
201 | Person Site Survey Visited
202 | ```
203 | 
204 | If you had the above tables, you might be curious what information was stored in each table.
205 | To get more information on the tables, type `.schema` to see the SQL statements used to create the tables in the database. The statements will have a list of the columns and the data types each column stores.
206 | 
207 | ```sql
208 | .schema
209 | ```
210 | 
211 | ```output
212 | CREATE TABLE Person (id text, personal text, family text);
213 | CREATE TABLE Site (name text, lat real, long real);
214 | CREATE TABLE Survey (taken integer, person text, quant text, reading real);
215 | CREATE TABLE Visited (id integer, site text, dated text);
216 | ```
217 | 
218 | The output is formatted as \<**columnName** *dataType*\>. Thus we can see from the first line that the table **Person** has three columns:
219 | 
220 | - **id** with type *text*
221 | - **personal** with type *text*
222 | - **family** with type *text*
223 | 
224 | Note: The available data types vary based on the database manager --- you can search online for what data types are supported.
225 | 
226 | You can change some SQLite settings to make the output easier to read.
227 | First,
228 | set the output mode to display left-aligned columns.
229 | Then turn on the display of column headers.
230 | 
231 | ```sql
232 | .mode column
233 | .header on
234 | ```
235 | 
236 | Alternatively, you can get the settings automatically by creating a `.sqliterc` file.
237 | Add the commands above and reopen SQLite.
238 | For Windows, save it as `C:\Users\<username>\.sqliterc`.
239 | For Linux/macOS, save it as `.sqliterc` in your home directory (e.g. `/home/<username>/` or `/Users/<username>/`).
240 | 
241 | To exit SQLite and return to the shell command line,
242 | you can use either `.quit` or `.exit`.
243 | 
244 | 
245 | ::::::::::::::::::::::::::::::::::::::::::::::::::
246 | 
247 | For now,
248 | let's write an SQL query that displays scientists' names.
249 | We do this using the SQL command `SELECT`,
250 | giving it the names of the columns we want and the table we want them from.
251 | Our query and its output look like this:
252 | 
253 | ```sql
254 | SELECT family, personal FROM Person;
255 | ```
256 | 
257 | | family | personal |
258 | | --------- | --------- |
259 | | Dyer | William |
260 | | Pabodie | Frank |
261 | | Lake | Anderson |
262 | | Roerich | Valentina |
263 | | Danforth | Frank |
264 | 
265 | The semicolon at the end of the query
266 | tells the database manager that the query is complete and ready to run.
267 | We have written our commands in upper case and the names for the table and columns
268 | in lower case,
269 | but we don't have to:
270 | as the example below shows,
271 | SQL is [case insensitive](../learners/reference.md#case-insensitive).
272 | 
273 | ```sql
274 | SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;
275 | ```
276 | 
277 | | family | personal |
278 | | --------- | --------- |
279 | | Dyer | William |
280 | | Pabodie | Frank |
281 | | Lake | Anderson |
282 | | Roerich | Valentina |
283 | | Danforth | Frank |
284 | 
285 | You can use SQL's case insensitivity
286 | to distinguish between different parts of an SQL statement.
287 | In this lesson, we use the convention of using UPPER CASE for SQL keywords
288 | (such as `SELECT` and `FROM`),
289 | Title Case for table names, and lower case for field names.
290 | Whatever casing
291 | convention you choose, please be consistent: complex queries are hard
292 | enough to read without the extra cognitive load of random
293 | capitalization.
294 | 
295 | While we are on the topic of syntax, one aspect of SQL
296 | that can frustrate novices and experts alike is forgetting to finish a
297 | command with `;` (semicolon). When you press enter for a command
298 | without adding the `;` to the end, it can look something like this:
299 | 
300 | ```sql
301 | SELECT id FROM Person
302 | ...>
303 | ...>
304 | ```
305 | 
306 | This is SQLite's continuation prompt, where it is waiting for additional commands or
307 | for a `;` to tell SQLite that the command is finished. This is easy to fix!
Just type
308 | `;` and press enter!
309 | 
310 | Now, going back to our query,
311 | it's important to understand that
312 | the rows and columns in a database table aren't actually stored in any particular order.
313 | They will always be *displayed* in some order,
314 | but we can control that in various ways.
315 | For example,
316 | we could swap the columns in the output by writing our query as:
317 | 
318 | ```sql
319 | SELECT personal, family FROM Person;
320 | ```
321 | 
322 | | personal | family |
323 | | --------- | --------- |
324 | | William | Dyer |
325 | | Frank | Pabodie |
326 | | Anderson | Lake |
327 | | Valentina | Roerich |
328 | | Frank | Danforth |
329 | 
330 | or even repeat columns:
331 | 
332 | ```sql
333 | SELECT id, id, id FROM Person;
334 | ```
335 | 
336 | | id | id | id |
337 | | --------- | --------- | ---------- |
338 | | dyer | dyer | dyer |
339 | | pb | pb | pb |
340 | | lake | lake | lake |
341 | | roe | roe | roe |
342 | | danforth | danforth | danforth |
343 | 
344 | As a shortcut,
345 | we can select all of the columns in a table using `*`:
346 | 
347 | ```sql
348 | SELECT * FROM Person;
349 | ```
350 | 
351 | | id | personal | family |
352 | | --------- | --------- | ---------- |
353 | | dyer | William | Dyer |
354 | | pb | Frank | Pabodie |
355 | | lake | Anderson | Lake |
356 | | roe | Valentina | Roerich |
357 | | danforth | Frank | Danforth |
358 | 
359 | ::::::::::::::::::::::::::::::::::::::: challenge
360 | 
361 | ## Understanding CREATE statements
362 | 
363 | Use the `.schema` command to identify the columns that contain integers.
364 | 
365 | ::::::::::::::: solution
366 | 
367 | ## Solution
368 | 
369 | ```sql
370 | .schema
371 | ```
372 | 
373 | ```output
374 | CREATE TABLE Person (id text, personal text, family text);
375 | CREATE TABLE Site (name text, lat real, long real);
376 | CREATE TABLE Survey (taken integer, person text, quant text, reading real);
377 | CREATE TABLE Visited (id integer, site text, dated text);
378 | ```
379 | 
380 | From the output, we see that the **taken** column in the **Survey** table (3rd line) and the **id** column in the **Visited** table (4th line) are composed of integers.
381 | 
382 | 
383 | 
384 | :::::::::::::::::::::::::
385 | 
386 | ::::::::::::::::::::::::::::::::::::::::::::::::::
387 | 
388 | ::::::::::::::::::::::::::::::::::::::: challenge
389 | 
390 | ## Selecting Site Names
391 | 
392 | Write a query that selects only the `name` column from the `Site` table.
393 | 
394 | ::::::::::::::: solution
395 | 
396 | ## Solution
397 | 
398 | ```sql
399 | SELECT name FROM Site;
400 | ```
401 | 
402 | | name |
403 | | --------- |
404 | | DR-1 |
405 | | DR-3 |
406 | | MSK-4 |
407 | 
408 | :::::::::::::::::::::::::
409 | 
410 | ::::::::::::::::::::::::::::::::::::::::::::::::::
411 | 
412 | ::::::::::::::::::::::::::::::::::::::: challenge
413 | 
414 | ## Query Style
415 | 
416 | Many people format queries as:
417 | 
418 | ```sql
419 | SELECT personal, family FROM person;
420 | ```
421 | 
422 | or as:
423 | 
424 | ```sql
425 | select Personal, Family from PERSON;
426 | ```
427 | 
428 | What style do you find easiest to read, and why?
429 | 
430 | 
431 | ::::::::::::::::::::::::::::::::::::::::::::::::::
432 | 
433 | :::::::::::::::::::::::::::::::::::::::: keypoints
434 | 
435 | - A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
436 | - A database manager is a program that manipulates information stored in a database.
437 | - We write queries in a specialized language called SQL to extract information from databases.
438 | - Use SELECT... FROM...
to get values from a database table. 439 | - SQL is case-insensitive (but data is case-sensitive). 440 | 441 | :::::::::::::::::::::::::::::::::::::::::::::::::: 442 | 443 | 444 | -------------------------------------------------------------------------------- /episodes/07-join.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Combining Data 3 | teaching: 20 4 | exercises: 20 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Explain the operation of a query that joins two tables. 10 | - Explain how to restrict the output of a query containing a join to only include meaningful combinations of values. 11 | - Write queries that join tables on equal keys. 12 | - Explain what primary and foreign keys are, and why they are useful. 13 | 14 | :::::::::::::::::::::::::::::::::::::::::::::::::: 15 | 16 | :::::::::::::::::::::::::::::::::::::::: questions 17 | 18 | - How can I combine data from multiple tables? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | In order to submit our data to a web site 23 | that aggregates historical meteorological data, 24 | we might need to format it as 25 | latitude, longitude, date, quantity, and reading. 26 | However, 27 | our latitudes and longitudes are in the `Site` table, 28 | while the dates of measurements are in the `Visited` table 29 | and the readings themselves are in the `Survey` table. 30 | We need to combine these tables somehow. 31 | 32 | This figure shows the relations between the tables: 33 | 34 | ![](fig/sql-join-structure.svg){alt='Survey Database Structure'} 35 | 36 | The SQL command to do this is `JOIN`. 
37 | To see how it works, 38 | let's start by joining the `Site` and `Visited` tables: 39 | 40 | ```sql 41 | SELECT * FROM Site JOIN Visited; 42 | ``` 43 | 44 | | name | lat | long | id | site | dated | 45 | | ------- | -------- | ---------- | --------- | ------- | ---------- | 46 | | DR-1 | \-49.85 | \-128.57 | 619 | DR-1 | 1927-02-08 | 47 | | DR-1 | \-49.85 | \-128.57 | 622 | DR-1 | 1927-02-10 | 48 | | DR-1 | \-49.85 | \-128.57 | 734 | DR-3 | 1930-01-07 | 49 | | DR-1 | \-49.85 | \-128.57 | 735 | DR-3 | 1930-01-12 | 50 | | DR-1 | \-49.85 | \-128.57 | 751 | DR-3 | 1930-02-26 | 51 | | DR-1 | \-49.85 | \-128.57 | 752 | DR-3 | \-null- | 52 | | DR-1 | \-49.85 | \-128.57 | 837 | MSK-4 | 1932-01-14 | 53 | | DR-1 | \-49.85 | \-128.57 | 844 | DR-1 | 1932-03-22 | 54 | | DR-3 | \-47.15 | \-126.72 | 619 | DR-1 | 1927-02-08 | 55 | | DR-3 | \-47.15 | \-126.72 | 622 | DR-1 | 1927-02-10 | 56 | | DR-3 | \-47.15 | \-126.72 | 734 | DR-3 | 1930-01-07 | 57 | | DR-3 | \-47.15 | \-126.72 | 735 | DR-3 | 1930-01-12 | 58 | | DR-3 | \-47.15 | \-126.72 | 751 | DR-3 | 1930-02-26 | 59 | | DR-3 | \-47.15 | \-126.72 | 752 | DR-3 | \-null- | 60 | | DR-3 | \-47.15 | \-126.72 | 837 | MSK-4 | 1932-01-14 | 61 | | DR-3 | \-47.15 | \-126.72 | 844 | DR-1 | 1932-03-22 | 62 | | MSK-4 | \-48.87 | \-123.4 | 619 | DR-1 | 1927-02-08 | 63 | | MSK-4 | \-48.87 | \-123.4 | 622 | DR-1 | 1927-02-10 | 64 | | MSK-4 | \-48.87 | \-123.4 | 734 | DR-3 | 1930-01-07 | 65 | | MSK-4 | \-48.87 | \-123.4 | 735 | DR-3 | 1930-01-12 | 66 | | MSK-4 | \-48.87 | \-123.4 | 751 | DR-3 | 1930-02-26 | 67 | | MSK-4 | \-48.87 | \-123.4 | 752 | DR-3 | \-null- | 68 | | MSK-4 | \-48.87 | \-123.4 | 837 | MSK-4 | 1932-01-14 | 69 | | MSK-4 | \-48.87 | \-123.4 | 844 | DR-1 | 1932-03-22 | 70 | 71 | `JOIN` creates 72 | the [cross product](../learners/reference.md#cross-product) 73 | of two tables, 74 | i.e., 75 | it joins each record of one table with each record of the other table 76 | to give all possible combinations. 
77 | Since there are three records in `Site` 78 | and eight in `Visited`, 79 | the join's output has 24 records (3 \* 8 = 24) . 80 | And since each table has three fields, 81 | the output has six fields (3 + 3 = 6). 82 | 83 | What the join *hasn't* done is 84 | figure out if the records being joined have anything to do with each other. 85 | It has no way of knowing whether they do or not until we tell it how. 86 | To do that, 87 | we add a clause specifying that 88 | we're only interested in combinations that have the same site name, 89 | thus we need to use a filter: 90 | 91 | ```sql 92 | SELECT 93 | * 94 | FROM 95 | Site 96 | JOIN Visited ON Site.name = Visited.site; 97 | ``` 98 | 99 | | name | lat | long | id | site | dated | 100 | | ------- | -------- | ---------- | --------- | ------- | ---------- | 101 | | DR-1 | \-49.85 | \-128.57 | 619 | DR-1 | 1927-02-08 | 102 | | DR-1 | \-49.85 | \-128.57 | 622 | DR-1 | 1927-02-10 | 103 | | DR-1 | \-49.85 | \-128.57 | 844 | DR-1 | 1932-03-22 | 104 | | DR-3 | \-47.15 | \-126.72 | 734 | DR-3 | 1930-01-07 | 105 | | DR-3 | \-47.15 | \-126.72 | 735 | DR-3 | 1930-01-12 | 106 | | DR-3 | \-47.15 | \-126.72 | 751 | DR-3 | 1930-02-26 | 107 | | DR-3 | \-47.15 | \-126.72 | 752 | DR-3 | \-null- | 108 | | MSK-4 | \-48.87 | \-123.4 | 837 | MSK-4 | 1932-01-14 | 109 | 110 | `ON` is very similar to `WHERE`, 111 | and for all the queries in this lesson you can use them interchangeably. 112 | There are differences in how they affect [outer joins][outer], 113 | but that's beyond the scope of this lesson. 114 | Once we add this to our query, 115 | the database manager throws away records 116 | that combined information about two different sites, 117 | leaving us with just the ones we want. 118 | 119 | Notice that we used `Table.field` to specify field names 120 | in the output of the join. 121 | We do this because tables can have fields with the same name, 122 | and we need to be specific which ones we're talking about. 
123 | For example, 124 | if we joined the `Person` and `Visited` tables, 125 | the result would inherit a field called `id` 126 | from each of the original tables. 127 | 128 | We can now use the same dotted notation 129 | to select the three columns we actually want 130 | out of our join: 131 | 132 | ```sql 133 | SELECT 134 | Site.lat, 135 | Site.long, 136 | Visited.dated 137 | FROM 138 | Site 139 | JOIN Visited ON Site.name = Visited.site; 140 | ``` 141 | 142 | | lat | long | dated | 143 | | ------- | -------- | ---------- | 144 | | \-49.85 | \-128.57 | 1927-02-08 | 145 | | \-49.85 | \-128.57 | 1927-02-10 | 146 | | \-49.85 | \-128.57 | 1932-03-22 | 147 | | \-47.15 | \-126.72 | \-null- | 148 | | \-47.15 | \-126.72 | 1930-01-12 | 149 | | \-47.15 | \-126.72 | 1930-02-26 | 150 | | \-47.15 | \-126.72 | 1930-01-07 | 151 | | \-48.87 | \-123.4 | 1932-01-14 | 152 | 153 | If joining two tables is good, 154 | joining many tables must be better. 155 | In fact, 156 | we can join any number of tables 157 | simply by adding more `JOIN` clauses to our query, 158 | and more `ON` tests to filter out combinations of records 159 | that don't make sense: 160 | 161 | ```sql 162 | SELECT 163 | Site.lat, 164 | Site.long, 165 | Visited.dated, 166 | Survey.quant, 167 | Survey.reading 168 | FROM 169 | Site 170 | JOIN Visited 171 | JOIN Survey ON Site.name = Visited.site 172 | AND Visited.id = Survey.taken 173 | AND Visited.dated IS NOT NULL; 174 | ``` 175 | 176 | | lat | long | dated | quant | reading | 177 | | ------- | -------- | ---------- | --------- | ------- | 178 | | \-49.85 | \-128.57 | 1927-02-08 | rad | 9\.82 | 179 | | \-49.85 | \-128.57 | 1927-02-08 | sal | 0\.13 | 180 | | \-49.85 | \-128.57 | 1927-02-10 | rad | 7\.8 | 181 | | \-49.85 | \-128.57 | 1927-02-10 | sal | 0\.09 | 182 | | \-47.15 | \-126.72 | 1930-01-07 | rad | 8\.41 | 183 | | \-47.15 | \-126.72 | 1930-01-07 | sal | 0\.05 | 184 | | \-47.15 | \-126.72 | 1930-01-07 | temp | \-21.5 | 185 | | \-47.15 | \-126.72 | 1930-01-12 
| rad | 7\.22 | 186 | | \-47.15 | \-126.72 | 1930-01-12 | sal | 0\.06 | 187 | | \-47.15 | \-126.72 | 1930-01-12 | temp | \-26.0 | 188 | | \-47.15 | \-126.72 | 1930-02-26 | rad | 4\.35 | 189 | | \-47.15 | \-126.72 | 1930-02-26 | sal | 0\.1 | 190 | | \-47.15 | \-126.72 | 1930-02-26 | temp | \-18.5 | 191 | | \-48.87 | \-123.4 | 1932-01-14 | rad | 1\.46 | 192 | | \-48.87 | \-123.4 | 1932-01-14 | sal | 0\.21 | 193 | | \-48.87 | \-123.4 | 1932-01-14 | sal | 22\.5 | 194 | | \-49.85 | \-128.57 | 1932-03-22 | rad | 11\.25 | 195 | 196 | We can tell which records from `Site`, `Visited`, and `Survey` 197 | correspond with each other 198 | because those tables contain 199 | [primary keys](../learners/reference.md#primary-key) 200 | and [foreign keys](../learners/reference.md#foreign-key). 201 | A primary key is a value, 202 | or combination of values, 203 | that uniquely identifies each record in a table. 204 | A foreign key is a value (or combination of values) from one table 205 | that identifies a unique record in another table. 206 | Another way of saying this is that 207 | a foreign key is the primary key of one table 208 | that appears in some other table. 209 | In our database, 210 | `Person.id` is the primary key in the `Person` table, 211 | while `Survey.person` is a foreign key 212 | relating the `Survey` table's entries 213 | to entries in `Person`. 214 | 215 | Most database designers believe that 216 | every table should have a well-defined primary key. 217 | They also believe that this key should be separate from the data itself, 218 | so that if we ever need to change the data, 219 | we only need to make one change in one place. 220 | One easy way to do this is 221 | to create an arbitrary, unique ID for each record 222 | as we add it to the database. 
223 | This is actually very common: 224 | those IDs have names like "student numbers" and "patient numbers", 225 | and they almost always turn out to have originally been 226 | a unique record identifier in some database system or other. 227 | As the query below demonstrates, 228 | SQLite [automatically numbers records][rowid] as they're added to tables, 229 | and we can use those record numbers in queries: 230 | 231 | ```sql 232 | SELECT rowid, * FROM Person; 233 | ``` 234 | 235 | | rowid | id | personal | family | 236 | | ------- | -------- | ---------- | --------- | 237 | | 1 | dyer | William | Dyer | 238 | | 2 | pb | Frank | Pabodie | 239 | | 3 | lake | Anderson | Lake | 240 | | 4 | roe | Valentina | Roerich | 241 | | 5 | danforth | Frank | Danforth | 242 | 243 | ::::::::::::::::::::::::::::::::::::::: challenge 244 | 245 | ## Listing Radiation Readings 246 | 247 | Write a query that lists all radiation readings from the DR-1 site. 248 | 249 | ::::::::::::::: solution 250 | 251 | ## Solution 252 | 253 | ```sql 254 | SELECT 255 | Survey.reading 256 | FROM 257 | Site 258 | JOIN 259 | Visited 260 | JOIN 261 | Survey 262 | ON Site.name = Visited.site 263 | AND Visited.id = Survey.taken 264 | WHERE 265 | Site.name = 'DR-1' 266 | AND Survey.quant = 'rad'; 267 | ``` 268 | 269 | | reading | 270 | | ------- | 271 | | 9\.82 | 272 | | 7\.8 | 273 | | 11\.25 | 274 | 275 | ::::::::::::::::::::::::: 276 | 277 | :::::::::::::::::::::::::::::::::::::::::::::::::: 278 | 279 | ::::::::::::::::::::::::::::::::::::::: challenge 280 | 281 | ## Where's Frank? 282 | 283 | Write a query that lists all sites visited by people named "Frank". 
284 | 285 | ::::::::::::::: solution 286 | 287 | ## Solution 288 | 289 | ```sql 290 | SELECT 291 | DISTINCT Site.name 292 | FROM 293 | Site 294 | JOIN Visited 295 | JOIN Survey 296 | JOIN Person ON Site.name = Visited.site 297 | AND Visited.id = Survey.taken 298 | AND Survey.person = Person.id 299 | WHERE 300 | Person.personal = 'Frank'; 301 | ``` 302 | 303 | | name | 304 | | ------- | 305 | | DR-3 | 306 | 307 | ::::::::::::::::::::::::: 308 | 309 | :::::::::::::::::::::::::::::::::::::::::::::::::: 310 | 311 | ::::::::::::::::::::::::::::::::::::::: challenge 312 | 313 | ## Reading Queries 314 | 315 | Describe in your own words what the following query produces: 316 | 317 | ```sql 318 | SELECT Site.name FROM Site JOIN Visited 319 | ON Site.lat < -49.0 AND Site.name = Visited.site AND Visited.dated >= '1932-01-01'; 320 | ``` 321 | 322 | :::::::::::::::::::::::::::::::::::::::::::::::::: 323 | 324 | ::::::::::::::::::::::::::::::::::::::: challenge 325 | 326 | ## Who Has Been Where? 327 | 328 | Write a query that shows each site with exact location (lat, long) ordered by visited date, 329 | followed by personal name and family name of the person who visited the site 330 | and the type of measurement taken and its reading. Please avoid all null values. 331 | Tip: you should get 15 records with 8 fields. 
332 | 333 | ::::::::::::::: solution 334 | 335 | ## Solution 336 | 337 | ```sql 338 | SELECT Site.name, Site.lat, Site.long, Person.personal, Person.family, Survey.quant, Survey.reading, Visited.dated 339 | FROM 340 | Site 341 | JOIN 342 | Visited 343 | JOIN 344 | Survey 345 | JOIN 346 | Person 347 | ON Site.name = Visited.site 348 | AND Visited.id = Survey.taken 349 | AND Survey.person = Person.id 350 | WHERE 351 | Survey.person IS NOT NULL 352 | AND Visited.dated IS NOT NULL 353 | ORDER BY 354 | Visited.dated; 355 | ``` 356 | 357 | | name | lat | long | personal | family | quant | reading | dated | 358 | | ------- | -------- | ---------- | --------- | ------- | ---------- | ------- | ---------- | 359 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | rad | 9\.82 | 1927-02-08 | 360 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | sal | 0\.13 | 1927-02-08 | 361 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | rad | 7\.8 | 1927-02-10 | 362 | | DR-1 | \-49.85 | \-128.57 | William | Dyer | sal | 0\.09 | 1927-02-10 | 363 | | DR-3 | \-47.15 | \-126.72 | Anderson | Lake | sal | 0\.05 | 1930-01-07 | 364 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 8\.41 | 1930-01-07 | 365 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | temp | \-21.5 | 1930-01-07 | 366 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 7\.22 | 1930-01-12 | 367 | | DR-3 | \-47.15 | \-126.72 | Anderson | Lake | sal | 0\.1 | 1930-02-26 | 368 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | rad | 4\.35 | 1930-02-26 | 369 | | DR-3 | \-47.15 | \-126.72 | Frank | Pabodie | temp | \-18.5 | 1930-02-26 | 370 | | MSK-4 | \-48.87 | \-123.4 | Anderson | Lake | rad | 1\.46 | 1932-01-14 | 371 | | MSK-4 | \-48.87 | \-123.4 | Anderson | Lake | sal | 0\.21 | 1932-01-14 | 372 | | MSK-4 | \-48.87 | \-123.4 | Valentina | Roerich | sal | 22\.5 | 1932-01-14 | 373 | | DR-1 | \-49.85 | \-128.57 | Valentina | Roerich | rad | 11\.25 | 1932-03-22 | 374 | 375 | ::::::::::::::::::::::::: 376 | 377 | 
:::::::::::::::::::::::::::::::::::::::::::::::::: 378 | 379 | A good visual explanation of joins can be [found here][joinref] 380 | 381 | [outer]: https://en.wikipedia.org/wiki/Join_%28SQL%29#Outer_join 382 | [rowid]: https://www.sqlite.org/lang_createtable.html#rowid 383 | [joinref]: https://sql-joins.leopard.in.ua/ 384 | 385 | 386 | :::::::::::::::::::::::::::::::::::::::: keypoints 387 | 388 | - Use JOIN to combine data from two tables. 389 | - Use table.field notation to refer to fields when doing joins. 390 | - Every fact should be represented in a database exactly once. 391 | - A join produces all combinations of records from one table with records from another. 392 | - A primary key is a field (or set of fields) whose values uniquely identify the records in a table. 393 | - A foreign key is a field (or set of fields) in one table whose values are a primary key in another table. 394 | - We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables. 395 | - The most common join condition is matching keys. 396 | 397 | :::::::::::::::::::::::::::::::::::::::::::::::::: 398 | 399 | 400 | -------------------------------------------------------------------------------- /episodes/06-agg.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Aggregation 3 | teaching: 10 4 | exercises: 10 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Define aggregation and give examples of its use. 10 | - Write queries that compute aggregated values. 11 | - Trace the execution of a query that performs aggregation. 12 | - Explain how missing data is handled during aggregation. 13 | 14 | :::::::::::::::::::::::::::::::::::::::::::::::::: 15 | 16 | :::::::::::::::::::::::::::::::::::::::: questions 17 | 18 | - How can I calculate sums, averages, and other summary values? 
19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | We now want to calculate ranges and averages for our data. 23 | We know how to select all of the dates from the `Visited` table: 24 | 25 | ```sql 26 | SELECT dated FROM Visited; 27 | ``` 28 | 29 | | dated | 30 | | ---------------- | 31 | | 1927-02-08 | 32 | | 1927-02-10 | 33 | | 1930-01-07 | 34 | | 1930-01-12 | 35 | | 1930-02-26 | 36 | | \-null- | 37 | | 1932-01-14 | 38 | | 1932-03-22 | 39 | 40 | but to combine them, 41 | we must use an [aggregation function](../learners/reference.md#aggregation-function) 42 | such as `min` or `max`. 43 | Each of these functions takes a set of records as input, 44 | and produces a single record as output: 45 | 46 | ```sql 47 | SELECT min(dated) FROM Visited; 48 | ``` 49 | 50 | | min(dated) | 51 | | ---------------- | 52 | | 1927-02-08 | 53 | 54 | ![](fig/sql-aggregation.svg){alt='SQL Aggregation'} 55 | 56 | ```sql 57 | SELECT max(dated) FROM Visited; 58 | ``` 59 | 60 | | max(dated) | 61 | | ---------------- | 62 | | 1932-03-22 | 63 | 64 | `min` and `max` are just two of 65 | the aggregation functions built into SQL. 66 | Three others are `avg`, 67 | `count`, 68 | and `sum`: 69 | 70 | ```sql 71 | SELECT avg(reading) FROM Survey WHERE quant = 'sal'; 72 | ``` 73 | 74 | | avg(reading) | 75 | | ---------------- | 76 | | 7\.20333333333333 | 77 | 78 | ```sql 79 | SELECT count(reading) FROM Survey WHERE quant = 'sal'; 80 | ``` 81 | 82 | | count(reading) | 83 | | ---------------- | 84 | | 9 | 85 | 86 | ```sql 87 | SELECT sum(reading) FROM Survey WHERE quant = 'sal'; 88 | ``` 89 | 90 | | sum(reading) | 91 | | ---------------- | 92 | | 64\.83 | 93 | 94 | We used `count(reading)` here, 95 | but could have used `count(*)`, 96 | since the function doesn't care about the values themselves, 97 | just how many rows there are. 
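For instance, counting all rows and counting the `reading` column give the same answer for the salinity measurements, since none of those readings is missing (assuming the same `survey.db` used throughout this lesson):

```sql
SELECT count(*), count(reading) FROM Survey WHERE quant = 'sal';
```

| count(\*) | count(reading) |
| ---------------- | -------------- |
| 9 | 9 |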
98 | 99 | Even a column other than `reading` could be used, 100 | but note that any `NULL` value will not be counted 101 | (to see, try `count`ing the `person` column, which contains a 102 | row with a `NULL`). 103 | This perhaps non-obvious behavior 104 | of aggregation functions is covered later 105 | in this episode. 106 | 107 | SQL lets us do several aggregations at once. 108 | We can, 109 | for example, 110 | find the range of sensible salinity measurements: 111 | 112 | ```sql 113 | SELECT min(reading), max(reading) FROM Survey WHERE quant = 'sal' AND reading <= 1.0; 114 | ``` 115 | 116 | | min(reading) | max(reading) | 117 | | ---------------- | -------------- | 118 | | 0\.05 | 0\.21 | 119 | 120 | We can also combine aggregated results with raw results, 121 | although the output might surprise you: 122 | 123 | ```sql 124 | SELECT person, count(*) FROM Survey WHERE quant = 'sal' AND reading <= 1.0; 125 | ``` 126 | 127 | | person | count(\*) | 128 | | ---------------- | -------------- | 129 | | lake | 7 | 130 | 131 | Why does Lake's name appear rather than Roerich's or Dyer's? 132 | The answer is that when it has to aggregate a field, 133 | but isn't told how to, 134 | the database manager chooses an actual value from the input set. 135 | It might use the first one processed, 136 | the last one, 137 | or something else entirely. 
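The same arbitrary choice happens with any table. For example (again with this lesson's `survey.db`), the query below reports the count of 8 visits alongside one of the eight dates, but which date appears is up to the database manager:

```sql
SELECT dated, count(*) FROM Visited;
```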
138 | 139 | Another important fact is that when there are no values to aggregate --- 140 | for example, where there are no rows satisfying the `WHERE` clause --- 141 | aggregation's result is "don't know" 142 | rather than zero or some other arbitrary value: 143 | 144 | ```sql 145 | SELECT person, max(reading), sum(reading) FROM Survey WHERE quant = 'missing'; 146 | ``` 147 | 148 | | person | max(reading) | sum(reading) | 149 | | ---------------- | -------------- | ---------------------- | 150 | | \-null- | \-null- | \-null- | 151 | 152 | One final important feature of aggregation functions is that 153 | they are inconsistent with the rest of SQL in a very useful way. 154 | If we add two values, 155 | and one of them is null, 156 | the result is null. 157 | By extension, 158 | if we use `sum` to add all the values in a set, 159 | and any of those values are null, 160 | the result should also be null. 161 | It's much more useful, 162 | though, 163 | for aggregation functions to ignore null values 164 | and only combine those that are non-null. 165 | This behavior lets us write our queries as: 166 | 167 | ```sql 168 | SELECT min(dated) FROM Visited; 169 | ``` 170 | 171 | | min(dated) | 172 | | ---------------- | 173 | | 1927-02-08 | 174 | 175 | instead of always having to filter explicitly: 176 | 177 | ```sql 178 | SELECT min(dated) FROM Visited WHERE dated IS NOT NULL; 179 | ``` 180 | 181 | | min(dated) | 182 | | ---------------- | 183 | | 1927-02-08 | 184 | 185 | Aggregating all records at once doesn't always make sense. 186 | For example, 187 | suppose we suspect that there is a systematic bias in our data, 188 | and that some scientists' radiation readings are higher than others. 
We know that this doesn't work:
190 | 
191 | ```sql
192 | SELECT person, count(reading), round(avg(reading), 2)
193 | FROM Survey
194 | WHERE quant = 'rad';
195 | ```
196 | 
197 | | person | count(reading) | round(avg(reading), 2) |
198 | | ---------------- | -------------- | ---------------------- |
199 | | roe | 8 | 6\.56 |
200 | 
201 | because the database manager selects a single arbitrary scientist's name
202 | rather than aggregating separately for each scientist.
203 | Since there are only five scientists,
204 | we could write five queries of the form:
205 | 
206 | ```sql
207 | SELECT person, count(reading), round(avg(reading), 2)
208 | FROM Survey
209 | WHERE quant = 'rad'
210 | AND person = 'dyer';
211 | ```
212 | 
213 | | person | count(reading) | round(avg(reading), 2) |
214 | | ---------------- | -------------- | ---------------------- |
215 | | dyer | 2 | 8\.81 |
216 | 
217 | but this would be tedious,
218 | and if we ever had a data set with fifty or five hundred scientists,
219 | the chances of us getting all of those queries right are small.
220 | 
221 | What we need to do is
222 | tell the database manager to aggregate the readings for each scientist separately
223 | using a `GROUP BY` clause:
224 | 
225 | ```sql
226 | SELECT person, count(reading), round(avg(reading), 2)
227 | FROM Survey
228 | WHERE quant = 'rad'
229 | GROUP BY person;
230 | ```
231 | 
232 | | person | count(reading) | round(avg(reading), 2) |
233 | | ---------------- | -------------- | ---------------------- |
234 | | dyer | 2 | 8\.81 |
235 | | lake | 2 | 1\.82 |
236 | | pb | 3 | 6\.66 |
237 | | roe | 1 | 11\.25 |
238 | 
239 | `GROUP BY` does exactly what its name implies:
240 | groups all the records with the same value for the specified field together
241 | so that aggregation can process each batch separately.
242 | Since all the records in each batch have the same value for `person`, 243 | it no longer matters that the database manager 244 | is picking an arbitrary one to display 245 | alongside the aggregated `reading` values. 246 | 247 | Just as we can sort by multiple criteria at once, 248 | we can also group by multiple criteria. 249 | To get the average reading by scientist and quantity measured, 250 | for example, 251 | we just add another field to the `GROUP BY` clause: 252 | 253 | ```sql 254 | SELECT person, quant, count(reading), round(avg(reading), 2) 255 | FROM Survey 256 | GROUP BY person, quant; 257 | ``` 258 | 259 | | person | quant | count(reading) | round(avg(reading), 2) | 260 | | ---------------- | -------------- | ---------------------- | ---------------------- | 261 | | \-null- | sal | 1 | 0\.06 | 262 | | \-null- | temp | 1 | \-26.0 | 263 | | dyer | rad | 2 | 8\.81 | 264 | | dyer | sal | 2 | 0\.11 | 265 | | lake | rad | 2 | 1\.82 | 266 | | lake | sal | 4 | 0\.11 | 267 | | lake | temp | 1 | \-16.0 | 268 | | pb | rad | 3 | 6\.66 | 269 | | pb | temp | 2 | \-20.0 | 270 | | roe | rad | 1 | 11\.25 | 271 | | roe | sal | 2 | 32\.05 | 272 | 273 | Note that we have added `quant` to the list of fields displayed, 274 | since the results wouldn't make much sense otherwise. 
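Grouping works just as well with a single criterion. For example, to count how many readings of each quantity were taken (again assuming this lesson's `survey.db`), we can group by `quant` alone:

```sql
SELECT quant, count(reading) FROM Survey GROUP BY quant;
```

| quant | count(reading) |
| ---------------- | -------------- |
| rad | 8 |
| sal | 9 |
| temp | 4 |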
275 | 
276 | Let's go one step further and remove all the entries
277 | where we don't know who took the measurement:
278 | 
279 | ```sql
280 | SELECT person, quant, count(reading), round(avg(reading), 2)
281 | FROM Survey
282 | WHERE person IS NOT NULL
283 | GROUP BY person, quant
284 | ORDER BY person, quant;
285 | ```
286 | 
287 | | person | quant | count(reading) | round(avg(reading), 2) |
288 | | ---------------- | -------------- | ---------------------- | ---------------------- |
289 | | dyer | rad | 2 | 8\.81 |
290 | | dyer | sal | 2 | 0\.11 |
291 | | lake | rad | 2 | 1\.82 |
292 | | lake | sal | 4 | 0\.11 |
293 | | lake | temp | 1 | \-16.0 |
294 | | pb | rad | 3 | 6\.66 |
295 | | pb | temp | 2 | \-20.0 |
296 | | roe | rad | 1 | 11\.25 |
297 | | roe | sal | 2 | 32\.05 |
298 | 
299 | Looking more closely,
300 | this query:
301 | 
302 | 1. selected records from the `Survey` table
303 | where the `person` field was not null;
304 | 
305 | 2. grouped those records into subsets
306 | so that the `person` and `quant` values in each subset
307 | were the same;
308 | 
309 | 3. counted the number of records in each subset,
310 | calculated the average `reading` in each,
311 | and chose a `person` and `quant` value from each
312 | (it doesn't matter which ones,
313 | since they're all equal);
314 | and
315 | 
316 | 4. ordered those results first by `person`,
317 | and then within each sub-group by `quant`.
318 | 
319 | ::::::::::::::::::::::::::::::::::::::: challenge
320 | 
321 | ## Counting Temperature Readings
322 | 
323 | How many temperature readings did Frank Pabodie record,
324 | and what was their average value?
325 | 
326 | ::::::::::::::: solution
327 | 
328 | ## Solution
329 | 
330 | ```sql
331 | SELECT count(reading), avg(reading) FROM Survey WHERE quant = 'temp' AND person = 'pb';
332 | ```
333 | 
334 | | count(reading) | avg(reading) |
335 | | ---------------- | -------------- |
336 | | 2 | \-20.0 |
337 | 
338 | :::::::::::::::::::::::::
339 | 
340 | ::::::::::::::::::::::::::::::::::::::::::::::::::
341 | 
342 | ::::::::::::::::::::::::::::::::::::::: challenge
343 | 
344 | ## Averaging with NULL
345 | 
346 | The average of a set of values is the sum of the values
347 | divided by the number of values.
348 | Does this mean that the `avg` function returns 2.0 or 3.0
349 | when given the values 1.0, `null`, and 5.0?
350 | 
351 | ::::::::::::::: solution
352 | 
353 | ## Solution
354 | 
355 | The answer is 3.0.
356 | `NULL` is not a value; it is the absence of a value.
357 | As such it is not included in the calculation.
358 | 
359 | You can confirm this by executing the following code:
360 | 
361 | ```sql
362 | SELECT AVG(a) FROM (
363 | SELECT 1 AS a
364 | UNION ALL SELECT NULL
365 | UNION ALL SELECT 5);
366 | ```
367 | 
368 | :::::::::::::::::::::::::
369 | 
370 | ::::::::::::::::::::::::::::::::::::::::::::::::::
371 | 
372 | ::::::::::::::::::::::::::::::::::::::: challenge
373 | 
374 | ## What Does This Query Do?
375 | 
376 | We want to calculate the difference between
377 | each individual radiation reading
378 | and the average of all the radiation readings.
379 | We write the query:
380 | 
381 | ```sql
382 | SELECT reading - avg(reading) FROM Survey WHERE quant = 'rad';
383 | ```
384 | 
385 | What does this actually produce, and can you think of why?
386 | 
387 | ::::::::::::::: solution
388 | 
389 | ## Solution
390 | 
391 | The query produces only one row of results, when what we really want is a result for each of the readings.
392 | The `avg()` function produces only a single value, and because it is run first, the table is reduced to a single row.
The `reading` value is simply an arbitrary one.
394 | 
395 | To achieve what we wanted, we would have to run two queries:
396 | 
397 | ```sql
398 | SELECT avg(reading) FROM Survey WHERE quant='rad';
399 | ```
400 | 
401 | This produces the average value (6.5625), which we can then insert into a second query:
402 | 
403 | ```sql
404 | SELECT reading - 6.5625 FROM Survey WHERE quant = 'rad';
405 | ```
406 | 
407 | This produces what we want, but we can combine this into a single query using subqueries.
408 | 
409 | ```sql
410 | SELECT reading - (SELECT avg(reading) FROM Survey WHERE quant='rad') FROM Survey WHERE quant = 'rad';
411 | ```
412 | 
413 | This way we don't have to execute two queries.
414 | 
415 | In summary, what we have done is replace `avg(reading)` with `(SELECT avg(reading) FROM Survey WHERE quant='rad')` in the original query.
416 | 
417 | :::::::::::::::::::::::::
418 | 
419 | ::::::::::::::::::::::::::::::::::::::::::::::::::
420 | 
421 | ::::::::::::::::::::::::::::::::::::::: challenge
422 | 
423 | ## Using the group\_concat function
424 | 
425 | The function `group_concat(field, separator)`
426 | concatenates all the values in a field
427 | using the specified separator character
428 | (or ',' if the separator isn't specified).
429 | Use this to produce a one-line list of scientists' names,
430 | such as:
431 | 
432 | ```output
433 | William Dyer, Frank Pabodie, Anderson Lake, Valentina Roerich, Frank Danforth
434 | ```
435 | 
436 | Can you find a way to list all the scientists' family names separated by a comma?
437 | Can you find a way to list all the scientists' personal and family names separated by a comma?
438 | 439 | ::::::::::::::: solution 440 | 441 | List all the family names separated by a comma: 442 | 443 | ```sql 444 | SELECT group_concat(family, ',') FROM Person; 445 | ``` 446 | 447 | List all the full names separated by a comma: 448 | 449 | ```sql 450 | SELECT group_concat(personal || ' ' || family, ',') FROM Person; 451 | ``` 452 | 453 | ::::::::::::::::::::::::: 454 | 455 | :::::::::::::::::::::::::::::::::::::::::::::::::: 456 | 457 | :::::::::::::::::::::::::::::::::::::::: keypoints 458 | 459 | - Use aggregation functions to combine multiple values. 460 | - Aggregation functions ignore `null` values. 461 | - Aggregation happens after filtering. 462 | - Use GROUP BY to combine subsets separately. 463 | - If no aggregation function is specified for a field, the query may return an arbitrary value for that field. 464 | 465 | :::::::::::::::::::::::::::::::::::::::::::::::::: 466 | 467 | 468 | --------------------------------------------------------------------------------