├── .editorconfig ├── .github └── workflows │ ├── README.md │ ├── pr-close-signal.yaml │ ├── pr-comment.yaml │ ├── pr-post-remove-branch.yaml │ ├── pr-preflight.yaml │ ├── pr-receive.yaml │ ├── sandpaper-main.yaml │ ├── sandpaper-version.txt │ ├── update-cache.yaml │ └── update-workflows.yaml ├── .gitignore ├── .zenodo.json ├── AUTHORS.md ├── CITATION ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── config.yaml ├── episodes ├── 01-introduction.md ├── 02-basics.md ├── 03-control-structures.md ├── 04-reusable.md ├── 05-processing-data-from-file.md ├── 06-date-and-time.md ├── 07-json.md ├── 08-Pandas.md ├── 09-extracting-data.md ├── 10-aggregations.md ├── 11-joins.md ├── 12-long-and-wide.md ├── 13-matplotlib.md ├── 14-sqlite.md ├── data │ ├── Newspapers.csv │ ├── Newspapers.txt │ ├── Q1.txt │ ├── Q6.txt │ ├── Q7.txt │ ├── SAFI.json │ ├── SAFI_clean.csv │ ├── SAFI_crops.csv │ ├── SAFI_full_shortname.csv │ ├── SAFI_grass_roof.csv │ ├── SAFI_grass_roof_burntbricks.csv │ ├── SAFI_grass_roof_muddaub.csv │ ├── SAFI_results.csv │ ├── SAFI_results_anon.csv │ ├── SN7577.sqlite │ ├── SN7577.tab │ ├── SN7577i_a.csv │ ├── SN7577i_aa.csv │ ├── SN7577i_b.csv │ ├── SN7577i_bb.csv │ ├── SN7577i_c.csv │ └── SN7577i_d.csv └── fig │ ├── Python_date_format_01.png │ ├── Python_function_parameters_9.png │ ├── Python_install_1.png │ ├── Python_install_2.png │ ├── Python_jupyter_6.png │ ├── Python_jupyter_7.png │ ├── Python_jupyter_8.png │ ├── Python_jupyterl_6.png │ ├── Python_repl_3.png │ ├── Python_repl_4.png │ ├── Python_repl_5.png │ ├── Python_repll_3.png │ ├── barplot1.png │ ├── boxplot1.png │ ├── boxplot2.png │ ├── boxplot3.png │ ├── functionAnatomy.png │ ├── histogram1.png │ ├── histogram3.png │ ├── lm1.png │ ├── pandas_join_types.png │ ├── scatter1.png │ └── scatter2.png ├── index.md ├── instructors └── instructor-notes.md ├── learners ├── discuss.md ├── reference.md └── setup.md ├── profiles └── learner-profiles.md └── site └── README.md /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | insert_final_newline = true 6 | trim_trailing_whitespace = true 7 | 8 | [*.md] 9 | indent_size = 2 10 | indent_style = space 11 | max_line_length = 100 # Please keep this in sync with bin/lesson_check.py! 12 | trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (
<br/>) 13 | 14 | [*.r] 15 | max_line_length = 80 16 | 17 | [*.py] 18 | indent_size = 4 19 | indent_style = space 20 | max_line_length = 79 21 | 22 | [*.sh] 23 | end_of_line = lf 24 | 25 | [Makefile] 26 | indent_style = tab 27 | -------------------------------------------------------------------------------- /.github/workflows/README.md: -------------------------------------------------------------------------------- 1 | # Carpentries Workflows 2 | 3 | This directory contains workflows to be used for Lessons using the {sandpaper} 4 | lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml` 5 | and `pr-receive.yaml`) and the rest are bots to handle pull request management. 6 | 7 | These workflows will likely change as {sandpaper} evolves, so it is important to 8 | keep them up-to-date. To do this in your lesson you can do the following in your 9 | R console: 10 | 11 | ```r 12 | # Install/Update sandpaper 13 | options(repos = c(carpentries = "https://carpentries.r-universe.dev/", 14 | CRAN = "https://cloud.r-project.org")) 15 | install.packages("sandpaper") 16 | 17 | # update the workflows in your lesson 18 | library("sandpaper") 19 | update_github_workflows() 20 | ``` 21 | 22 | Inside this folder, you will find a file called `sandpaper-version.txt`, which 23 | will contain a version number for sandpaper. This will be used in the future to 24 | alert you if a workflow update is needed. 25 | 26 | What follows are the descriptions of the workflow files: 27 | 28 | ## Deployment 29 | 30 | ### 01 Build and Deploy (sandpaper-main.yaml) 31 | 32 | This is the main driver that will only act on the main branch of the repository. 33 | This workflow does the following: 34 | 35 | 1. checks out the lesson 36 | 2. provisions the following resources 37 | - R 38 | - pandoc 39 | - lesson infrastructure (stored in a cache) 40 | - lesson dependencies if needed (stored in a cache) 41 | 3. builds the lesson via `sandpaper:::ci_deploy()` 42 | 43 | #### Caching 44 | 45 | This workflow has two caches; one cache is for the lesson infrastructure and 46 | the other is for the lesson dependencies if the lesson contains rendered 47 | content. These caches are invalidated by new versions of the infrastructure and 48 | the `renv.lock` file, respectively. If there is a problem with the cache, 49 | manual invalidation is necessary. You will need maintainer access to the repository, 50 | and you can either go to the actions tab and [click on the caches button to find 51 | and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/) 52 | or set the `CACHE_VERSION` secret to the current date (which will 53 | invalidate all of the caches). 54 | 55 | ## Updates 56 | 57 | ### Setup Information 58 | 59 | These workflows run on a schedule and at the maintainer's request. Because they 60 | create pull requests that update workflows/require the downstream actions to run, 61 | they need a special repository/organization secret token called 62 | `SANDPAPER_WORKFLOW` and it must have the `public_repo` and `workflow` scopes. 63 | 64 | This can be an individual user token, OR it can be a trusted bot account. If you 65 | have a repository in one of the official Carpentries accounts, then you do not 66 | need to worry about this token being present because the Carpentries Core Team 67 | will take care of supplying this token. 68 | 69 | If you want to use your personal account: you can go to 70 | <https://github.com/settings/tokens> 71 | to create a token.
Once you have created your token, you should copy it to your 72 | clipboard and then go to your repository's settings > secrets > actions and 73 | create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token. 74 | 75 | If you do not specify your token correctly, the runs will not fail; instead, they will 76 | give you instructions on how to provide the token for your repository. 77 | 78 | ### 02 Maintain: Update Workflow Files (update-workflows.yaml) 79 | 80 | The {sandpaper} repository was designed to do as much as possible to separate 81 | the tools from the content. For local builds, this is absolutely true, but 82 | there is a minor issue when it comes to workflow files: they must live inside 83 | the repository. 84 | 85 | This workflow ensures that the workflow files are up-to-date. The way it works is 86 | to download the update-workflows.sh script from GitHub and run it. The script 87 | will do the following: 88 | 89 | 1. check the recorded version of sandpaper against the current version on GitHub 90 | 2. update the files if there is a difference in versions 91 | 92 | After the files are updated, if there are any changes, they are pushed to a 93 | branch called `update/workflows` and a pull request is created. Maintainers are 94 | encouraged to review the changes and accept the pull request if the outputs 95 | are okay. 96 | 97 | This update is run weekly or on demand. 98 | 99 | ### 03 Maintain: Update Package Cache (update-cache.yaml) 100 | 101 | For lessons that have generated content, we use {renv} to ensure that the output 102 | is stable. This is controlled by a single lockfile which documents the packages 103 | needed for the lesson and the version numbers. This workflow is skipped in 104 | lessons that do not have generated content. 105 | 106 | Because the lessons need to remain current with the package ecosystem, it's a 107 | good idea to make sure these packages can be updated periodically. The 108 | update cache workflow will do this by checking for updates, applying them in a 109 | branch called `update/packages` and creating a pull request with _only the 110 | lockfile changed_. 111 | 112 | From here, the markdown documents will be rebuilt and you can inspect what has 113 | changed based on how the packages have updated. 114 | 115 | ## Pull Request and Review Management 116 | 117 | Because our lessons execute code, pull requests are a security risk for any 118 | lesson and thus have security measures associated with them. **Do not merge any 119 | pull requests that do not pass checks or that do not have bot comments on them.** 120 | 121 | These workflows all go together and are described in the following 122 | diagram and the sections below: 123 | 124 | ![Graph representation of a pull request](https://carpentries.github.io/sandpaper/articles/img/pr-flow.dot.svg) 125 | 126 | ### Pre Flight Pull Request Validation (pr-preflight.yaml) 127 | 128 | This workflow runs every time a pull request is created or updated, and its purpose is to 129 | validate that the pull request is okay to run. This means checking the following things: 130 | 131 | 1. The pull request does not contain modified workflow files 132 | 2. If the pull request contains modified workflow files, it does not contain 133 | modified content files (such as a situation where @carpentries-bot will 134 | make an automated pull request) 135 | 3. The pull request does not contain an invalid commit hash (e.g. from a fork 136 | that was made before a lesson was transitioned from styles to use the 137 | workbench).
138 | 139 | Once the checks are finished, a comment is issued to the pull request, which 140 | will allow maintainers to determine if it is safe to run the 141 | "Receive Pull Request" workflow from new contributors. 142 | 143 | ### Receive Pull Request (pr-receive.yaml) 144 | 145 | **Note of caution:** This workflow runs arbitrary code submitted by anyone who creates a 146 | pull request. GitHub has safeguarded the token used in this workflow to have no 147 | privileges in the repository, but we have taken precautions to protect against 148 | spoofing. 149 | 150 | This workflow is triggered with every push to a pull request. If this workflow 151 | is already running and a new push is sent to the pull request, the workflow 152 | running from the previous push will be cancelled and a new workflow run will be 153 | started. 154 | 155 | The first step of this workflow is to check if it is valid (e.g. that no 156 | workflow files have been modified). If there are workflow files that have been 157 | modified, a comment is made indicating that the workflow will not run. If 158 | both a workflow file and lesson content are modified, an error will occur. 159 | 160 | The second step (if valid) is to build the generated content from the pull 161 | request. This builds the content and uploads three artifacts: 162 | 163 | 1. The pull request number (pr) 164 | 2. A summary of changes after the rendering process (diff) 165 | 3. The rendered files (built) 166 | 167 | Because this workflow builds generated content, it follows the same general 168 | process as the `sandpaper-main` workflow with the same caching mechanisms. 169 | 170 | The artifacts produced are used by the next workflow. 171 | 172 | ### Comment on Pull Request (pr-comment.yaml) 173 | 174 | This workflow is triggered if the `pr-receive.yaml` workflow is successful. 175 | The steps in this workflow are: 176 | 177 | 1. Test if the workflow is valid and comment the result of the validation on the 178 | pull request. 179 | 2. If it is valid: create an orphan branch with two commits: the current state 180 | of the repository and the proposed changes. 181 | 3. If it is valid: update the pull request comment with the summary of changes. 182 | 183 | Importantly: if the pull request is invalid, the branch is not created, so any 184 | malicious code is not published. 185 | 186 | From here, the maintainer can request changes from the author and eventually 187 | either merge or reject the PR. When this happens, if the PR was valid, the 188 | preview branch needs to be deleted. 189 | 190 | ### Send Close PR Signal (pr-close-signal.yaml) 191 | 192 | Triggered any time a pull request is closed. This emits an artifact that is the 193 | pull request number for the next action. 194 | 195 | ### Remove Pull Request Branch (pr-post-remove-branch.yaml) 196 | 197 | Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with 198 | the pull request (if it was created).
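For maintainers who prefer the command line, both of the secrets described above can also be set with the [GitHub CLI](https://cli.github.com/). The following is a minimal sketch rather than official setup instructions; it assumes `gh` is installed and authenticated against your repository:

```bash
# Store a previously generated personal access token as the
# SANDPAPER_WORKFLOW Actions secret (gh prompts for the value).
gh secret set SANDPAPER_WORKFLOW --app actions

# Invalidate all workflow caches by setting CACHE_VERSION to today's date.
gh secret set CACHE_VERSION --app actions --body "$(date +%Y-%m-%d)"
```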
199 | -------------------------------------------------------------------------------- /.github/workflows/pr-close-signal.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Send Close Pull Request Signal" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [closed] 7 | 8 | jobs: 9 | send-close-signal: 10 | name: "Send closing signal" 11 | runs-on: ubuntu-22.04 12 | if: ${{ github.event.action == 'closed' }} 13 | steps: 14 | - name: "Create PRtifact" 15 | run: | 16 | mkdir -p ./pr 17 | printf ${{ github.event.number }} > ./pr/NUM 18 | - name: Upload Diff 19 | uses: actions/upload-artifact@v4 20 | with: 21 | name: pr 22 | path: ./pr 23 | -------------------------------------------------------------------------------- /.github/workflows/pr-comment.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Comment on the Pull Request" 2 | 3 | # read-write repo token 4 | # access to secrets 5 | on: 6 | workflow_run: 7 | workflows: ["Receive Pull Request"] 8 | types: 9 | - completed 10 | 11 | concurrency: 12 | group: pr-${{ github.event.workflow_run.pull_requests[0].number }} 13 | cancel-in-progress: true 14 | 15 | 16 | jobs: 17 | # Pull requests are valid if: 18 | # - they match the sha of the workflow run head commit 19 | # - they are open 20 | # - no .github files were committed 21 | test-pr: 22 | name: "Test if pull request is valid" 23 | runs-on: ubuntu-22.04 24 | if: > 25 | github.event.workflow_run.event == 'pull_request' && 26 | github.event.workflow_run.conclusion == 'success' 27 | outputs: 28 | is_valid: ${{ steps.check-pr.outputs.VALID }} 29 | payload: ${{ steps.check-pr.outputs.payload }} 30 | number: ${{ steps.get-pr.outputs.NUM }} 31 | msg: ${{ steps.check-pr.outputs.MSG }} 32 | steps: 33 | - name: 'Download PR artifact' 34 | id: dl 35 | uses: carpentries/actions/download-workflow-artifact@main 36 | with: 37 | run: ${{ github.event.workflow_run.id }} 38 | name: 'pr' 39 | 40 | - name: "Get PR Number" 41 | if: ${{ steps.dl.outputs.success == 'true' }} 42 | id: get-pr 43 | run: | 44 | unzip pr.zip 45 | echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT 46 | 47 | - name: "Fail if PR number was not present" 48 | id: bad-pr 49 | if: ${{ steps.dl.outputs.success != 'true' }} 50 | run: | 51 | echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.' 
52 | exit 1 53 | - name: "Get Invalid Hashes File" 54 | id: hash 55 | run: | 56 | echo "json<<EOF 57 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 58 | EOF" >> $GITHUB_OUTPUT 59 | - name: "Check PR" 60 | id: check-pr 61 | if: ${{ steps.dl.outputs.success == 'true' }} 62 | uses: carpentries/actions/check-valid-pr@main 63 | with: 64 | pr: ${{ steps.get-pr.outputs.NUM }} 65 | sha: ${{ github.event.workflow_run.head_sha }} 66 | headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire 67 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 68 | fail_on_error: true 69 | 70 | # Create an orphan branch on this repository with two commits 71 | # - the current HEAD of the md-outputs branch 72 | # - the output from running the current HEAD of the pull request through 73 | # the md generator 74 | create-branch: 75 | name: "Create Git Branch" 76 | needs: test-pr 77 | runs-on: ubuntu-22.04 78 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 79 | env: 80 | NR: ${{ needs.test-pr.outputs.number }} 81 | permissions: 82 | contents: write 83 | steps: 84 | - name: 'Checkout md outputs' 85 | uses: actions/checkout@v4 86 | with: 87 | ref: md-outputs 88 | path: built 89 | fetch-depth: 1 90 | 91 | - name: 'Download built markdown' 92 | id: dl 93 | uses: carpentries/actions/download-workflow-artifact@main 94 | with: 95 | run: ${{ github.event.workflow_run.id }} 96 | name: 'built' 97 | 98 | - if: ${{ steps.dl.outputs.success == 'true' }} 99 | run: unzip built.zip 100 | 101 | - name: "Create orphan and push" 102 | if: ${{ steps.dl.outputs.success == 'true' }} 103 | run: | 104 | cd built/ 105 | git config --local user.email "actions@github.com" 106 | git config --local user.name "GitHub Actions" 107 | CURR_HEAD=$(git rev-parse HEAD) 108 | git checkout --orphan md-outputs-PR-${NR} 109 | git add -A 110 | git commit -m "source commit: ${CURR_HEAD}" 111 | ls -A | grep -v '^.git$' | xargs -I _ rm -r '_' 112 | cd ..
113 | unzip -o -d built built.zip 114 | cd built 115 | git add -A 116 | git commit --allow-empty -m "differences for PR #${NR}" 117 | git push -u --force --set-upstream origin md-outputs-PR-${NR} 118 | 119 | # Comment on the Pull Request with a link to the branch and the diff 120 | comment-pr: 121 | name: "Comment on Pull Request" 122 | needs: [test-pr, create-branch] 123 | runs-on: ubuntu-22.04 124 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 125 | env: 126 | NR: ${{ needs.test-pr.outputs.number }} 127 | permissions: 128 | pull-requests: write 129 | steps: 130 | - name: 'Download comment artifact' 131 | id: dl 132 | uses: carpentries/actions/download-workflow-artifact@main 133 | with: 134 | run: ${{ github.event.workflow_run.id }} 135 | name: 'diff' 136 | 137 | - if: ${{ steps.dl.outputs.success == 'true' }} 138 | run: unzip ${{ github.workspace }}/diff.zip 139 | 140 | - name: "Comment on PR" 141 | id: comment-diff 142 | if: ${{ steps.dl.outputs.success == 'true' }} 143 | uses: carpentries/actions/comment-diff@main 144 | with: 145 | pr: ${{ env.NR }} 146 | path: ${{ github.workspace }}/diff.md 147 | 148 | # Comment if the PR is open and matches the SHA, but the workflow files have 149 | # changed 150 | comment-changed-workflow: 151 | name: "Comment if workflow files have changed" 152 | needs: test-pr 153 | runs-on: ubuntu-22.04 154 | if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }} 155 | env: 156 | NR: ${{ github.event.workflow_run.pull_requests[0].number }} 157 | body: ${{ needs.test-pr.outputs.msg }} 158 | permissions: 159 | pull-requests: write 160 | steps: 161 | - name: 'Check for spoofing' 162 | id: dl 163 | uses: carpentries/actions/download-workflow-artifact@main 164 | with: 165 | run: ${{ github.event.workflow_run.id }} 166 | name: 'built' 167 | 168 | - name: 'Alert if spoofed' 169 | id: spoof 170 | if: ${{ steps.dl.outputs.success == 'true' }} 171 | run: | 172 | echo 'body<<EOF' >> $GITHUB_ENV 173 | echo '' >> $GITHUB_ENV 174 | echo '## :x: DANGER :x:' >> $GITHUB_ENV 175 | echo 'This pull request has modified workflows that created output. Close this now.'
>> $GITHUB_ENV 176 | echo '' >> $GITHUB_ENV 177 | echo 'EOF' >> $GITHUB_ENV 178 | 179 | - name: "Comment on PR" 180 | id: comment-diff 181 | uses: carpentries/actions/comment-diff@main 182 | with: 183 | pr: ${{ env.NR }} 184 | body: ${{ env.body }} 185 | -------------------------------------------------------------------------------- /.github/workflows/pr-post-remove-branch.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Remove Temporary PR Branch" 2 | 3 | on: 4 | workflow_run: 5 | workflows: ["Bot: Send Close Pull Request Signal"] 6 | types: 7 | - completed 8 | 9 | jobs: 10 | delete: 11 | name: "Delete branch from Pull Request" 12 | runs-on: ubuntu-22.04 13 | if: > 14 | github.event.workflow_run.event == 'pull_request' && 15 | github.event.workflow_run.conclusion == 'success' 16 | permissions: 17 | contents: write 18 | steps: 19 | - name: 'Download artifact' 20 | uses: carpentries/actions/download-workflow-artifact@main 21 | with: 22 | run: ${{ github.event.workflow_run.id }} 23 | name: pr 24 | - name: "Get PR Number" 25 | id: get-pr 26 | run: | 27 | unzip pr.zip 28 | echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT 29 | - name: 'Remove branch' 30 | uses: carpentries/actions/remove-branch@main 31 | with: 32 | pr: ${{ steps.get-pr.outputs.NUM }} 33 | -------------------------------------------------------------------------------- /.github/workflows/pr-preflight.yaml: -------------------------------------------------------------------------------- 1 | name: "Pull Request Preflight Check" 2 | 3 | on: 4 | pull_request_target: 5 | branches: 6 | ["main"] 7 | types: 8 | ["opened", "synchronize", "reopened"] 9 | 10 | jobs: 11 | test-pr: 12 | name: "Test if pull request is valid" 13 | if: ${{ github.event.action != 'closed' }} 14 | runs-on: ubuntu-22.04 15 | outputs: 16 | is_valid: ${{ steps.check-pr.outputs.VALID }} 17 | permissions: 18 | pull-requests: write 19 | steps: 20 | - name: "Get Invalid Hashes File" 21 | id: hash 22 | run: | 23 | echo "json<<EOF 24 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 25 | EOF" >> $GITHUB_OUTPUT 26 | - name: "Check PR" 27 | id: check-pr 28 | uses: carpentries/actions/check-valid-pr@main 29 | with: 30 | pr: ${{ github.event.number }} 31 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 32 | fail_on_error: true 33 | - name: "Comment result of validation" 34 | id: comment-diff 35 | if: ${{ always() }} 36 | uses: carpentries/actions/comment-diff@main 37 | with: 38 | pr: ${{ github.event.number }} 39 | body: ${{ steps.check-pr.outputs.MSG }} 40 | -------------------------------------------------------------------------------- /.github/workflows/pr-receive.yaml: -------------------------------------------------------------------------------- 1 | name: "Receive Pull Request" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [opened, synchronize, reopened] 7 | 8 | concurrency: 9 | group: ${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | test-pr: 14 | name: "Record PR number" 15 | if: ${{ github.event.action != 'closed' }} 16 | runs-on: ubuntu-22.04 17 | outputs: 18 | is_valid: ${{ steps.check-pr.outputs.VALID }} 19 | steps: 20 | - name: "Record PR number" 21 | id: record 22 | if: ${{ always() }} 23 | run: | 24 | echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR 25 | - name: "Upload PR number" 26 | id: upload 27 | if: ${{ always() }} 28 | uses: actions/upload-artifact@v4 29 | with: 30 | name: pr 31 | path: ${{ github.workspace }}/NR 32 | - name: "Get Invalid Hashes File" 33 | id: hash 34 | run: |
35 | echo "json<<EOF 36 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 37 | EOF" >> $GITHUB_OUTPUT 38 | - name: "echo output" 39 | run: | 40 | echo "${{ steps.hash.outputs.json }}" 41 | - name: "Check PR" 42 | id: check-pr 43 | uses: carpentries/actions/check-valid-pr@main 44 | with: 45 | pr: ${{ github.event.number }} 46 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 47 | 48 | build-md-source: 49 | name: "Build markdown source files if valid" 50 | needs: test-pr 51 | runs-on: ubuntu-22.04 52 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 53 | env: 54 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 55 | RENV_PATHS_ROOT: ~/.local/share/renv/ 56 | CHIVE: ${{ github.workspace }}/site/chive 57 | PR: ${{ github.workspace }}/site/pr 58 | MD: ${{ github.workspace }}/site/built 59 | steps: 60 | - name: "Check Out Main Branch" 61 | uses: actions/checkout@v4 62 | 63 | - name: "Check Out Staging Branch" 64 | uses: actions/checkout@v4 65 | with: 66 | ref: md-outputs 67 | path: ${{ env.MD }} 68 | 69 | - name: "Set up R" 70 | uses: r-lib/actions/setup-r@v2 71 | with: 72 | use-public-rspm: true 73 | install-r: false 74 | 75 | - name: "Set up Pandoc" 76 | uses: r-lib/actions/setup-pandoc@v2 77 | 78 | - name: "Setup Lesson Engine" 79 | uses: carpentries/actions/setup-sandpaper@main 80 | with: 81 | cache-version: ${{ secrets.CACHE_VERSION }} 82 | 83 | - name: "Setup Package Cache" 84 | uses: carpentries/actions/setup-lesson-deps@main 85 | with: 86 | cache-version: ${{ secrets.CACHE_VERSION }} 87 | 88 | - name: "Validate and Build Markdown" 89 | id: build-site 90 | run: | 91 | sandpaper::package_cache_trigger(TRUE) 92 | sandpaper::validate_lesson(path = '${{ github.workspace }}') 93 | sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE) 94 | shell: Rscript {0} 95 | 96 | - name: "Generate Artifacts" 97 | id: generate-artifacts 98 | run: | 99 | sandpaper:::ci_bundle_pr_artifacts( 100 | repo = '${{ github.repository }}', 101 | pr_number = '${{ github.event.number }}', 102 | path_md = '${{ env.MD }}', 103 | path_pr = '${{ env.PR }}', 104 | path_archive = '${{ env.CHIVE }}', 105 | branch = 'md-outputs' 106 | ) 107 | shell: Rscript {0} 108 | 109 | - name: "Upload PR" 110 | uses: actions/upload-artifact@v4 111 | with: 112 | name: pr 113 | path: ${{ env.PR }} 114 | overwrite: true 115 | 116 | - name: "Upload Diff" 117 | uses: actions/upload-artifact@v4 118 | with: 119 | name: diff 120 | path: ${{ env.CHIVE }} 121 | retention-days: 1 122 | 123 | - name: "Upload Build" 124 | uses: actions/upload-artifact@v4 125 | with: 126 | name: built 127 | path: ${{ env.MD }} 128 | retention-days: 1 129 | 130 | - name: "Teardown" 131 | run: sandpaper::reset_site() 132 | shell: Rscript {0} 133 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-main.yaml: -------------------------------------------------------------------------------- 1 | name: "01 Build and Deploy Site" 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | schedule: 9 | - cron: '0 0 * * 2' 10 | workflow_dispatch: 11 | inputs: 12 | name: 13 | description: 'Who triggered this build?'
14 | required: true 15 | default: 'Maintainer (via GitHub)' 16 | reset: 17 | description: 'Reset cached markdown files' 18 | required: false 19 | default: false 20 | type: boolean 21 | jobs: 22 | full-build: 23 | name: "Build Full Site" 24 | 25 | # 2024-10-01: ubuntu-latest is now 24.04 and R is not installed by default in the runner image 26 | # pin to 22.04 for now 27 | runs-on: ubuntu-22.04 28 | permissions: 29 | checks: write 30 | contents: write 31 | pages: write 32 | env: 33 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 34 | RENV_PATHS_ROOT: ~/.local/share/renv/ 35 | steps: 36 | 37 | - name: "Checkout Lesson" 38 | uses: actions/checkout@v4 39 | 40 | - name: "Set up R" 41 | uses: r-lib/actions/setup-r@v2 42 | with: 43 | use-public-rspm: true 44 | install-r: false 45 | 46 | - name: "Set up Pandoc" 47 | uses: r-lib/actions/setup-pandoc@v2 48 | 49 | - name: "Setup Lesson Engine" 50 | uses: carpentries/actions/setup-sandpaper@main 51 | with: 52 | cache-version: ${{ secrets.CACHE_VERSION }} 53 | 54 | - name: "Setup Package Cache" 55 | uses: carpentries/actions/setup-lesson-deps@main 56 | with: 57 | cache-version: ${{ secrets.CACHE_VERSION }} 58 | 59 | - name: "Deploy Site" 60 | run: | 61 | reset <- "${{ github.event.inputs.reset }}" == "true" 62 | sandpaper::package_cache_trigger(TRUE) 63 | sandpaper:::ci_deploy(reset = reset) 64 | shell: Rscript {0} 65 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-version.txt: -------------------------------------------------------------------------------- 1 | 0.16.11 2 | -------------------------------------------------------------------------------- /.github/workflows/update-cache.yaml: -------------------------------------------------------------------------------- 1 | name: "03 Maintain: Update Package Cache" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 
8 | required: true 9 | default: 'monthly run' 10 | schedule: 11 | # Run every tuesday 12 | - cron: '0 0 * * 2' 13 | 14 | jobs: 15 | preflight: 16 | name: "Preflight Check" 17 | runs-on: ubuntu-22.04 18 | outputs: 19 | ok: ${{ steps.check.outputs.ok }} 20 | steps: 21 | - id: check 22 | run: | 23 | if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then 24 | echo "ok=true" >> $GITHUB_OUTPUT 25 | echo "Running on request" 26 | # using single brackets here to avoid 08 being interpreted as octal 27 | # https://github.com/carpentries/sandpaper/issues/250 28 | elif [ `date +%d` -le 7 ]; then 29 | # If the Tuesday lands in the first week of the month, run it 30 | echo "ok=true" >> $GITHUB_OUTPUT 31 | echo "Running on schedule" 32 | else 33 | echo "ok=false" >> $GITHUB_OUTPUT 34 | echo "Not Running Today" 35 | fi 36 | 37 | check_renv: 38 | name: "Check if We Need {renv}" 39 | runs-on: ubuntu-22.04 40 | needs: preflight 41 | if: ${{ needs.preflight.outputs.ok == 'true'}} 42 | outputs: 43 | needed: ${{ steps.renv.outputs.exists }} 44 | steps: 45 | - name: "Checkout Lesson" 46 | uses: actions/checkout@v4 47 | - id: renv 48 | run: | 49 | if [[ -d renv ]]; then 50 | echo "exists=true" >> $GITHUB_OUTPUT 51 | fi 52 | 53 | check_token: 54 | name: "Check SANDPAPER_WORKFLOW token" 55 | runs-on: ubuntu-22.04 56 | needs: check_renv 57 | if: ${{ needs.check_renv.outputs.needed == 'true' }} 58 | outputs: 59 | workflow: ${{ steps.validate.outputs.wf }} 60 | repo: ${{ steps.validate.outputs.repo }} 61 | steps: 62 | - name: "validate token" 63 | id: validate 64 | uses: carpentries/actions/check-valid-credentials@main 65 | with: 66 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 67 | 68 | update_cache: 69 | name: "Update Package Cache" 70 | needs: check_token 71 | if: ${{ needs.check_token.outputs.repo== 'true' }} 72 | runs-on: ubuntu-22.04 73 | env: 74 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 75 | RENV_PATHS_ROOT: ~/.local/share/renv/ 76 | steps: 77 | 78 | - name: "Checkout Lesson" 79 | uses: actions/checkout@v4 80 | 81 | - name: "Set up R" 82 | uses: r-lib/actions/setup-r@v2 83 | with: 84 | use-public-rspm: true 85 | install-r: false 86 | 87 | - name: "Update {renv} deps and determine if a PR is needed" 88 | id: update 89 | uses: carpentries/actions/update-lockfile@main 90 | with: 91 | cache-version: ${{ secrets.CACHE_VERSION }} 92 | 93 | - name: Create Pull Request 94 | id: cpr 95 | if: ${{ steps.update.outputs.n > 0 }} 96 | uses: carpentries/create-pull-request@main 97 | with: 98 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 99 | delete-branch: true 100 | branch: "update/packages" 101 | commit-message: "[actions] update ${{ steps.update.outputs.n }} packages" 102 | title: "Update ${{ steps.update.outputs.n }} packages" 103 | body: | 104 | :robot: This is an automated build 105 | 106 | This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions: 107 | 108 | ``` 109 | ${{ steps.update.outputs.report }} 110 | ``` 111 | 112 | :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates. 
113 | 114 | If you want to inspect these changes locally, you can use the following code to check out a new branch: 115 | 116 | ```bash 117 | git fetch origin update/packages 118 | git checkout update/packages 119 | ``` 120 | 121 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 122 | 123 | [1]: https://github.com/carpentries/create-pull-request/tree/main 124 | labels: "type: package cache" 125 | draft: false 126 | -------------------------------------------------------------------------------- /.github/workflows/update-workflows.yaml: -------------------------------------------------------------------------------- 1 | name: "02 Maintain: Update Workflow Files" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'weekly run' 10 | clean: 11 | description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)' 12 | required: false 13 | default: '.yaml' 14 | schedule: 15 | # Run every Tuesday 16 | - cron: '0 0 * * 2' 17 | 18 | jobs: 19 | check_token: 20 | name: "Check SANDPAPER_WORKFLOW token" 21 | runs-on: ubuntu-22.04 22 | outputs: 23 | workflow: ${{ steps.validate.outputs.wf }} 24 | repo: ${{ steps.validate.outputs.repo }} 25 | steps: 26 | - name: "validate token" 27 | id: validate 28 | uses: carpentries/actions/check-valid-credentials@main 29 | with: 30 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 31 | 32 | update_workflow: 33 | name: "Update Workflow" 34 | runs-on: ubuntu-22.04 35 | needs: check_token 36 | if: ${{ needs.check_token.outputs.workflow == 'true' }} 37 | steps: 38 | - name: "Checkout Repository" 39 | uses: actions/checkout@v4 40 | 41 | - name: Update Workflows 42 | id: update 43 | uses: carpentries/actions/update-workflows@main 44 | with: 45 | clean: ${{ github.event.inputs.clean }} 46 | 47 | - name: Create Pull Request 48 | id: cpr 49 | if: "${{ steps.update.outputs.new }}" 50 | uses: carpentries/create-pull-request@main 51 | with: 52 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 53 | delete-branch: true 54 | branch: "update/workflows" 55 | commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}" 56 | title: "Update Workflows to Version ${{ steps.update.outputs.new }}" 57 | body: | 58 | :robot: This is an automated build 59 | 60 | Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }} 61 | 62 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 63 | 64 | [1]: https://github.com/carpentries/create-pull-request/tree/main 65 | labels: "type: template and tools" 66 | draft: false 67 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # sandpaper files 2 | episodes/*html 3 | site/* 4 | !site/README.md 5 | 6 | # History files 7 | .Rhistory 8 | .Rapp.history 9 | # Session Data files 10 | .RData 11 | # User-specific files 12 | .Ruserdata 13 | # Example code in package build process 14 | *-Ex.R 15 | # Output files from R CMD build 16 | /*.tar.gz 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | # RStudio files 20 | .Rproj.user/ 21 | # produced vignettes 22 | vignettes/*.html 23 | vignettes/*.pdf 24 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 25 | .httr-oauth 26 | # knitr and R markdown default cache directories 27 | *_cache/ 28 | 
/cache/ 29 | # Temporary files created by R markdown 30 | *.utf8.md 31 | *.knit.md 32 | # R Environment Variables 33 | .Renviron 34 | # pkgdown site 35 | docs/ 36 | # translation temp files 37 | po/*~ 38 | # renv detritus 39 | renv/sandbox/ 40 | *.pyc 41 | *~ 42 | .DS_Store 43 | .ipynb_checkpoints 44 | .sass-cache 45 | .jekyll-cache/ 46 | .jekyll-metadata 47 | __pycache__ 48 | _site 49 | .Rproj.user 50 | .bundle/ 51 | .vendor/ 52 | vendor/ 53 | .docker-vendor/ 54 | Gemfile.lock 55 | .*history 56 | -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "contributors": [ 3 | { 4 | "type": "Editor", 5 | "name": "Annajiat Alim Rasel", 6 | "orcid": "0000-0003-0198-3734" 7 | } 8 | ], 9 | "creators": [ 10 | { 11 | "name": "Annajiat Alim Rasel", 12 | "orcid": "0000-0003-0198-3734" 13 | }, 14 | { 15 | "name": "Peter Smyth" 16 | }, 17 | { 18 | "name": "Sarah M Brown", 19 | "orcid": "0000-0001-5728-0822" 20 | }, 21 | { 22 | "name": "Stephen Edward Childs", 23 | "orcid": "0000-0002-4450-4281" 24 | }, 25 | { 26 | "name": "Vini Salazar", 27 | "orcid": "0000-0002-8362-3195" 28 | }, 29 | { 30 | "name": "Scott Carl Peterson", 31 | "orcid": "0000-0002-1920-616X" 32 | }, 33 | { 34 | "name": "Geoffrey Boushey" 35 | }, 36 | { 37 | "name": "Christopher Erdmann", 38 | "orcid": "0000-0003-2554-180X" 39 | }, 40 | { 41 | "name": "Katrin Tirok", 42 | "orcid": "0000-0002-5040-9838" 43 | }, 44 | { 45 | "name": "Katrin Leinweber", 46 | "orcid": "0000-0001-5135-5758" 47 | }, 48 | { 49 | "name": "joelostblom" 50 | }, 51 | { 52 | "name": "tg340" 53 | }, 54 | { 55 | "name": "Tejaswinee Kelkar", 56 | "orcid": "0000-0002-2324-6850" 57 | }, 58 | { 59 | "name": "Yee Mey Seah", 60 | "orcid": "0000-0002-5616-021X" 61 | }, 62 | { 63 | "name": "Benjamin Tovar", 64 | "orcid": "0000-0002-5294-2281" 65 | }, 66 | { 67 | "name": "Tadiwanashe Gutsa", 68 | "orcid": "0000-0002-6871-0899" 69 | }, 70 | { 71 | "name": "crahal" 72 | }, 73 | { 74 | "name": "Jacob Deppen" 75 | }, 76 | { 77 | "name": "Karen Word", 78 | "orcid": "0000-0002-7294-7231" 79 | }, 80 | { 81 | "name": "Katrin Tirok" 82 | }, 83 | { 84 | "name": "Kevan Swanberg" 85 | }, 86 | { 87 | "name": "Tim Young" 88 | }, 89 | { 90 | "name": "Steve Haddock" 91 | }, 92 | { 93 | "name": "sadkate" 94 | }, 95 | { 96 | "name": "Sanjay Fuloria", 97 | "orcid": "0000-0002-4185-0541" 98 | } 99 | ], 100 | "license": { 101 | "id": "CC-BY-4.0" 102 | } 103 | } -------------------------------------------------------------------------------- /AUTHORS.md: -------------------------------------------------------------------------------- 1 | * FIXME: list authors' names and email addresses. 2 | * FIXME: https://github.com/datacarpentry/python-socialsci/blob/gh-pages/.github/workflows/generate_author_md_yml 3 | * LOCALSOURCE: git log | grep Author: | sort | uniq 4 | * DATED: 2021020402 5 | # STATIC LIST: 6 | * Abigail Cabunoc 7 | * Abigail Cabunoc 8 | * Allen Lee 9 | * Andrew Sanchez 10 | * Andy Boughton 11 | * Annajiat Alim Rasel 12 | * Belinda Weaver 13 | * beroe 14 | * Bill Mills 15 | * Brandon Curtis 16 | * Chris Erdmann 17 | * David Mawdsley 18 | * David Perez Suarez 19 | * EDUB <35309993+eemdub@users.noreply.github.com> 20 | * Erin Becker 21 | * ErinBecker 22 | * evanwill 23 | * Francois Michonneau 24 | * Francois Michonneau 25 | * François Michonneau 26 | * Gabriel A. 
Devenyi 27 | * Geoffrey Boushey 28 | * Greg Wilson 29 | * Greg Wilson 30 | * Ian Carroll 31 | * Ian Lee 32 | * Jacob Deppen 33 | * James Allen 34 | * jcoliver 35 | * Joel Nothman 36 | * joelostblom 37 | * Jon Pipitone 38 | * Jonah Duckles 39 | * Joseph Stachelek 40 | * jsta 41 | * karenword 42 | * Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com> 43 | * Katrin Leinweber 44 | * Katrin Tirok <11316788+katrintirok@users.noreply.github.com> 45 | * Katrin Tirok 46 | * Katrin Tirok 47 | * Katrin Tirok 48 | * Maxim Belkin 49 | * Maxim Belkin 50 | * Michael Hansen 51 | * Michael R. Crusoe <1330696+mr-c@users.noreply.github.com> 52 | * Michael R. Crusoe 53 | * mzc9 54 | * naught101 55 | * Nick Young 56 | * Nick Young 57 | * PeterSmyth12 58 | * Piotr Banaszkiewicz 59 | * Raniere Silva 60 | * Raniere Silva 61 | * Raniere Silva 62 | * Rémi Emonet 63 | * Rémi Emonet 64 | * Remi Rampin 65 | * sadkate <63014805+sadkate@users.noreply.github.com> 66 | * Sarah Brown 67 | * Sarah Brown 68 | * Scott Peterson 69 | * Stephen Childs 70 | * Tadigutsah <43744878+Tadigutsah@users.noreply.github.com> 71 | * Tejaswinee K 72 | * tg340 <33851795+tg340@users.noreply.github.com> 73 | * Tim Young 74 | * Timothée Poisot 75 | * Toby Hodges 76 | * Tracy Teal 77 | * Tracy Teal 78 | * trk 79 | * W. Trevor King 80 | * William L. Close 81 | * William L. Close 82 | * Yee Mey 83 | * ymseah 84 | -------------------------------------------------------------------------------- /CITATION: -------------------------------------------------------------------------------- 1 | FIXME: describe how to cite this lesson. 2 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Contributor Code of Conduct" 3 | --- 4 | 5 | As contributors and maintainers of this project, 6 | we pledge to follow the [The Carpentries Code of Conduct][coc]. 7 | 8 | Instances of abusive, harassing, or otherwise unacceptable behavior 9 | may be reported by following our [reporting guidelines][coc-reporting]. 10 | 11 | 12 | [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html 13 | [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html 14 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing 2 | 3 | [The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data 4 | Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source 5 | projects, and we welcome contributions of all kinds: new lessons, fixes to 6 | existing material, bug reports, and reviews of proposed changes are all 7 | welcome. 8 | 9 | ### Contributor Agreement 10 | 11 | By contributing, you agree that we may redistribute your work under [our 12 | license](LICENSE.md). In exchange, we will address your issues and/or assess 13 | your change proposal as promptly as we can, and help you become a member of our 14 | community. Everyone involved in [The Carpentries][cp-site] agrees to abide by 15 | our [code of conduct](CODE_OF_CONDUCT.md). 16 | 17 | ### How to Contribute 18 | 19 | The easiest way to get started is to file an issue to tell us about a spelling 20 | mistake, some awkward wording, or a factual error. This is a good way to 21 | introduce yourself and to meet some of our community members. 22 | 23 | 1. 
If you do not have a [GitHub][github] account, you can [send us comments by 24 | email][contact]. However, we will be able to respond more quickly if you use 25 | one of the other methods described below. 26 | 27 | 2. If you have a [GitHub][github] account, or are willing to [create 28 | one][github-join], but do not know how to use Git, you can report problems 29 | or suggest improvements by [creating an issue][issues]. This allows us to 30 | assign the item to someone and to respond to it in a threaded discussion. 31 | 32 | 3. If you are comfortable with Git, and would like to add or change material, 33 | you can submit a pull request (PR). Instructions for doing this are 34 | [included below](#using-github). 35 | 36 | Note: if you want to build the website locally, please refer to [The Workbench 37 | documentation][template-doc]. 38 | 39 | ### Where to Contribute 40 | 41 | 1. If you wish to change this lesson, add issues and pull requests here. 42 | 2. If you wish to change the template used for workshop websites, please refer 43 | to [The Workbench documentation][template-doc]. 44 | 45 | 46 | ### What to Contribute 47 | 48 | There are many ways to contribute, from writing new exercises and improving 49 | existing ones to updating or filling in the documentation and submitting [bug 50 | reports][issues] about things that do not work, are not clear, or are missing. 51 | If you are looking for ideas, please see [the list of issues for this 52 | repository][repo], or the issues for [Data Carpentry][dc-issues], [Library 53 | Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. 54 | 55 | Comments on issues and reviews of pull requests are just as welcome: we are 56 | smarter together than we are on our own. **Reviews from novices and newcomers 57 | are particularly valuable**: it's easy for people who have been using these 58 | lessons for a while to forget how impenetrable some of this material can be, so 59 | fresh eyes are always welcome. 60 | 61 | ### What *Not* to Contribute 62 | 63 | Our lessons already contain more material than we can cover in a typical 64 | workshop, so we are usually *not* looking for more concepts or tools to add to 65 | them. As a rule, if you want to introduce a new idea, you must (a) estimate how 66 | long it will take to teach and (b) explain what you would take out to make room 67 | for it. The first encourages contributors to be honest about requirements; the 68 | second, to think hard about priorities. 69 | 70 | We are also not looking for exercises or other material that only run on one 71 | platform. Our workshops typically contain a mixture of Windows, macOS, and 72 | Linux users; in order to be usable, our lessons must run equally well on all 73 | three. 74 | 75 | ### Using GitHub 76 | 77 | If you choose to contribute via GitHub, you may want to look at [How to 78 | Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we 79 | use [GitHub flow][github-flow] to manage changes: 80 | 81 | 1. Create a new branch in your desktop copy of this repository for each 82 | significant change. 83 | 2. Commit the change in that branch. 84 | 3. Push that branch to your fork of this repository on GitHub. 85 | 4. Submit a pull request from that branch to the [upstream repository][repo]. 86 | 5. If you receive feedback, make changes on your desktop and push to your 87 | branch on GitHub: the pull request will update automatically. 88 | 89 | NB: The published copy of the lesson is usually in the `main` branch. 
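As a concrete illustration, the flow described above might look like the following in a terminal; the branch name, file, and commit message here are made-up examples:

```bash
# 1. create a new branch for the change
git checkout -b improve-introduction-wording

# 2. commit the change in that branch
git add episodes/01-introduction.md
git commit -m "Clarify wording in the introduction episode"

# 3. push that branch to your fork on GitHub
git push origin improve-introduction-wording

# 4. open a pull request from that branch to the upstream repository,
#    for example via the "Compare & pull request" button on GitHub
```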
90 | 91 | Each lesson has a team of maintainers who review issues and pull requests or 92 | encourage others to do so. The maintainers are community volunteers, and have 93 | final say over what gets merged into the lesson. 94 | 95 | ### Other Resources 96 | 97 | The Carpentries is a global organisation with volunteers and learners all over 98 | the world. We share values of inclusivity and a passion for sharing knowledge, 99 | teaching and learning. There are several ways to connect with The Carpentries 100 | community listed at including via social 101 | media, slack, newsletters, and email lists. You can also [reach us by 102 | email][contact]. 103 | 104 | [repo]: https://example.com/FIXME 105 | [contact]: mailto:team@carpentries.org 106 | [cp-site]: https://carpentries.org/ 107 | [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry 108 | [dc-lessons]: https://datacarpentry.org/lessons/ 109 | [dc-site]: https://datacarpentry.org/ 110 | [discuss-list]: https://lists.software-carpentry.org/listinfo/discuss 111 | [github]: https://github.com 112 | [github-flow]: https://guides.github.com/introduction/flow/ 113 | [github-join]: https://github.com/join 114 | [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github 115 | [issues]: https://carpentries.org/help-wanted-issues/ 116 | [lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry 117 | [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry 118 | [swc-lessons]: https://software-carpentry.org/lessons/ 119 | [swc-site]: https://software-carpentry.org/ 120 | [lc-site]: https://librarycarpentry.org/ 121 | [template-doc]: https://carpentries.github.io/workbench/ 122 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Licenses" 3 | --- 4 | 5 | ## Instructional Material 6 | 7 | All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) 8 | instructional material is made available under the [Creative Commons 9 | Attribution license][cc-by-human]. The following is a human-readable summary of 10 | (and not a substitute for) the [full legal text of the CC BY 4.0 11 | license][cc-by-legal]. 12 | 13 | You are free: 14 | 15 | - to **Share**---copy and redistribute the material in any medium or format 16 | - to **Adapt**---remix, transform, and build upon the material 17 | 18 | for any purpose, even commercially. 19 | 20 | The licensor cannot revoke these freedoms as long as you follow the license 21 | terms. 22 | 23 | Under the following terms: 24 | 25 | - **Attribution**---You must give appropriate credit (mentioning that your work 26 | is derived from work that is Copyright (c) The Carpentries and, where 27 | practical, linking to ), provide a [link to the 28 | license][cc-by-human], and indicate if changes were made. You may do so in 29 | any reasonable manner, but not in any way that suggests the licensor endorses 30 | you or your use. 31 | 32 | - **No additional restrictions**---You may not apply legal terms or 33 | technological measures that legally restrict others from doing anything the 34 | license permits. With the understanding that: 35 | 36 | Notices: 37 | 38 | * You do not have to comply with the license for elements of the material in 39 | the public domain or where your use is permitted by an applicable exception 40 | or limitation. 41 | * No warranties are given. 
The license may not give you all of the permissions 42 | necessary for your intended use. For example, other rights such as publicity, 43 | privacy, or moral rights may limit how you use the material. 44 | 45 | ## Software 46 | 47 | Except where otherwise noted, the example programs and other software provided 48 | by The Carpentries are made available under the [OSI][osi]-approved [MIT 49 | license][mit-license]. 50 | 51 | Permission is hereby granted, free of charge, to any person obtaining a copy of 52 | this software and associated documentation files (the "Software"), to deal in 53 | the Software without restriction, including without limitation the rights to 54 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 55 | of the Software, and to permit persons to whom the Software is furnished to do 56 | so, subject to the following conditions: 57 | 58 | The above copyright notice and this permission notice shall be included in all 59 | copies or substantial portions of the Software. 60 | 61 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 62 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 63 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 64 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 65 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 66 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 67 | SOFTWARE. 68 | 69 | ## Trademark 70 | 71 | "The Carpentries", "Software Carpentry", "Data Carpentry", and "Library 72 | Carpentry" and their respective logos are registered trademarks of 73 | [The Carpentries, Inc.][carpentries]. 74 | 75 | [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ 76 | [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode 77 | [mit-license]: https://opensource.org/licenses/mit-license.html 78 | [carpentries]: https://carpentries.org 79 | [osi]: https://opensource.org 80 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://slack-invite.carpentries.org/) 2 | [![Slack Status](https://img.shields.io/badge/Slack_Channel-dc--socsci--py-E01563.svg)](https://carpentries.slack.com/messages/C9WJEBW01) 3 | 4 | # Data Carpentry Python Lesson with Social Science Data 5 | 6 | Data Carpentry Lesson on Python for social scientists and others based on the data sources mentioned below. Please see our [contribution guidelines](CONTRIBUTING.md) for information on how to contribute updates, bug fixes, or other corrections. 7 | 8 | ### Data 9 | 10 | - [The SAFI Teaching Database](https://datacarpentry.org/socialsci-workshop/data/) 11 | - Hansard Society, Parliament and Government Programme. (2014). Audit of Political Engagement 11, 2013. [data collection]. UK Data Service. SN: 7577, [http://doi.org/10.5255/UKDA-SN-7577-1](https://doi.org/10.5255/UKDA-SN-7577-1). 
Contains public sector information licensed under the [Open Government Licence v2.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/2.0) 12 | 13 | ### Maintainers 14 | 15 | - Stephen Childs ([@sechilds](https://github.com/sechilds)) 16 | - Geoffrey Boushey ([@gboushey](https://github.com/gboushey)) 17 | - Annajiat Alim Rasel ([@annajiat](https://github.com/annajiat)) 18 | 19 | 20 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | #------------------------------------------------------------ 2 | # Values for this lesson. 3 | #------------------------------------------------------------ 4 | 5 | # Which carpentry is this (swc, dc, lc, or cp)? 6 | # swc: Software Carpentry 7 | # dc: Data Carpentry 8 | # lc: Library Carpentry 9 | # cp: Carpentries (to use for instructor training for instance) 10 | # incubator: The Carpentries Incubator 11 | carpentry: 'dc' 12 | 13 | # Overall title for pages. 14 | title: 'Data Analysis and Visualization with Python for Social Scientists *alpha*' 15 | 16 | # Date the lesson was created (YYYY-MM-DD, this is empty by default) 17 | created: '2017-05-25' 18 | 19 | # Comma-separated list of keywords for the lesson 20 | keywords: 'software, data, lesson, The Carpentries' 21 | 22 | # Life cycle stage of the lesson 23 | # possible values: pre-alpha, alpha, beta, stable 24 | life_cycle: 'alpha' 25 | 26 | # License of the lesson materials (recommended CC-BY 4.0) 27 | license: 'CC-BY 4.0' 28 | 29 | # Link to the source repository for this lesson 30 | source: 'https://github.com/datacarpentry/python-socialsci' 31 | 32 | # Default branch of your lesson 33 | branch: 'main' 34 | 35 | # Who to contact if there are any issues 36 | contact: 'team@carpentries.org' 37 | 38 | # Navigation ------------------------------------------------ 39 | # 40 | # Use the following menu items to specify the order of 41 | # individual pages in each dropdown section. Leave blank to 42 | # include all pages in the folder. 43 | # 44 | # Example ------------- 45 | # 46 | # episodes: 47 | # - introduction.md 48 | # - first-steps.md 49 | # 50 | # learners: 51 | # - setup.md 52 | # 53 | # instructors: 54 | # - instructor-notes.md 55 | # 56 | # profiles: 57 | # - one-learner.md 58 | # - another-learner.md 59 | 60 | # Order of episodes in your lesson 61 | episodes: 62 | - 01-introduction.md 63 | - 02-basics.md 64 | - 03-control-structures.md 65 | - 04-reusable.md 66 | - 05-processing-data-from-file.md 67 | - 06-date-and-time.md 68 | - 07-json.md 69 | - 08-Pandas.md 70 | - 09-extracting-data.md 71 | - 10-aggregations.md 72 | - 11-joins.md 73 | - 12-long-and-wide.md 74 | - 13-matplotlib.md 75 | - 14-sqlite.md 76 | 77 | # Information for Learners 78 | learners: 79 | 80 | # Information for Instructors 81 | instructors: 82 | 83 | # Learner Profiles 84 | profiles: 85 | 86 | # Customisation --------------------------------------------- 87 | # 88 | # This space below is where custom yaml items (e.g. 
pinning 89 | # sandpaper and varnish versions) should live 90 | 91 | 92 | url: 'https://datacarpentry.github.io/python-socialsci' 93 | analytics: carpentries 94 | lang: en 95 | -------------------------------------------------------------------------------- /episodes/01-introduction.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Introduction to Python 3 | teaching: 15 4 | exercises: 0 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Examine the Python interpreter 10 | - Recognize the advantage of using the Python programming language 11 | - Understand the concept and benefits of using notebooks for coding 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - Why learn Python? 18 | - What are Jupyter notebooks? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | ## Introducing the Python programming language 23 | 24 | Python is a general purpose programming language. It is an interpreted language, 25 | which makes it suitable for rapid development and prototyping of programming segments or complete 26 | small programs. 27 | 28 | Python's main advantages: 29 | 30 | - Open source software, supported by [Python Software 31 | Foundation](https://www.python.org/psf/) 32 | - Available on all major platforms (Windows, macOS, Linux) 33 | - It is a good language for new programmers to learn due to its straightforward, 34 | object-oriented style 35 | - It is well-structured, which aids readability 36 | - It is extensible (i.e. modifiable) and is supported by a large community who 37 | provide a comprehensive range of 3rd party packages 38 | 39 | ## Interpreted vs. compiled languages 40 | 41 | In any programming language, the code must be translated into "machine code" 42 | before running it. It is the machine code which is executed and produces 43 | results. In a language like C++ your code is translated into machine code and 44 | stored in a separate file, in a process referred to as **compiling** the code. 45 | You then execute the machine code from the file as a separate step. This is 46 | efficient if you intend to run the same machine code many times as you only have 47 | to compile it once and it is very fast to run the compiled machine code. 48 | 49 | On the other hand, if you are experimenting, then your 50 | code will change often and you would have to compile it again every time before 51 | the machine can execute it. This is where **interpreted** languages have the 52 | advantage. You don't need a complete compiled program to "run" what has been 53 | written so far and see the results. This rapid prototyping is helped further by 54 | use of a system called REPL. 55 | 56 | ## REPL 57 | 58 | **REPL** is an acronym which stands for Read, Evaluate, Print and Loop. 59 | 60 | REPL allows you to write single statements of code, have them executed, and if 61 | there are any results to show, they are displayed and then the interpreter loops 62 | back to the beginning and waits for the next program statement. 63 | 64 | ![](fig/Python_repl_3.png){alt='Python\_Repl'} 65 | 66 | In the example above, two variables `a` and `b` have been created, assigned to values 67 | `2` and `3`, and then multiplied together. 68 | 69 | Every time you press Return, the line is interpreted. The assignment statements don't produce any 70 | output so you get only the standard `>>>` prompt. 
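Typed into the interpreter, the session shown in the screenshot above looks like this (the `>>>` prompt is printed by Python, not typed by you):

```python
>>> a = 2
>>> b = 3
>>> a * b
6
```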
71 | 72 | For the `a*b` statement (it is more of an expression than a statement), because 73 | the result is not being assigned to a variable, the REPL displays the result of 74 | the calculation on screen and then waits for the next input. 75 | 76 | The REPL system makes it very easy to try out small chunks of code. 77 | 78 | You are not restricted to single line statements. If the Python interpreter 79 | decides that what you have written on a line cannot be a complete statement it 80 | will give you a continuation prompt of `...` until you complete the statement. 81 | 82 | ## Introducing Jupyter notebooks 83 | 84 | [**Jupyter**](https://jupyter.org/) originates from IPython, an effort to make Python 85 | development more interactive. Since its inception, the scope of the project 86 | has expanded to include **Ju**lia, **Pyt**hon, and **R**, so the name was changed to "Jupyter" 87 | as a reference to these core languages. Today, Jupyter supports even more 88 | languages, but we will be using it only for Python code. Specifically, we will 89 | be using **Jupyter notebooks**, which allows us to easily take notes about 90 | our analysis and view plots within the same document where we code. This 91 | facilitates sharing and reproducibility of analyses, and the notebook interface 92 | is easily accessible through any web browser. Jupyter notebooks are started 93 | from the terminal using 94 | 95 | ```bash 96 | $ jupyter notebook 97 | ``` 98 | 99 | Your browser should start automatically and look 100 | something like this: 101 | 102 | ![](fig/Python_jupyter_6.png){alt='Jupyter\_notebook\_list'} 103 | 104 | When you create a notebook from the *New* option, the new notebook will be displayed in a new 105 | browser tab and look like this. 106 | 107 | ![](fig/Python_jupyter_7.png){alt='Jupyter\_notebook'} 108 | 109 | Initially the notebook has no name other than 'Untitled'. If you click on 'Untitled' you will be 110 | given the option of changing the name to whatever you want. 111 | 112 | The notebook is divided into **cells**. Initially there will be a single input cell marked by `In [ ]:`. 113 | 114 | You can type Python code directly into the cell. You can split the code across 115 | several lines as needed. Unlike the REPL we looked at before, the code is not 116 | interpreted line by line. To interpret the code in a cell, you can click the 117 | *Run* button in the toolbar or from the *Cell* menu option, or use keyboard 118 | shortcuts (e.g., Shift\+Return). All of the code in that cell will then be 119 | executed. 120 | 121 | The results are shown in a separate `Out [1]:` cell immediately below. A new input 122 | cell (`In [ ]:`) is created for you automatically. 123 | 124 | ![](fig/Python_jupyter_8.png){alt='Jupyter\_notebook\_cell'} 125 | 126 | When a cell is run, it is given a number along with the corresponding output 127 | cell. If you have a notebook with many cells in it you can run the cells in any 128 | order and also run the same cell many times. The number on the left hand side of 129 | the input cells increments, so you can always tell the order in which they were 130 | run. For example, a cell marked `In [5]:` was the fifth cell run in the sequence. 131 | 132 | Although there is an option to do so on the toolbar, you do not have to manually 133 | save the notebook. This is done automatically by the Jupyter system. 134 | 135 | Not only are the contents of the `In [ ]:` cells saved, but so are the `Out [ ]:` cells. 
This allows you to create complete documents with both your code and the output
137 | of the code in a single place. You can also change the cell type from
138 | Python code to Markdown using the *Cell > Cell Type* option. [**Markdown**](https://en.wikipedia.org/wiki/Markdown) is
139 | a simple formatting system which allows you to create documentation for your
140 | code, again all within the same notebook structure.
141 | 
142 | The notebook itself is stored as a specially formatted text file with an `.ipynb`
143 | extension. These files can be opened and run by others with Jupyter installed. This allows you to
144 | share your code inputs, outputs, and
145 | Markdown documentation with others. You can also export the notebook to HTML, PDF, and
146 | many other formats to make sharing even easier.
147 | 
148 | :::::::::::::::::::::::::::::::::::::::: keypoints
149 | 
150 | - Python is an interpreted language
151 | - The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
152 | - Jupyter notebooks build on the REPL concept and allow code results and documentation to be maintained together and shared
153 | - The Jupyter notebook is a complete IDE (Integrated Development Environment)
154 | 
155 | ::::::::::::::::::::::::::::::::::::::::::::::::::
156 | 
157 | 
158 | -------------------------------------------------------------------------------- /episodes/03-control-structures.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Python control structures
3 | teaching: 20
4 | exercises: 25
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Change program flow using available language constructs
10 | - Demonstrate how to execute a section of code a fixed number of times
11 | - Demonstrate how to conditionally execute a section of code
12 | - Demonstrate how to execute a section of code on a list of items
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - What constructs are available for changing the flow of a program?
19 | - How can I repeat an action many times?
20 | - How can I perform the same task(s) on a set of items?
21 | 
22 | ::::::::::::::::::::::::::::::::::::::::::::::::::
23 | 
24 | ## Programs are rarely linear
25 | 
26 | Most programs do not work by executing a simple sequential set of statements. The code is constructed so that decisions and different paths through the program can be taken based on changes in variable values.
27 | 
28 | To make this possible, all programming languages have a set of control structures which allow this to happen.
29 | 
30 | In this episode we are going to look at how we can create loops and branches in our Python code.
31 | Specifically we will look at three control structures, namely:
32 | 
33 | - if..else..
34 | - while...
35 | - for ...
36 | 
37 | ## The `if` statement and variants
38 | 
39 | The simple `if` statement allows the program to branch based on the evaluation of an expression.
40 | 
41 | The basic format of the `if` statement is:
42 | 
43 | ```python
44 | if expression :
45 |     statement 1
46 |     statement 2
47 |     ...
48 |     statement n
49 | 
50 | statement always executed
51 | ```
52 | 
53 | If the expression evaluates to `True` then the statements 1 to n will be executed followed by `statement always executed`. If the expression is `False`, only `statement always executed` is executed.
Python knows which lines of code are related to the `if` statement by the indentation, no extra syntax is necessary. 54 | 55 | Below are some examples: 56 | 57 | ```python 58 | print("\nExample 1\n") 59 | 60 | value = 5 61 | threshold= 4 62 | print("value is", value, "threshold is ",threshold) 63 | if value > threshold : 64 | print(value, "is bigger than ", threshold) 65 | 66 | print("\nExample 2\n") 67 | 68 | 69 | high_threshold = 6 70 | print("value is", value, "new threshold is ",high_threshold) 71 | if value > high_threshold : 72 | print(value , "is above ", high_threshold, "threshold") 73 | 74 | print("\nExample 3\n") 75 | 76 | 77 | mid_threshold = 5 78 | print("value is", value, "final threshold is ",mid_threshold) 79 | if value == mid_threshold : 80 | print("value, ", value, " and threshold,", mid_threshold, ", are equal") 81 | ``` 82 | 83 | ```output 84 | Example 1 85 | 86 | value is 5 threshold is 4 87 | 5 is bigger than 4 88 | 89 | Example 2 90 | 91 | value is 5 new threshold is 6 92 | 93 | Example 3 94 | 95 | value is 5 final threshold is 5 96 | value, 5, and threshold, 5, are equal 97 | ``` 98 | 99 | In the examples above there are three things to notice: 100 | 101 | 1. The colon `:` at the end of the `if` line. Leaving this out is a common error. 102 | 2. The indentation of the print statement. If you remembered the `:` on the line before, Jupyter (or any other Python IDE) will automatically do the indentation for you. All of the statements indented at this level are considered to be part of the `if` statement. This is a feature fairly unique to Python, that it cares about the indentation. If there is too much, or too little indentation, you will get an error. 103 | 3. The `if` statement is ended by removing the indent. There is no explicit end to the `if` statement as there is in many other programming languages 104 | 105 | In the last example, notice that in Python the operator used to check equality is `==`. 106 | 107 | ::::::::::::::::::::::::::::::::::::::: challenge 108 | 109 | ## Exercise 110 | 111 | Add another if statement to example 2 that will check if b is greater than or equal to a 112 | 113 | ::::::::::::::: solution 114 | 115 | ## Solution 116 | 117 | ```python 118 | print("\nExample 2a\n") 119 | 120 | a= 3 121 | b= 4 122 | print("a is", a, "b is",b) 123 | if a > b : 124 | print(a, "is bigger than ", b) 125 | if a <= b : 126 | print(b, "is bigger than or equal to ", a) 127 | ``` 128 | 129 | ::::::::::::::::::::::::: 130 | 131 | :::::::::::::::::::::::::::::::::::::::::::::::::: 132 | 133 | Instead of using two separate `if` statements to decide which is larger we can use the `if ... else ...` construct 134 | 135 | ```python 136 | # if ... else ... 137 | 138 | value = 4 139 | threshold = 5 140 | print("value = ", value, "and threshold = ", threshold) 141 | 142 | if value > threshold : 143 | print("above threshold") 144 | else : 145 | print("below threshold") 146 | ``` 147 | 148 | ```output 149 | value = 4 and threshold = 5 150 | below threshold 151 | ``` 152 | 153 | ::::::::::::::::::::::::::::::::::::::: challenge 154 | 155 | ## Exercise 156 | 157 | Repeat above with different operators '\<' , '==' 158 | 159 | 160 | :::::::::::::::::::::::::::::::::::::::::::::::::: 161 | 162 | A further extension of the `if` statement is the `if ... elif ...else` version. 163 | 164 | The example below allows you to be more specific about the comparison of a and b. 165 | 166 | ```python 167 | # if ... elif ... else ... 
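# (there is no explicit 'endif' in Python; as with the plain if statement, the block ends when the indentation ends)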
168 | 
169 | a = 5
170 | b = 4
171 | print("a = ", a, "and b = ", b)
172 | 
173 | if a > b :
174 |     print(a, " is greater than ", b)
175 | elif a == b :
176 |     print(a, " equals ", b)
177 | else :
178 |     print(a, " is less than ", b)
179 | ```
180 | 
181 | ```output
182 | a = 5 and b = 4
183 | 5 is greater than 4
184 | ```
185 | 
186 | The overall structure is similar to the `if ... else` statement. There are three additional things to notice:
187 | 
188 | 1. Each `elif` clause has its own test expression.
189 | 2. You can have as many `elif` clauses as you need
190 | 3. Execution of the whole statement stops after an `elif` expression is found to be True. Therefore the ordering of the `elif` clauses can be significant.
191 | 
192 | ## The `while` loop
193 | 
194 | The while loop is used to repeatedly execute lines of code until some condition becomes False.
195 | 
196 | For the loop to terminate, there has to be something in the code which will potentially change the condition.
197 | 
198 | ```python
199 | # while loop
200 | n = 10
201 | cur_sum = 0
202 | # sum of n numbers
203 | i = 1
204 | while i <= n :
205 |     cur_sum = cur_sum + i
206 |     i = i + 1
207 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
208 | ```
209 | 
210 | ```output
211 | The sum of the numbers from 1 to 10 is 55
212 | ```
213 | 
214 | Points to note:
215 | 
216 | 1. The condition clause (i \<= n) in the while statement can be anything which when evaluated would return a Boolean value of either True or False. Initially i has been set to 1 (before the start of the loop) and therefore the condition is `True`.
217 | 2. The clause can be made more complex by use of parentheses and `and` and `or` operators amongst others
218 | 3. The statements after the while clause are only executed if the condition evaluates as True.
219 | 4. Within the statements after the while clause there should be something which potentially will make the condition evaluate as `False` next time around. If not the loop will never end.
220 | 5. In this case the last statement in the loop changes the value of i which is part of the condition clause, so hopefully the loop will end.
221 | 6. We called our variable `cur_sum` and not `sum` because `sum` is a builtin function (try typing it in, notice the editor
222 |    changes it to green). If we define `sum = 0` now we can't use the function `sum` in this Python session.
223 | 
224 | ::::::::::::::::::::::::::::::::::::::: challenge
225 | 
226 | ## Exercise - Things that can go wrong with while loops
227 | 
228 | In the examples below, without running them, try to decide why we will not get the required answer.
229 | Run each, one at a time, and then correct them. Remember that when the input marker next to a notebook cell
230 | is [\*], your Python interpreter is still working.
231 | 
232 | ```python
233 | # while loop - summing the numbers 1 to 10
234 | n = 10
235 | cur_sum = 0
236 | # sum of n numbers
237 | i = 0
238 | while i <= n :
239 |     i = i + 1
240 |     cur_sum = cur_sum + i
241 | 
242 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
243 | ```
244 | 
245 | ```python
246 | # while loop - summing the numbers 1 to 10
247 | n = 10
248 | cur_sum = 0
249 | boolvalue = False
250 | # sum of n numbers
251 | i = 0
252 | while i <= n and boolvalue:
253 |     cur_sum = cur_sum + i
254 |     i = i + 1
255 | 
256 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
257 | ```
258 | 
259 | ```python
260 | # while loop - summing the numbers 1 to 10
261 | n = 10
262 | cur_sum = 0
263 | # sum of n numbers
264 | i = 0
265 | while i != n :
266 |     cur_sum = cur_sum + i
267 |     i = i + 1
268 | 
269 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
270 | ```
271 | 
272 | ```python
273 | # while loop - summing the numbers 1.1 to 9.9 in steps of 1.1
274 | n = 9.9
275 | cur_sum = 0
276 | # sum of n numbers
277 | i = 0
278 | while i != n :
279 |     cur_sum = cur_sum + i
280 |     i = i + 1.1
281 |     print(i)
282 | 
283 | print("The sum of the numbers from 1.1 to", n, "is ", cur_sum)
284 | ```
285 | 
286 | ::::::::::::::: solution
287 | 
288 | ## Solution
289 | 
290 | 1. Because i is incremented before the sum, you are summing 1 to 11.
291 | 2. Because the Boolean value is set to False, the loop will never be executed.
292 | 3. When i does equal 10 the expression is False and the loop does not execute, so we have only summed 1 to 9
293 | 4. Because you cannot guarantee the internal representation of a float, you should never try to compare floats for equality. In this particular case i never 'equals' n and so the loop never ends. If you did try running this, you can stop it using Ctrl\+c in a terminal or going to the kernel menu of a notebook and choosing interrupt.
294 | 
295 | 
296 | 
297 | :::::::::::::::::::::::::
298 | 
299 | ::::::::::::::::::::::::::::::::::::::::::::::::::
300 | 
301 | ## The `for` loop
302 | 
303 | The for loop, like the while loop, repeatedly executes a set of statements. The difference is that in the for loop we know at the outset how often the statements in the loop will be executed. We don't have to rely on a variable being changed within the looping statements.
304 | 
305 | The basic format of the `for` statement is
306 | 
307 | ```python
308 | for variable_name in some_sequence :
309 |     statement1
310 |     statement2
311 |     ...
312 |     statementn
313 | ```
314 | 
315 | The key part of this is the `some_sequence`. The phrase used in the documentation is that it must be 'iterable'. That means, you can count through the sequence, starting at the beginning and stopping at the end.
316 | 
317 | There are many examples of things which are iterable, some of which we have already come across.
318 | 
319 | - Lists are iterable - they don't have to contain numbers, you iterate over the elements in the list.
320 | - The `range()` function 321 | - The characters in a string 322 | 323 | ```python 324 | print("\nExample 1\n") 325 | for i in [1,2,3] : 326 | print(i) 327 | 328 | print("\nExample 2\n") 329 | for name in ["Tom", "Dick", "Harry"] : 330 | print(name) 331 | 332 | print("\nExample 3\n") 333 | for name in ["Tom", 42, 3.142] : 334 | print(name) 335 | 336 | print("\nExample 4\n") 337 | for i in range(3) : 338 | print(i) 339 | 340 | print("\nExample 5\n") 341 | for i in range(1,4) : 342 | print(i) 343 | 344 | print("\nExample 6\n") 345 | for i in range(2, 11, 2) : 346 | print(i) 347 | 348 | print("\nExample 7\n") 349 | for i in "ABCDE" : 350 | print(i) 351 | 352 | print("\nExample 8\n") 353 | longString = "The quick brown fox jumped over the lazy sleeping dog" 354 | for word in longString.split() : 355 | print(word) 356 | ``` 357 | 358 | ```output 359 | Example 1 360 | 361 | 1 362 | 2 363 | 3 364 | 365 | Example 2 366 | 367 | Tom 368 | Dick 369 | Harry 370 | 371 | Example 3 372 | 373 | Tom 374 | 42 375 | 3.142 376 | 377 | Example 4 378 | 379 | 0 380 | 1 381 | 2 382 | 383 | Example 5 384 | 385 | 1 386 | 2 387 | 3 388 | 389 | Example 6 390 | 391 | 2 392 | 4 393 | 6 394 | 8 395 | 10 396 | 397 | Example 7 398 | 399 | A 400 | B 401 | C 402 | D 403 | E 404 | 405 | Example 8 406 | 407 | The 408 | quick 409 | brown 410 | fox 411 | jumped 412 | over 413 | the 414 | lazy 415 | sleeping 416 | dog 417 | ``` 418 | 419 | ::::::::::::::::::::::::::::::::::::::: challenge 420 | 421 | ## Exercise 422 | 423 | Suppose that we have a string containing a set of 4 different types of values separated by `,` like this: 424 | 425 | ```python 426 | variablelist = "01/01/2010,34.5,Yellow,True" 427 | ``` 428 | 429 | Research the `split()` method and then rewrite example 8 from the `for` loop section above so that it prints the 4 components of `variablelist` 430 | 431 | ::::::::::::::: solution 432 | 433 | ## Solution 434 | 435 | ```python 436 | # From the for loop section above 437 | variablelist = "01/01/2010,34.5,Yellow,True" 438 | for word in variablelist.split(",") : 439 | print(word) 440 | ``` 441 | 442 | The format of `variablelist` is very much like that of a record in a csv file. In later episodes we will see how we can extract these values and assign them to variables for further processing rather than printing them out. 443 | 444 | ::::::::::::::::::::::::: 445 | 446 | :::::::::::::::::::::::::::::::::::::::::::::::::: 447 | 448 | :::::::::::::::::::::::::::::::::::::::: keypoints 449 | 450 | - Most programs will require 'Loops' and 'Branching' constructs. 451 | - The `if`, `elif`, `else` statements allow for branching in code. 452 | - The `for` and `while` statements allow for looping through sections of code 453 | - The programmer must provide a condition to end a `while` loop. 454 | 455 | :::::::::::::::::::::::::::::::::::::::::::::::::: 456 | 457 | 458 | -------------------------------------------------------------------------------- /episodes/04-reusable.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Creating re-usable code 3 | teaching: 25 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Describe the syntax for a user defined function 10 | - Create and use simple functions 11 | - Explain the advantages of using functions 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - What are user defined functions? 
18 | - How can I automate my code for re-use?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Defining a function
23 | 
24 | We have already made use of several Python builtin functions like `print`, `list` and `range`.
25 | 
26 | In addition to the functions provided by Python, you can write your own functions.
27 | 
28 | Functions are used when a section of code needs to be repeated at various different points in a program. It saves you re-writing it all. In reality you rarely need to repeat the exact same code. Usually there will be some variation in variable values needed. Because of this, when you create a function you are allowed to specify a set of `parameters` which represent variables in the function.
29 | 
30 | In our use of the `print` function, we have provided whatever we want to `print` as a `parameter`. Typically whenever we use the `print` function, we pass a different `parameter` value.
31 | 
32 | The ability to specify parameters makes functions very flexible.
33 | 
34 | ```python
35 | def get_item_count(items_str,sep):
36 |     '''
37 |     This function takes a string with a list of items and the character that they're separated by and returns the number of items
38 |     '''
39 |     items_list = items_str.split(sep)
40 |     num_items = len(items_list)
41 |     return num_items
42 | 
43 | items_owned = "bicycle;television;solar_panel;table"
44 | print(get_item_count(items_owned,';'))
45 | ```
46 | 
47 | ```output
48 | 4
49 | ```
50 | 
51 | ![](fig/functionAnatomy.png){alt='Function\_anatomy'}
52 | 
53 | Points to note:
54 | 
55 | 1. The definition of a function (or procedure) starts with the `def` keyword and is followed by the name of the function with any parameters used by the function in parentheses.
56 | 2. The definition clause is terminated with a `:` which causes indentation on the next and subsequent lines. All of these lines form the statements which make up the function. The function ends after the indentation is removed.
57 | 3. Within the function, the parameters behave as variables whose initial values will be those that they were given when the function was called.
58 | 4. Functions have a `return` statement which specifies the value to be returned. This is the value assigned to any variable on the left-hand side of the call to the function (in the example above, the returned value of `num_items` is passed straight to `print`).
59 | 5. You call (run the code of) a function simply by providing its name and values for its parameters, the same way you would for any builtin function.
60 | 6. Once the definition of the function has been executed, it becomes part of Python for the current session and can be used anywhere.
61 | 7. Like any other builtin function, you can use `shift` + `tab` in Jupyter to see the parameters.
62 | 8. At the beginning of the function code we have a multiline `comment` denoted by the `'''` at the beginning and end. This kind of comment is known as a `docstring` and can be used anywhere in Python code as a documentation aid. It is particularly common, and indeed best practice, to use them to give a brief description of the function at the beginning of a function definition in this way. This is because this description will be displayed along with the parameters when you use the help() function or `shift` + `tab` in Jupyter.
63 | 9. Variables defined within the function, such as `items_list` and `num_items`, only exist within the function; they cannot be used outside in the main program.
64 | 
65 | In our `get_item_count` function we have two parameters which must be provided every time the function is used.
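For example, both of the following calls return 4; the first passes the parameters by position, the second names them explicitly:

```python
print(get_item_count(items_owned, ';'))
print(get_item_count(items_str=items_owned, sep=';'))
```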
You need to provide the parameters in the right order, or explicitly name the parameter you are referring to and use the `=` sign to give it a value.
66 | 
67 | For many functions we want to provide default values for parameters so that the user doesn't have to. We can do this in the following way:
68 | 
69 | ```python
70 | def get_item_count(items_str,sep=';'):
71 |     '''
72 |     This function takes a string with a list of items and the character that they're separated by and returns the number of items
73 |     '''
74 |     items_list = items_str.split(sep)
75 |     num_items = len(items_list)
76 |     return num_items
77 | 
78 | 
79 | print(get_item_count(items_owned))
80 | ```
81 | 
82 | ```output
83 | 4
84 | ```
85 | 
86 | The only change we have made is to provide a default value for the `sep` parameter. Now if the user does not provide a value, the default value of `;` will be used. Because `items_str` is the first parameter we can specify its value by position. We could, however, have explicitly named the parameters we were referring to.
87 | 
88 | ```python
89 | print(get_item_count(items_owned, sep = ','))
90 | print(get_item_count(items_str = items_owned, sep=';'))
91 | ```
92 | 
93 | ```output
94 | 1
95 | 4
96 | ```
97 | 
98 | ::::::::::::::::::::::::::::::::::::::: challenge
99 | 
100 | ## Volume of a cuboid
101 | 
102 | 1. Write a function definition to calculate the volume of a cuboid. The function will use three parameters `h`, `w`
103 |    and `l` and return the volume.
104 | 
105 | 2. Supposing that in addition to the volume I also wanted to calculate the surface area and the sum of all of the edges. Would I (or should I) have three separate functions or could I write a single function to provide all three values together?
106 | 
107 | ::::::::::::::: solution
108 | 
109 | ## Solution
110 | 
111 | - A function to calculate the volume of a cuboid could be:
112 | 
113 | ```python
114 | def calculate_vol_cuboid(h, w, len):
115 |     """
116 |     Calculates the volume of a cuboid.
117 |     Takes in h, w, len, that represent height, width, and length of the cuboid.
118 |     Returns the volume.
119 |     """
120 |     volume = h * w * len
121 |     return volume
122 | ```
123 | 
124 | - It depends. As a rule-of-thumb, we want our function to **do one thing and one thing only, and to do it well.**
125 |   If we always have to calculate these three pieces of information, the 'one thing' could be
126 |   'calculate the volume, surface area, and sum of all edges of a cuboid'. Our function would look like this:
127 | 
128 | ```python
129 | # Method 1 - single function
130 | def calculate_cuboid(h, w, len):
131 |     """
132 |     Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
133 | 
134 |     Returns the volume, surface area, and sum of edges of the cuboid.
135 |     """
136 |     volume = h * w * len
137 |     surface_area = 2 * (h * w + h * len + len * w)
138 |     edges = 4 * (h + w + len)
139 |     return volume, surface_area, edges
140 | ```
141 | 
142 | It may be better, however, to break down our function into separate ones - one for each piece of information we are
143 | calculating. Our functions would look like this:
144 | 
145 | ```python
146 | # Method 2 - separate functions
147 | def calc_volume_of_cuboid(h, w, len):
148 |     """
149 |     Calculates the volume of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
150 |     """
151 |     volume = h * w * len
152 |     return volume
153 | 
154 | 
155 | def calc_surface_area_of_cuboid(h, w, len):
156 |     """
157 |     Calculates the surface area of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
158 |     """
159 |     surface_area = 2 * (h * w + h * len + len * w)
160 |     return surface_area
161 | 
162 | 
163 | def calc_sum_of_edges_of_cuboid(h, w, len):
164 |     """
165 |     Calculates the sum of edges of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
166 |     """
167 |     sum_of_edges = 4 * (h + w + len)
168 |     return sum_of_edges
169 | ```
170 | 
171 | We could then rewrite our first solution:
172 | 
173 | ```python
174 | def calculate_cuboid(h, w, len):
175 |     """
176 |     Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
177 | 
178 |     Returns the volume, surface area, and sum of edges of the cuboid.
179 |     """
180 |     volume = calc_volume_of_cuboid(h, w, len)
181 |     surface_area = calc_surface_area_of_cuboid(h, w, len)
182 |     edges = calc_sum_of_edges_of_cuboid(h, w, len)
183 | 
184 |     return volume, surface_area, edges
185 | ```
186 | 
187 | :::::::::::::::::::::::::
188 | 
189 | ::::::::::::::::::::::::::::::::::::::::::::::::::
190 | 
191 | ## Using libraries
192 | 
193 | The functions we have created above only exist for the duration of the session in which they have been defined. If you start a new Jupyter notebook you will have to run the code to define them again.
194 | 
195 | If all of your code is in a single file or notebook this isn't really a problem.
196 | 
197 | There are, however, many thousands of useful functions which other people have written and made available to all Python users by creating libraries (also referred to as packages or modules) of functions.
198 | 
199 | You can find out what all of these libraries are and their contents by visiting the main [python.org](https://www.python.org) site.
200 | 
201 | We need to go through a 2-step process before we can use them in our own programs.
202 | 
203 | Step 1. Use the `pip` command from the command line. `pip` is installed as part of the Python install and is used to fetch the package from the Internet and install it in your Python configuration.
204 | 
205 | ```bash
206 | $ pip install <package-name>
207 | ```
208 | 
209 | pip is a recursive acronym for 'Pip Installs Packages' and is a command line program. Because we are using the Anaconda distribution of Python, all of the packages that we will be using in this lesson are already installed for us, so we can move straight on to step 2.
210 | 
211 | Step 2. In your Python code include an `import package-name` statement. Once this is done, you can use all of the functions contained within the package.
212 | 
213 | As all of these packages are produced by 3rd parties independently of each other, there is the strong possibility that there may be clashes in function names. To allow for this, when you are calling a function from a package that you have imported, you do so by prefixing the function name with the package name. This can make for long-winded function names, so the `import` statement allows you to specify an `alias` for the package name which you must then use instead of the package name.
214 | 
215 | In future episodes, we will be importing the `csv`, `json`, `pandas`, `numpy` and `matplotlib` modules. We will describe their use as we use them.
216 | 
217 | The code that we will use is shown below:
218 | 
219 | ```python
220 | import csv
221 | import json
222 | import pandas as pd
223 | import numpy as np
224 | import matplotlib.pyplot as plt
225 | ```
226 | 
227 | The first two we don't alias as they have short names. The last three we do. Matplotlib is a very large library broken up into what can be thought of as sub-libraries. As we will only be using the functions contained in the `pyplot` sub-library, we can specify that explicitly when we import. This saves time and space. It does not affect how we call the functions in our code.
228 | 
229 | The `alias` we use (specified after the `as` keyword) is entirely up to us. However, those shown here for `pandas`, `numpy` and `matplotlib` are nearly universally adopted conventions used for these popular libraries. If you search for code examples for these libraries on the Internet, you will see these aliases used most of the time.
230 | 
231 | :::::::::::::::::::::::::::::::::::::::: keypoints
232 | 
233 | - Functions are used to create re-usable sections of code
234 | - Using parameters with functions makes them more flexible
235 | - You can use functions written by others by importing the libraries containing them into your code
236 | 
237 | ::::::::::::::::::::::::::::::::::::::::::::::::::
238 | 
239 | 
240 | -------------------------------------------------------------------------------- /episodes/06-date-and-time.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Dates and Time
3 | teaching: 15
4 | exercises: 10
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Describe some of the datetime functions available in Python
10 | - Describe the use of format strings to describe the layout of a date and/or time string
11 | - Make use of date arithmetic
12 | 
13 | ::::::::::::::::::::::::::::::::::::::::::::::::::
14 | 
15 | :::::::::::::::::::::::::::::::::::::::: questions
16 | 
17 | - How are dates and times represented in Python?
18 | - How can I manipulate dates and times?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Dates and Times in Python
23 | 
24 | Python can be very flexible in how it interprets 'strings' which you want to be considered as a date, time, or date and time, but you have to tell Python how the various parts of the date and/or time are represented in your 'string'. You can do this by creating a `format`. In a `format`, different case-sensitive characters preceded by the `%` character act as placeholders for parts of the date/time, for example `%Y` represents the year formatted as a 4-digit number such as 2014.
25 | 
26 | A full list of the characters used and what they represent can be found towards the end of the [datetime](https://docs.python.org/3/library/datetime.html) section of the official Python documentation.
27 | 
28 | There is a `today()` method which allows you to get the current date and time.
29 | By default it is displayed in a format similar to the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) standard format.
30 | 
31 | To use the date and time functions you need to import the `datetime` module.
32 | 
33 | ```python
34 | from datetime import datetime
35 | 
36 | today = datetime.today()
37 | print('ISO :', today)
38 | ```
39 | 
40 | ```output
41 | ISO : 2018-04-12 16:19:17.177441
42 | ```
43 | 
44 | We can use our own formatting instead. For example, if we wanted words instead of numbers and the 4-digit year at the end, we could use the following.
45 | 
46 | ```python
47 | format = "%a %b %d %H:%M:%S %Y"
48 | 
49 | today_str = today.strftime(format)
50 | print('strftime:', today_str)
51 | print(type(today_str))
52 | 
53 | today_date = datetime.strptime(today_str, format)
54 | print('strptime:', today_date.strftime(format))
55 | print(type(today_date))
56 | ```
57 | 
58 | ```output
59 | strftime: Thu Apr 12 16:19:17 2018
60 | <class 'str'>
61 | strptime: Thu Apr 12 16:19:17 2018
62 | <class 'datetime.datetime'>
63 | ```
64 | 
65 | `strftime` converts a datetime object to a string and `strptime` creates a datetime object from a string.
66 | When you print them using the same format string, they look the same.
67 | 
68 | The format of the date fields in the SAFI\_results.csv file has been generated automatically to conform to the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) standard.
69 | 
70 | When we read the file and extract the date fields, they are of type string. Before we can use them as dates, we need to convert them into Python date objects.
71 | 
72 | In the format string we use below, the `-`, `:`, `T` and `Z` characters are just that, characters in the string representing the date/time.
73 | Only the characters preceded by `%` have special meanings.
74 | 
75 | Having converted the strings to datetime objects, there are a variety of methods that we can use to extract different components of the date/time.
76 | 
77 | ```python
78 | from datetime import datetime
79 | 
80 | 
81 | format = "%Y-%m-%dT%H:%M:%S.%fZ"
82 | f = open('SAFI_results.csv', 'r')
83 | 
84 | # skip the header line
85 | line = f.readline()
86 | 
87 | # next line has data
88 | line = f.readline()
89 | 
90 | strdate_start = line.split(',')[3] # A04_start
91 | strdate_end = line.split(',')[4] # A05_end
92 | 
93 | print(type(strdate_start), strdate_start)
94 | print(type(strdate_end), strdate_end)
95 | 
96 | 
97 | # the full date and time
98 | datetime_start = datetime.strptime(strdate_start, format)
99 | print(type(datetime_start))
100 | datetime_end = datetime.strptime(strdate_end, format)
101 | 
102 | print('formatted date and time', datetime_start)
103 | print('formatted date and time', datetime_end)
104 | 
105 | 
106 | # the date component
107 | date_start = datetime.strptime(strdate_start, format).date()
108 | print(type(date_start))
109 | date_end = datetime.strptime(strdate_end, format).date()
110 | 
111 | print('formatted start date', date_start)
112 | print('formatted end date', date_end)
113 | 
114 | # the time component
115 | time_start = datetime.strptime(strdate_start, format).time()
116 | print(type(time_start))
117 | time_end = datetime.strptime(strdate_end, format).time()
118 | 
119 | print('formatted start time', time_start)
120 | print('formatted end time', time_end)
121 | 
122 | 
123 | f.close()
124 | ```
125 | 
126 | ```output
127 | <class 'str'> 2017-03-23T09:49:57.000Z
128 | <class 'str'> 2017-04-02T17:29:08.000Z
129 | <class 'datetime.datetime'>
130 | formatted date and time 2017-03-23 09:49:57
131 | formatted date and time 2017-04-02 17:29:08
132 | <class 'datetime.date'>
133 | formatted start date 2017-03-23
134 | formatted end date 2017-04-02
135 | <class 'datetime.time'>
136 | formatted start time 09:49:57
137 | formatted end time 17:29:08
138 | ```
139 | 
140 | ## Components of dates and times
141 | 
142 | For a date or time we can also extract its individual components.
143 | They are held internally in the datetime data structure.
144 | 
145 | ```python
146 | # date parts.
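# year, month and day are available as integer attributes of the date object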
147 | print('formatted end date', date_end)
148 | print(' end date year', date_end.year)
149 | print(' end date month', date_end.month)
150 | print(' end date day', date_end.day)
151 | print(type(date_end.day))
152 | 
153 | # time parts.
154 | 
155 | print('formatted end time', time_end)
156 | print(' end time hour', time_end.hour)
157 | print(' end time minutes', time_end.minute)
158 | print(' end time seconds', time_end.second)
159 | print(type(time_end.second))
160 | ```
161 | 
162 | ```output
163 | formatted end date 2017-04-02
164 |  end date year 2017
165 |  end date month 4
166 |  end date day 2
167 | <class 'int'>
168 | formatted end time 17:29:08
169 |  end time hour 17
170 |  end time minutes 29
171 |  end time seconds 8
172 | <class 'int'>
173 | ```
174 | 
175 | ## Date arithmetic
176 | 
177 | We can also do arithmetic with the dates.
178 | 
179 | ```python
180 | date_diff = datetime_end - datetime_start
181 | date_diff
182 | print(type(datetime_start))
183 | print(type(date_diff))
184 | print(date_diff)
185 | 
186 | date_diff = datetime_start - datetime_end
187 | print(type(date_diff))
188 | print(date_diff)
189 | ```
190 | 
191 | ```output
192 | <class 'datetime.datetime'>
193 | <class 'datetime.timedelta'>
194 | 10 days, 7:39:11
195 | <class 'datetime.timedelta'>
196 | -11 days, 16:20:49
197 | ```
198 | 
199 | ::::::::::::::::::::::::::::::::::::::: challenge
200 | 
201 | ## Exercise
202 | 
203 | How do you interpret the last result?
204 | 
205 | 
206 | ::::::::::::::::::::::::::::::::::::::::::::::::::
207 | 
208 | The code below calculates the time difference between supposedly starting the survey and ending the survey (for each respondent).
209 | 
210 | ```python
211 | from datetime import datetime
212 | 
213 | format = "%Y-%m-%dT%H:%M:%S.%fZ"
214 | 
215 | f = open('SAFI_results.csv', 'r')
216 | 
217 | line = f.readline()
218 | 
219 | for line in f:
220 |     #print(line)
221 |     strdate_start = line.split(',')[3]
222 |     strdate_end = line.split(',')[4]
223 | 
224 |     datetime_start = datetime.strptime(strdate_start, format)
225 |     datetime_end = datetime.strptime(strdate_end, format)
226 |     date_diff = datetime_end - datetime_start
227 |     print(datetime_start, datetime_end, date_diff)
228 | 
229 | 
230 | f.close()
231 | ```
232 | 
233 | ```output
234 | 2017-03-23 09:49:57 2017-04-02 17:29:08 10 days, 7:39:11
235 | 2017-04-02 09:48:16 2017-04-02 17:26:19 7:38:03
236 | 2017-04-02 14:35:26 2017-04-02 17:26:53 2:51:27
237 | 2017-04-02 14:55:18 2017-04-02 17:27:16 2:31:58
238 | 2017-04-02 15:10:35 2017-04-02 17:27:35 2:17:00
239 | 2017-04-02 15:27:25 2017-04-02 17:28:02 2:00:37
240 | 2017-04-02 15:38:01 2017-04-02 17:28:19 1:50:18
241 | 2017-04-02 15:59:52 2017-04-02 17:28:39 1:28:47
242 | 2017-04-02 16:23:36 2017-04-02 16:42:08 0:18:32
243 | ...
244 | ```
245 | 
246 | ::::::::::::::::::::::::::::::::::::::: challenge
247 | 
248 | ## Exercise
249 | 
250 | 1. In the `SAFI_results.csv` file the `A01_interview_date` field (index 1) contains a date in the form of 'dd/mm/yyyy'. Read the file and calculate the differences in days (because the interview date is only given to the day) between the `A01_interview_date` values and the `A04_start` values. You will need to create a format string for the `A01_interview_date` field.
251 | 
252 | 2. Looking at the results here and from the previous section of code, do you think the smartphone data entry system for the survey was being used in real time?
253 | 
254 | ::::::::::::::: solution
255 | 
256 | ## Solution
257 | 
258 | ```python
259 | from datetime import datetime
260 | 
261 | format1 = "%Y-%m-%dT%H:%M:%S.%fZ"
262 | format2 = "%d/%m/%Y"
263 | 
264 | f = open('SAFI_results.csv', 'r')
265 | 
266 | line = f.readline()
267 | 
268 | for line in f:
269 |     A01 = line.split(',')[1]
270 |     A04 = line.split(',')[3]
271 | 
272 |     datetime_A04 = datetime.strptime(A04, format1)
273 |     datetime_A01 = datetime.strptime(A01, format2)
274 |     date_diff = datetime_A04 - datetime_A01
275 |     print(datetime_A04, datetime_A01, date_diff.days)
276 | 
277 | f.close()
278 | ```
279 | 
280 | :::::::::::::::::::::::::
281 | 
282 | ::::::::::::::::::::::::::::::::::::::::::::::::::
283 | 
284 | :::::::::::::::::::::::::::::::::::::::: keypoints
285 | 
286 | - Date and time functions in Python come from the datetime library, which needs to be imported
287 | - You can use format strings to have dates/times displayed in any representation you like
288 | - Internally, dates and times are stored in special data structures which allow you to access the component parts of dates and times
289 | 
290 | ::::::::::::::::::::::::::::::::::::::::::::::::::
291 | 
292 | 
293 | -------------------------------------------------------------------------------- /episodes/08-Pandas.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Reading data from a file using Pandas
3 | teaching: 15
4 | exercises: 5
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Explain what a module is and how modules are used in Python
10 | - Describe what the Python Data Analysis Library (pandas) is
11 | - Load the Python Data Analysis Library (pandas)
12 | - Use read\_csv to read tabular data into Python
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - What is Pandas?
19 | - How do I read files using Pandas?
20 | - What is the difference between reading files using Pandas and other methods of reading files?
21 | 
22 | ::::::::::::::::::::::::::::::::::::::::::::::::::
23 | 
24 | ## What is Pandas?
25 | 
26 | pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way.
27 | 
28 | Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes the data analysis tasks more consistent in terms of syntax and therefore aids readability.
29 | 
30 | Particular features of pandas that we will be looking at over this and the next couple of episodes include:
31 | 
32 | - Reading data stored in CSV files (other file formats can be read as well)
33 | - Slicing and subsetting data in Dataframes (tables!)
34 | - Dealing with missing data
35 | - Reshaping data (long -> wide, wide -> long)
36 | - Inserting and deleting columns from data structures
37 | - Aggregating data using data grouping facilities using the split-apply-combine paradigm
38 | - Joining of datasets (after they have been loaded into Dataframes)
39 | 
40 | If you are wondering why I write pandas with a lower case 'p', it is because it is the name of the package and Python is case sensitive.
41 | 
42 | ## Importing the pandas library
43 | 
44 | Importing the pandas library is done in exactly the same way as for any other library.
In almost all examples of Python code using the pandas library, it will have been imported and given an alias of `pd`. We will follow the same convention.
45 | 
46 | ```python
47 | import pandas as pd
48 | ```
49 | 
50 | ## Pandas data structures
51 | 
52 | There are two main data structures used by pandas: the Series and the Dataframe. The Series equates in general to a vector or a list. The Dataframe is equivalent to a table. Each column in a pandas Dataframe is a pandas Series data structure.
53 | 
54 | We will mainly be looking at the Dataframe.
55 | 
56 | We can easily create a pandas Dataframe by reading a .csv file.
57 | 
58 | ## Reading a csv file
59 | 
60 | When we read a csv dataset in base Python we did so by opening the dataset, reading and processing a record at a time and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual data items of information from the records on the programmer.
61 | 
62 | The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.
63 | 
64 | In Pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a Dataframe. If your dataset has column headers in the first record then these can be used as the Dataframe column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.
65 | 
66 | For our examples in this episode we are going to use the SN7577.tab file. This is available for download [here](data/SN7577.tab) and the description of the file is available [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7577&type=data%20catalogue#!/details)
67 | 
68 | We are going to read in our SN7577.tab file. Although this is a tab-delimited file we will still use the pandas `read_csv` method, but we will explicitly tell the method that the separator is the tab character and not a comma which is the default.
69 | 
70 | ```python
71 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
72 | ```
73 | 
74 | ::::::::::::::::::::::::::::::::::::::: challenge
75 | 
76 | ## Exercise
77 | 
78 | What happens if you forget to specify `sep='\t'` when reading a tab-delimited dataset?
79 | 
80 | ::::::::::::::: solution
81 | 
82 | ## Solution
83 | 
84 | ```python
85 | df_SN7577_oops = pd.read_csv("SN7577.tab")
86 | print(df_SN7577_oops.shape)
87 | print(df_SN7577_oops)
88 | ```
89 | 
90 | If you allow pandas to assume that your columns are separated by commas (the default) and there aren't any, then each record will be treated as a single column. So the shape is given as 1286 rows (correct) but only one column.
91 | When the contents are displayed, the only column name is the complete first record. Notice the `\t` used to represent the tab characters in the output. This is the same format we used to specify the tab separator when we correctly read in the file.
92 | 
93 | :::::::::::::::::::::::::
94 | 
95 | ::::::::::::::::::::::::::::::::::::::::::::::::::
96 | 
97 | ## Getting information about a Dataframe
98 | 
99 | You can find out the type of the variable `df_SN7577` by using the `type` function.
100 | 
101 | ```python
102 | print(type(df_SN7577))
103 | ```
104 | 
105 | ```output
106 | <class 'pandas.core.frame.DataFrame'>
107 | ```
108 | 
109 | You can see the contents by simply entering the variable name. You can see from the output that it is in a tabular format. The column names have been taken from the first record of the file. On the left-hand side is a column with no name. The entries here have been provided by pandas and act as an index to reference the individual rows of the Dataframe.
110 | 
111 | The `read_csv()` function has an `index_col` parameter which you can use to indicate which of the columns in the file you wish to use as the index instead. As the SN7577 dataset doesn't have a column which would uniquely identify each row we cannot do that.
112 | 
113 | Another thing to notice about the display is that it is truncated. By default you will see the first and last 30 rows. For the columns you will always get the first few columns and typically the last few depending on display space.
114 | 
115 | ```python
116 | df_SN7577
117 | ```
118 | 
119 | Similar information can be obtained with `df_SN7577.head()`. But here you are only returned the first 5 rows by default.
120 | 
121 | ```python
122 | df_SN7577.head()
123 | ```
124 | 
125 | ::::::::::::::::::::::::::::::::::::::: challenge
126 | 
127 | ## Exercise
128 | 
129 | 1. As well as the `head()` method there is a `tail()` method. What do you think it does? Try it.
130 | 2. Both methods accept a single numeric parameter. What do you think it does? Try it.
131 | 
132 | ::::::::::::::::::::::::::::::::::::::::::::::::::
133 | 
134 | You can obtain other basic information about your Dataframe with:
135 | 
136 | ```python
137 | # How many rows?
138 | print(len(df_SN7577))
139 | # How many rows and columns - returned as a tuple
140 | print(df_SN7577.shape)
141 | # How many 'cells' in the table
142 | print(df_SN7577.size)
143 | # What are the column names
144 | print(df_SN7577.columns)
145 | # What are the data types of the columns?
146 | print(df_SN7577.dtypes)
147 | ```
148 | 
149 | ```output
150 | 1286
151 | (1286, 202)
152 | 259772
153 | Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av',
154 |        'Q5avi',
155 |        ...
156 |        'numhhd', 'numkid', 'numkid2', 'numkid31', 'numkid32', 'numkid33',
157 |        'numkid34', 'numkid35', 'numkid36', 'wts'],
158 |       dtype='object', length=202)
159 | Q1     int64
160 | Q2     int64
161 | Q3     int64
162 | ...
163 | Length: 202, dtype: object
164 | ```
165 | 
166 | ::::::::::::::::::::::::::::::::::::::: challenge
167 | 
168 | ## Exercise
169 | 
170 | When we asked for the column names and their data types, the output was abridged, i.e. we didn't get the values for all of the columns.
Can you write a small piece of code which will return all of the values?
171 | 
172 | ::::::::::::::: solution
173 | 
174 | ## Solution
175 | 
176 | ```python
177 | for name in df_SN7577.columns:
178 |     print(name)
179 | ```
180 | 
181 | :::::::::::::::::::::::::
182 | 
183 | ::::::::::::::::::::::::::::::::::::::::::::::::::
184 | 
185 | :::::::::::::::::::::::::::::::::::::::: keypoints
186 | 
187 | - pandas is a Python library containing functions and data structures to assist in data analysis
188 | - pandas data structures are the Series (like a vector) and the Dataframe (like a table)
189 | - the pandas `read_csv` function allows you to read an entire `csv` file into a Dataframe
190 | 
191 | ::::::::::::::::::::::::::::::::::::::::::::::::::
192 | 
193 | 
194 | -------------------------------------------------------------------------------- /episodes/09-extracting-data.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Extracting rows and columns
3 | teaching: 15
4 | exercises: 15
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Define indexing as it relates to data structures
10 | - Select specific columns from a data frame
11 | - Select specific rows from a data frame based on conditional expressions
12 | - Use indexes to access rows and columns
13 | - Copy a data frame
14 | - Add columns to a data frame
15 | - Analyse datasets having missing/null values
16 | 
17 | ::::::::::::::::::::::::::::::::::::::::::::::::::
18 | 
19 | :::::::::::::::::::::::::::::::::::::::: questions
20 | 
21 | - How can I extract specific rows and columns from a Dataframe?
22 | - How can I add or delete columns from a Dataframe?
23 | - How can I find and change missing values in a Dataframe?
24 | 
25 | ::::::::::::::::::::::::::::::::::::::::::::::::::
26 | 
27 | We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook, make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.
28 | 
29 | ```python
30 | import pandas as pd
31 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
32 | ```
33 | 
34 | ### Selecting rows and columns from a pandas Dataframe
35 | 
36 | If we know which columns we want before we read the data from the file, we can tell `read_csv()` to only import those columns by specifying the columns we want, either by their index numbers (starting at 0) or by their names, as a list to the `usecols` parameter.
37 | 
38 | ```python
39 | df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= [0,1,2,173,174,175])
40 | print(df_SN7577_some_cols.shape)
41 | print(df_SN7577_some_cols.columns)
42 | df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= ['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'])
43 | print(df_SN7577_some_cols.columns)
44 | ```
45 | 
46 | ```output
47 | (1286, 6)
48 | Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
49 | Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
50 | ```
51 | 
52 | Let us assume for now that we read in the complete file, which is now in the Dataframe `df_SN7577`. How can we now refer to specific columns?
53 | 
54 | There are two ways of doing this using the column names (or labels):
55 | 
56 | ```python
57 | # Both of these statements are the same
58 | print(df_SN7577['Q1'])
59 | # and
60 | print(df_SN7577.Q1)
61 | ```
62 | 
63 | ```output
64 | 0     1
65 | 1     3
66 | 2    10
67 | 3     9
68 | ...
69 | ```
70 | 
71 | If we are interested in more than one column, the 2nd method above cannot be used. However, in the first, although we used a string with the value of `'Q1'`, we could also have provided a list of strings. Remember that lists are enclosed in `[]`.
72 | 
73 | ```python
74 | print(df_SN7577[['Q1', 'Q2', 'Q3']])
75 | ```
76 | 
77 | ```output
78 |     Q1  Q2  Q3
79 | 0    1  -1   1
80 | 1    3  -1   1
81 | 2   10   3   2
82 | 3    9  -1  10
83 | ...
84 | ```
85 | 
86 | ::::::::::::::::::::::::::::::::::::::: challenge
87 | 
88 | ## Exercise
89 | 
90 | What happens if you:
91 | 
92 | 1. List the columns you want out of order from the way they appear in the file?
93 | 2. Put the same column name in twice?
94 | 3. Put in a non-existing column name? (a.k.a. a typo)
95 | 
96 | ::::::::::::::: solution
97 | 
98 | ## Solution
99 | 
100 | ```python
101 | print(df_SN7577[['Q3', 'Q2']])
102 | print(df_SN7577[['Q3', 'Q2', 'Q3']])
103 | print(df_SN7577[['Q33', 'Q2']])
104 | ```
105 | 
106 | :::::::::::::::::::::::::
107 | 
108 | ::::::::::::::::::::::::::::::::::::::::::::::::::
109 | 
110 | ## Filtering by Rows
111 | 
112 | You can filter the Dataframe by rows by specifying a range in the form of `a:b`. `a` is the first row and `b` is one beyond the last row required.
113 | 
114 | ```python
115 | # select rows with index 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe)
116 | df_SN7577_some_rows = df_SN7577[1:4]
117 | df_SN7577_some_rows
118 | ```
119 | 
120 | ::::::::::::::::::::::::::::::::::::::: challenge
121 | 
122 | ## Exercise
123 | 
124 | What happens if we ask for a single row instead of a range?
125 | 
126 | ::::::::::::::: solution
127 | 
128 | ## Solution
129 | 
130 | ```python
131 | df_SN7577[1]
132 | ```
133 | 
134 | You get an error if you only specify `1`. You need to use `:1` or `0:1` to get the first row returned. The `:` is always required. You can use `:` by itself to return all of the rows.
135 | 
136 | :::::::::::::::::::::::::
137 | 
138 | ::::::::::::::::::::::::::::::::::::::::::::::::::
139 | 
140 | ## Using criteria to filter rows
141 | 
142 | It is more likely that you will want to select rows from the Dataframe based on some criteria, such as "all rows where the value for Q2 is -1".
143 | 
144 | ```python
145 | df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1)]
146 | df_SN7577_some_rows
147 | ```
148 | 
149 | The criteria can be more complex and aren't limited to a single column's values:
150 | 
151 | ```python
152 | df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)]
153 | df_SN7577_some_rows
154 | ```
155 | 
156 | We can combine the row selection with column selection:
157 | 
158 | ```python
159 | df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2','numage']]
160 | df_SN7577_some_rows
161 | ```
162 | 
163 | Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows.
164 | 
165 | There is, however, another way of selecting rows using the row index:
166 | 
167 | ```python
168 | df_SN7577_some_rows = df_SN7577.iloc[1:4]
169 | df_SN7577_some_rows
170 | ```
171 | 
172 | Using the `iloc` method gives the same results as our previous example.
173 | 
174 | However, now we can specify a single value and, more importantly, we can use the `range()` function to indicate the records that we want. This can be useful for making regular, evenly spaced selections of rows from across the Dataframe, such as every 100th row.
175 | 
176 | ```python
177 | # Select the first row from the Dataframe
178 | df_SN7577_some_rows = df_SN7577.iloc[0]
179 | df_SN7577_some_rows
180 | # select every 100th record from the Dataframe.
181 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)]
182 | df_SN7577_some_rows
183 | ```
184 | 
185 | You can also specify column ranges using the `iloc` method, again using the column index numbers:
186 | 
187 | ```python
188 | # columns 0,1,2 and 3
189 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),0:4]
190 | df_SN7577_some_rows
191 | # columns 0,1,2,78 and 95
192 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),[0,1,2,78,95]]
193 | df_SN7577_some_rows
194 | ```
195 | 
196 | There is also a `loc` method which allows you to use the column names.
197 | 
198 | ```python
199 | # columns 0,1,2,78 and 95 using the column names and changing 'iloc' to 'loc'
200 | df_SN7577_some_rows = df_SN7577.loc[range(0, len(df_SN7577), 100),['Q1', 'Q2', 'Q3', 'Q18bii', 'access6' ]]
201 | df_SN7577_some_rows
202 | ```
203 | 
204 | ## Sampling
205 | 
206 | Pandas does have a `sample` method which allows you to extract a sample of the records from the Dataframe.
207 | 
208 | ```python
209 | df_SN7577.sample(10, replace=False) # ten records, do not select same record twice (this is the default)
210 | df_SN7577.sample(frac=0.05, random_state=1) # 5% of records, same records if run again
211 | ```
212 | 
213 | :::::::::::::::::::::::::::::::::::::::: keypoints
214 | 
215 | - Import specific columns when reading in a .csv with the `usecols` parameter
216 | - We can easily chain boolean conditions when filtering rows of a pandas dataframe
217 | - The `loc` and `iloc` methods allow us to get rows with particular labels and at particular integer locations respectively
218 | - pandas has a handy `sample` method which allows us to extract a sample of rows from a dataframe
219 | 
220 | ::::::::::::::::::::::::::::::::::::::::::::::::::
221 | 
222 | 
223 | -------------------------------------------------------------------------------- /episodes/10-aggregations.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Data Aggregation using Pandas
3 | teaching: 20
4 | exercises: 10
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Access and summarize data stored in a Data Frame
10 | - Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame
11 | - Understand missing data
12 | - Change to and from 'NaN' values
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - How can I summarise the data in a data frame?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Using Pandas functions to summarise data in a Data Frame
23 | 
24 | For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique value are present in the dataset.
25 | 
26 | We shall use the SAFI\_results.csv dataset to demonstrate how we can obtain these pieces of information.
27 | 
28 | ```python
29 | import pandas as pd
30 | df_SAFI = pd.read_csv("SAFI_results.csv")
31 | df_SAFI
32 | ```
33 | 
34 | For numeric variables we can obtain a variety of basic statistical information by using the `describe()` method.
35 |
36 | ```python
37 | df_SAFI.describe()
38 | ```
39 |
40 | This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as `NaN`, you will get a warning message.
41 |
42 | You can also use `.describe()` on a single variable.
43 |
44 | ```python
45 | df_SAFI['B_no_membrs'].describe()
46 | ```
47 |
48 | There is also a set of methods which allow us to obtain individual values.
49 |
50 | ```python
51 | print(df_SAFI['B_no_membrs'].min())
52 | print(df_SAFI['B_no_membrs'].max())
53 | print(df_SAFI['B_no_membrs'].mean())
54 | print(df_SAFI['B_no_membrs'].std())
55 | print(df_SAFI['B_no_membrs'].count())
56 | print(df_SAFI['B_no_membrs'].sum())
57 | ```
58 |
59 | ```output
60 | 2
61 | 19
62 | 7.190839694656488
63 | 3.1722704895263734
64 | 131
65 | 942
66 | ```
67 |
68 | Unlike the `describe()` method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.
69 |
70 | We can do the same thing for the `E19_period_use` variable:
71 |
72 | ```python
73 | print(df_SAFI['E19_period_use'].min())
74 | print(df_SAFI['E19_period_use'].max())
75 | print(df_SAFI['E19_period_use'].mean())
76 | print(df_SAFI['E19_period_use'].std())
77 | print(df_SAFI['E19_period_use'].count())
78 | print(df_SAFI['E19_period_use'].sum())
79 | ```
80 |
81 | ```
82 | 1.0
83 | 45.0
84 | 12.043478260869565
85 | 8.583030848015385
86 | 92
87 | 1108.0
88 | ```
89 |
90 | {: output}
91 |
92 | ::::::::::::::::::::::::::::::::::::::: challenge
93 |
94 | ## Exercise
95 |
96 | Compare the count values returned for the `B_no_membrs` and the `E19_period_use` variables.
97 |
98 | 1. Why do you think they are different?
99 | 2. How does this affect the calculation of the mean values?
100 |
101 | ::::::::::::::: solution
102 |
103 | ## Solution
104 |
105 | 1. We know from when we originally displayed the contents of the `df_SAFI` Dataframe that there are 131 rows in it. This matches the value for the `B_no_membrs` count. The count for `E19_period_use`, however, is only 92. If you look at the values in the `E19_period_use` column using
106 |
107 | ```python
108 | df_SAFI['E19_period_use']
109 | ```
110 |
111 | you will see that there are several `NaN` values. They also occurred when we used `describe()` on the full Dataframe. `NaN` stands for Not a Number, i.e. the value is missing. There are only 92 non-missing values and this is what is reported by the `count()` method. This value is also used in the calculation of the mean and std values.
112 |
113 | :::::::::::::::::::::::::
114 |
115 | ::::::::::::::::::::::::::::::::::::::::::::::::::
116 |
117 | ## Dealing with missing values
118 |
119 | We can find out how many variables in our Dataframe contain any `NaN` values with the code
120 |
121 | ```python
122 | df_SAFI.isnull().sum()
123 | ```
124 |
125 | ```
126 | Column1               0
127 | A01_interview_date    0
128 | A03_quest_no          0
129 | A04_start             0
130 | ...
131 | ```
132 |
133 | {: output}
134 |
135 | or for a specific variable
136 |
137 | ```python
138 | df_SAFI['E19_period_use'].isnull().sum()
139 | ```
140 |
141 | ```
142 | 39
143 | ```
144 |
145 | {: output}
146 |
147 | Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.
148 |
149 | We have been using data from two very different sources.
150 |
151 | The SN7577 dataset is provided by the [UK Data Service](https://www.ukdataservice.ac.uk). Datasets from the UK Data Service have already been 'cleaned' and it is unlikely that there will be any genuinely missing data. However you may find that data which was missing has been replaced with a value such as '-1' or 'Not Specified'. In cases like these it may be appropriate to replace these values with 'NaN' before you try to process the data further.
152 |
153 | The SAFI dataset we have been using comes from a project called 'Studying African Farmer-led Irrigation'. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of 'NaN' values in a record (one complete questionnaire) where you would still consider the record to be complete.
154 |
155 | We have already seen how we can check for missing values. There are three other actions we need to be able to do:
156 |
157 | 1. Remove complete rows which contain `NaN`
158 | 2. Replace `NaN` with a value of our choice
159 | 3. Replace specific values with `NaN`
160 |
161 | With these options we can ensure that the data is suitable for the further processing we have planned.
162 |
163 | ### Completely remove rows with NaNs
164 |
165 | The `dropna()` method will delete a row if *any* of its values is `NaN`. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.
166 |
167 | ```python
168 | df_SAFI = pd.read_csv("SAFI_results.csv")
169 | print(df_SAFI.shape)
170 | df_SAFI.dropna(inplace=True)
171 | print(df_SAFI.shape)
172 | ```
173 |
174 | ```
175 | (131, 55)
176 | (0, 55)
177 | ```
178 |
179 | {: output}
180 |
181 | Because there are variables in the SAFI dataset which are all `NaN`, using the `dropna()` method effectively deletes all of the rows from the Dataframe, probably not what you wanted. Instead we can use the `notnull()` method as a row selection criterion and delete the rows where a specific variable has `NaN` values.
182 |
183 | ```python
184 | df_SAFI = pd.read_csv("SAFI_results.csv")
185 | print(df_SAFI.shape)
186 | df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
187 | print(df_SAFI.shape)
188 | ```
189 |
190 | ```
191 | (131, 55)
192 | (39, 55)
193 | ```
194 |
195 | {: output}
196 |
197 | ### Replace NaN with a value of our choice
198 |
199 | The `E19_period_use` variable answers the question: "For how many years have you been irrigating the land?". In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run
200 |
201 | ```python
202 | df_SAFI['E19_period_use'].describe()
203 | ```
204 |
205 | we get a count value of 92 and all of the other statistics are based on this count value.
206 |
207 | Now suppose that instead of NaN the interviewer had entered a value of 0, to indicate that land which is *not* irrigated has been irrigated for 0 years; technically correct.
208 |
209 | To see what happens we can convert all of the NaN values in the `E19_period_use` column to 0 with the following code:
210 |
211 | ```python
212 | df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].fillna(0)
213 | ```
214 |
215 | If we now run `describe()` again you can see that all of the statistics have changed, because the calculations are now based on a count of 131. Probably not what we would have wanted.
216 |
217 | Conveniently this allows us to demonstrate our 3rd action.
218 |
219 | ### Replace specific values with NaN
220 |
221 | Although we can recognise `NaN` with methods like `isnull()` or `dropna()`, actually creating a `NaN` value and putting it into a Dataframe requires the `numpy` module. The following code will replace our 0 values with `NaN`. We can demonstrate that this has occurred by running `describe()` again and seeing that we now have our original values back.
222 |
223 | ```python
224 | import numpy as np
225 | df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].replace(0, np.nan)
226 | df_SAFI['E19_period_use'].describe()
227 | ```
228 |
229 | ## Categorical variables
230 |
231 | For categorical variables, numerical statistics don't make any sense.
232 | For a categorical variable we can obtain a list of unique values used by the variable by using the `unique()` method.
233 |
234 | ```python
235 | df_SAFI = pd.read_csv("SAFI_results.csv")
236 | pd.unique(df_SAFI['C01_respondent_roof_type'])
237 | ```
238 |
239 | ```
240 | array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
241 | ```
242 |
243 | {: output}
244 |
245 | Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the `groupby` method.
246 |
247 | Having performed the `groupby()` we can then `describe()` the results. The format is similar to that which we have seen before, except that the 'grouped by' variable appears to the left and there is a set of statistics for each unique value of the variable.
248 |
249 | ```python
250 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
251 | grouped_data.describe()
252 | ```
253 |
254 | You can group by more than one variable at a time by providing them as a list.
255 |
256 | ```python
257 | grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
258 | grouped_data.describe()
259 | ```
260 |
261 | You can also obtain individual statistics if you want.
262 |
263 | ```python
264 | A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
265 | A11_years_farm
266 | ```
267 |
268 | ```
269 | C01_respondent_roof_type  C02_respondent_wall_type
270 | grass                     burntbricks                 22
271 |                           muddaub                     42
272 |                           sunbricks                    9
273 | mabatipitched             burntbricks                  6
274 |                           muddaub                      3
275 | ...
276 | ```
277 |
278 | {: output}
279 |
280 | ::::::::::::::::::::::::::::::::::::::: challenge
281 |
282 | ## Exercise
283 |
284 | 1. Read in the SAFI\_results.csv dataset.
285 | 2. Get a list of the different `C01_respondent_roof_type` values.
286 | 3. Groupby `C01_respondent_roof_type` and describe the results.
287 | 4. Remove rows with NULL values for `E_no_group_count`.
288 | 5. Repeat steps 2 \& 3 and compare the results.
289 |
290 | ::::::::::::::: solution
291 |
292 | ## Solution
293 |
294 | ```python
295 | # Steps 1 and 2
296 | import pandas as pd
297 | df_SAFI = pd.read_csv("SAFI_results.csv")
298 | print(df_SAFI.shape)
299 | print(pd.unique(df_SAFI['C01_respondent_roof_type']))
300 | ```
301 |
302 | ```python
303 | # Step 3
304 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
305 | grouped_data.describe()
306 | ```
307 |
308 | ```python
309 | # Steps 4 and 5
310 | df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
311 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
312 | print(df_SAFI.shape)
313 | print(pd.unique(df_SAFI['C01_respondent_roof_type']))
314 | grouped_data.describe()
315 | ```
316 |
317 | `E_no_group_count` is related to whether or not farm plots are irrigated. It has no obvious connection to farm buildings.
318 | By restricting the data to non-irrigated plots we have, perhaps accidentally, removed one of the roof\_types completely.
319 |
320 | :::::::::::::::::::::::::
321 |
322 | ::::::::::::::::::::::::::::::::::::::::::::::::::
323 |
324 | :::::::::::::::::::::::::::::::::::::::: keypoints
325 |
326 | - Summarising numerical and categorical variables is a very common requirement
327 | - Missing data can interfere with how statistical summaries are calculated
328 | - Missing data can be replaced or created depending on requirement
329 | - Summarising or aggregation can be done over single or multiple variables at the same time
330 |
331 | ::::::::::::::::::::::::::::::::::::::::::::::::::
332 |
333 |
334 |
--------------------------------------------------------------------------------
/episodes/11-joins.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Joining Pandas Dataframes
3 | teaching: 25
4 | exercises: 10
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Understand why we would want to join Dataframes
10 | - Know what is needed for a join to be possible
11 | - Understand the different types of joins
12 | - Understand what the joined results tell us about our data
13 |
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 |
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 |
18 | - How can I join two Dataframes with a common key?
19 |
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 |
22 | ## Joining Dataframes
23 |
24 | ### Why do we want to do this?
25 |
26 | There are many occasions when we have related data spread across multiple files.
27 |
28 | The data can be related to each other in different ways. How they are related and how completely we can join the data
29 | from the datasets will vary.
30 |
31 | In this episode we will consider different scenarios and show how we might join the data. We will use csv files and in all
32 | cases the first step will be to read the datasets into a pandas Dataframe from where we will do the joining. The csv
33 | files we are using are cut down versions of the SN7577 dataset to make the displays more manageable.
34 |
35 | First, let's download the datafiles. They are listed in the [setup page][setup-page] for the lesson. Alternatively,
36 | you can download the [GitHub repository for this lesson][gh-repo]. The data files are in the
37 | *data* directory. If you're using Jupyter, make sure to place these files in the same directory where your notebook
38 | file is.
39 |
40 | ### Scenario 1 - Two data sets containing the same columns but different rows of data
41 |
42 | Here we want to add the rows from one Dataframe to the rows of the other Dataframe.
43 | In order to do this we can use the `pd.concat()` function.
44 |
45 | ```python
46 | import pandas as pd
47 |
48 | df_SN7577i_a = pd.read_csv("SN7577i_a.csv")
49 | df_SN7577i_b = pd.read_csv("SN7577i_b.csv")
50 | ```
51 |
52 | Have a quick look at what these Dataframes look like with
53 |
54 | ```python
55 | print(df_SN7577i_a)
56 | print(df_SN7577i_b)
57 | ```
58 |
59 | ```
60 |    Id  Q1  Q2  Q3  Q4
61 | 0   1   1  -1   1   8
62 | 1   2   3  -1   1   4
63 | 2   3  10   3   2   6
64 | 3   4   9  -1  10  10
65 | ...
66 |
67 |      Id  Q1  Q2  Q3  Q4
68 | 0  1277  10  10   4   6
69 | 1  1278   2  -1   5   4
70 | 2  1279   2  -1   4   5
71 | 3  1280   1  -1   2   3
72 | ...
73 | ```
74 |
75 | {: output}
76 |
77 | The `concat()` function appends the rows from the two Dataframes to create the df\_all\_rows Dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the `index`.
78 |
79 | ```python
80 | df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b])
81 | df_all_rows
82 | ```
83 |
84 | We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default
85 | indexes would have been created by pandas. When we concatenated the Dataframes, the indexes were also concatenated, resulting in duplicate entries.
86 |
87 | This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.
88 |
89 | ```python
90 | df_all_rows = df_all_rows.reset_index(drop=True)
91 |
92 | # or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
93 | df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b], ignore_index=True)
94 |
95 | df_all_rows
96 | ```
97 |
98 | What if the columns in the Dataframes are not the same?
99 |
100 | ```python
101 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
102 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
103 | df_all_rows = pd.concat([df_SN7577i_aa, df_SN7577i_bb])
104 | df_all_rows
105 | ```
106 |
107 | In this case `df_SN7577i_aa` has no `Q4` column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the
108 | resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4`
109 | column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.
110 |
111 | ### Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe
112 |
113 | ```python
114 | df_SN7577i_c = pd.read_csv("SN7577i_c.csv")
115 | df_SN7577i_d = pd.read_csv("SN7577i_d.csv")
116 | df_all_cols = pd.concat([df_SN7577i_c, df_SN7577i_d], axis=1)
117 | df_all_cols
118 | ```
119 |
120 | We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id`
121 | column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not
122 | necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
123 |
124 | ### Scenario 3 - Using merge to join columns
125 |
126 | We can join columns from two Dataframes using the `merge()` function. This is similar to the SQL 'join' functionality.
127 |
128 | A detailed discussion of different join types is given in the [SQL lesson](./episodes/sql...).
129 |
130 | You specify the type of join you want using the `how` parameter. The default is the `inner` join, which returns the columns from both tables where the `key` or common column values match in both Dataframes.
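Before looking at the full set of options, here is a minimal sketch of how the choice of `how` changes the result, using two small made-up Dataframes (the frames and their values are invented purely for this illustration):

```python
import pandas as pd

# two tiny, made-up Dataframes sharing an 'Id' column
left = pd.DataFrame({'Id': [1, 2, 3], 'Q1': [10, 20, 30]})
right = pd.DataFrame({'Id': [2, 3, 4], 'Q2': [200, 300, 400]})

# inner: only the Ids present in both Dataframes (2 and 3)
print(pd.merge(left, right, how='inner', on='Id'))

# outer: all Ids from both Dataframes, with NaN where one side has no match
print(pd.merge(left, right, how='outer', on='Id'))
```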
132 |
133 | The possible values of the `how` parameter are shown in the picture below (taken from the Pandas documentation).
134 |
135 | ![](fig/pandas_join_types.png){alt='pandas\_join\_types'}
136 |
137 | The different join types behave in the same way as they do in SQL. In Python/pandas, any missing values are shown as `NaN`.
138 |
139 | In order to `merge` the Dataframes we need to identify a column common to both of them.
140 |
141 | ```python
142 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner')
143 | df_cd
144 | ```
145 |
146 | In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to
147 | join on. In this example, that is the `Id` column.
148 |
149 | Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using
150 | the `on` parameter.
151 |
152 | ```python
153 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on='Id')
154 | ```
155 |
156 | In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you
157 | can use the `left_on` and `right_on` parameters to specify them separately.
158 |
159 | ```python
160 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', left_on='Id', right_on='Id')
161 | ```
162 |
163 | ::::::::::::::::::::::::::::::::::::::: challenge
164 |
165 | ## Practice with data
166 |
167 | 1. Examine the contents of the `SN7577i_aa` and `SN7577i_bb` csv files using Excel or equivalent.
168 | 2. Using the `SN7577i_aa` and `SN7577i_bb` csv files, create a Dataframe which is the result of an outer join using the `Id` column to join on.
169 | 3. What do you notice about the column names in the new Dataframe?
170 | 4. Using `shift`\+`tab` in Jupyter, examine the possible parameters for the `merge()` function.
171 | 5. Re-write the code so that the column names which are common to both files have suffixes indicating the file from which they come.
172 | 6. If you add the parameter `indicator=True`, what additional information is provided in the resulting Dataframe?
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | ```python
179 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
180 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
181 | df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on='Id')
182 | df_aabb
183 | ```
184 |
185 | ```python
186 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
187 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
188 | df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on='Id', suffixes=('_aa', '_bb'), indicator=True)
189 | df_aabb
190 | ```
191 |
192 | :::::::::::::::::::::::::
193 |
194 | ::::::::::::::::::::::::::::::::::::::::::::::::::
195 |
196 | [setup-page]: https://datacarpentry.org/python-socialsci/setup.html
197 | [gh-repo]: https://github.com/datacarpentry/python-socialsci/archive/gh-pages.zip
198 |
199 |
200 | :::::::::::::::::::::::::::::::::::::::: keypoints
201 |
202 | - You can join pandas Dataframes in much the same way as you join tables in SQL
203 | - The `concat()` function can be used to concatenate two Dataframes by adding the rows of one to the other
204 | - `concat()` can also combine Dataframes by columns but the `merge()` function is the preferred way
205 | - The `merge()` function is equivalent to the SQL JOIN clause.
  'left', 'right' and 'inner' joins are all possible.
206 |
207 | ::::::::::::::::::::::::::::::::::::::::::::::::::
208 |
209 |
210 |
--------------------------------------------------------------------------------
/episodes/12-long-and-wide.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Wide and long data formats
3 | teaching: 20
4 | exercises: 15
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Explain the difference between long and wide formats and why each might be used
10 | - Illustrate how to change between formats using the `melt()` and `pivot()` methods
11 |
12 | ::::::::::::::::::::::::::::::::::::::::::::::::::
13 |
14 | :::::::::::::::::::::::::::::::::::::::: questions
15 |
16 | - What are long and wide formats?
17 | - Why would I want to change between them?
18 |
19 | ::::::::::::::::::::::::::::::::::::::::::::::::::
20 |
21 | ## Wide and long data formats
22 |
23 | In the SN7577 dataset that we have been using there is a group of columns which record which daily newspapers each respondent reads. Despite un-informative names like 'daily1', each column refers to a current UK daily national or local newspaper.
24 |
25 | Whether the paper is read or not is recorded using the values of 0 or 1 as a boolean indicator. The advantage of using a column for each paper is that, should a respondent read multiple newspapers, all of the required information can still be recorded in a single record.
26 |
27 | Recording information in this *wide* format is not always beneficial when trying to analyse the data.
28 |
29 | Pandas provides methods for converting data from *wide* to *long* format and from *long* to *wide* format.
30 |
31 | The SN7577 dataset does not contain a variable that can be used to uniquely identify a row. This is often referred to as a 'primary key' field (or column).
32 |
33 | A dataset doesn't need to have such a key. None of the work we have done so far has required it.
34 |
35 | When we create a pandas Dataframe by importing a csv file, we have seen that pandas will create an index for the rows. This index can be used a bit like a key field, but as we have seen there can be problems with the index when we concatenate two Dataframes together.
36 |
37 | In the version of SN7577 that we are going to use to demonstrate long and wide formats, we will add a new variable with the name 'Id' and we will restrict the other columns to those starting with the word 'daily'.
38 |
39 | ```python
40 | import pandas as pd
41 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
42 | ```
43 |
44 | We will create a new Dataframe with a single column of 'Id'.
45 |
46 | ```python
47 | # create an 'Id' column
48 | df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])
49 | ```
50 |
51 | Using the range function, we can create values of `Id` starting with 1 and going up to 1286 (remember that the second parameter to `range` is one past the last value used). We have explicitly coded this value because we knew how many rows were in the dataset. If we didn't, we could have used
52 |
53 | ```python
54 | len(df_SN7577.index) + 1
55 | ```
56 |
57 | ```
58 | 1287
59 | ```
60 |
61 | {: output}
62 |
63 | We will create a 2nd Dataframe, based on SN7577 but containing only the columns starting with the word 'daily'.
64 |
65 | There are several ways of doing this; we'll cover the one for which we have already covered all of the prerequisites. We will use the `filter` method of `pandas` with its `like` parameter.
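As a quick illustration of what `filter` with `like` does, here is a minimal sketch on a small made-up Dataframe (the column names are invented for the example):

```python
import pandas as pd

# filter(like='daily') keeps every column whose name contains 'daily'
demo = pd.DataFrame(columns=['Id', 'daily1', 'daily2', 'sunday1'])
print(demo.filter(like='daily').columns)   # only daily1 and daily2 survive
```

Applying the same idea to SN7577: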
66 |
67 | ```python
68 | df_papers2 = df_SN7577.filter(like='daily')
69 | ```
70 |
71 | The value supplied to `like` can occur anywhere in the column name to be matched (and therefore selected).
72 |
73 | ::::::::::::::::::::::::::::::::::::::::: callout
74 |
75 | ## Another way
76 |
77 | If we knew the column numbers and they were all contiguous, we could use the `iloc` method and provide the index values of the range of columns we want.
78 |
79 | ```python
80 | df_papers2 = df_SN7577.iloc[:, 118:143]
81 | ```
82 |
83 | ::::::::::::::::::::::::::::::::::::::::::::::::::
84 |
85 | To create the Dataframe that we will use, we will concatenate the two Dataframes we have created.
86 |
87 | ```python
88 | df_papers = pd.concat([df_papers1, df_papers2], axis=1)
89 | print(df_papers.index)
90 | print(df_papers.columns)
91 | ```
92 |
93 | ```
94 | RangeIndex(start=0, stop=1286, step=1)
95 | Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
96 |        'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
97 |        'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
98 |        'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
99 |        'daily25'],
100 |       dtype='object')
101 | ```
102 |
103 | {: output}
104 |
105 | We use `axis=1` because we are joining by columns; the default is joining by rows (`axis=0`).
106 |
107 | ## From 'wide' to 'long'
108 |
109 | To make the displays more manageable we will use only the first seven 'daily' columns.
110 |
111 | ```python
112 | ## using df_papers
113 | daily_list = df_papers.columns[1:8]
114 |
115 | df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)
116 |
117 | # by default, the new columns created will be called 'variable', which holds the name of the
118 | # 'daily' column, and 'value', which holds that 'daily' value for that 'Id'. So, we will rename the columns
119 |
120 | df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
121 | df_daily_papers_long
122 | ```
123 |
124 | We now have a Dataframe that we can `groupby`.
125 |
126 | We want to `groupby` the `Daily_paper` and then sum the `Value`.
127 |
128 | ```python
129 | a = df_daily_papers_long.groupby('Daily_paper')['Value'].sum()
130 | a
131 | ```
132 |
133 | ```
134 | Daily_paper
135 | daily1      0
136 | daily2     26
137 | daily3     52
138 | ```
139 |
140 | {: output}
141 |
142 | ## From Long to Wide
143 |
144 | The process can be reversed by using the `pivot()` method.
145 | Here we need to indicate which column (or columns) remains fixed (this will become an index in the new Dataframe), which column contains the values which are to become the new column names, and which column contains the values to fill those new columns with.
146 |
147 | In our case we want to use the `Id` column as the fixed column, the `Daily_paper` column contains the column names and the `Value` column contains the values.
148 |
149 | ```python
150 | df_daily_papers_wide = df_daily_papers_long.pivot(index='Id', columns='Daily_paper', values='Value')
151 | ```
152 |
153 | We can change our `Id` index back to an ordinary column with
154 |
155 | ```python
156 | df_daily_papers_wide.reset_index(level=0, inplace=True)
157 | ```
158 |
159 | ::::::::::::::::::::::::::::::::::::::: challenge
160 |
161 | ## Exercise
162 |
163 | 1. Find out how many people take each of the daily newspapers by Title.
164 | 2. Which titles don't appear to be read by anyone?
165 |
166 | There is a file called Newspapers.csv which lists all of the newspaper Titles along with the corresponding 'daily' value.
167 |
168 | Hint: Newspapers.csv contains both daily and Sunday newspapers. You can filter out the Sunday papers with the following code:
169 |
170 | ```python
171 | df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
172 | ```
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | 1. Read in the Newspapers.csv file and keep only the dailies.
179 |
180 | ```python
181 | df_newspapers = pd.read_csv("Newspapers.csv")
182 | df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
183 | df_newspapers
184 | ```
185 |
186 | 2. Create the df\_papers Dataframe as we did before.
187 |
188 | ```python
189 | import pandas as pd
190 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
191 | # create an 'Id' column
192 | df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])
193 | df_papers2 = df_SN7577.filter(like='daily')
194 | df_papers = pd.concat([df_papers1, df_papers2], axis=1)
195 | df_papers
196 | ```
197 |
198 | 3. Create a list of all of the dailies; one way would be:
199 |
200 | ```python
201 | daily_list = []
202 | for i in range(1, 26):
203 |     daily_list.append('daily' + str(i))
204 | ```
205 |
206 | 4. Pass the list as the `value_vars` parameter to the `melt()` method.
207 |
208 | ```python
209 | # use melt to create df_daily_papers_long
210 | df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)
211 | # Change the column names
212 | df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
213 | ```
214 |
215 | 5. `merge` the two Dataframes with a left join, because we want all of the Newspaper Titles to be included.
216 |
217 | ```python
218 | df_papers_taken = pd.merge(df_newspapers, df_daily_papers_long, how='left', left_on='Column_name', right_on='Daily_paper')
219 | ```
220 |
221 | 6. Then `groupby` the 'Title' and sum the 'Value'.
222 |
223 | ```python
224 | df_papers_taken.groupby('Title')['Value'].sum()
225 | ```
226 |
227 | :::::::::::::::::::::::::
228 |
229 | ::::::::::::::::::::::::::::::::::::::::::::::::::
230 |
231 | :::::::::::::::::::::::::::::::::::::::: keypoints
232 |
233 | - The `melt()` method can be used to change from wide to long format
234 | - The `pivot()` method can be used to change from long to wide format
235 | - Aggregations are best done from data in the long format.
236 |
237 | ::::::::::::::::::::::::::::::::::::::::::::::::::
238 |
239 |
240 |
--------------------------------------------------------------------------------
/episodes/14-sqlite.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Accessing SQLite Databases'
3 | teaching: 35
4 | exercises: 25
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Use the sqlite3 module to interact with a SQL database
10 | - Access data stored in SQLite using Python
11 | - Describe the difference in interacting with data stored as a CSV file versus in SQLite
12 | - Describe the benefits of accessing data using a database compared to a CSV file
13 |
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 |
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 |
18 | - How can I access database tables using Pandas and Python?
19 | - What are the advantages of storing data in a database?
20 |
21 | ::::::::::::::::::::::::::::::::::::::::::::::::::
22 |
23 | ## Introducing the sqlite3 module
24 |
25 | SQLite is a relational database system. Despite the 'Lite' in the name it can handle databases in excess of a Terabyte. The 'Lite' part really relates to the fact that it is a 'bare bones' system. It provides the mechanisms to create and query databases via a simple command line interface but not much else. In the SQL lesson we used a Firefox plugin to provide a GUI (Graphical User Interface) to the SQLite database engine.
26 |
27 | In this lesson we will use Python code and the sqlite3 module to access the engine. With them we can create, delete and query database tables.
28 |
29 | In practice we spend a lot of the time querying database tables.
30 |
31 | ## Pandas Dataframe vs SQL table
32 |
33 | It is very easy and often very convenient to think of SQL tables and pandas Dataframes as being similar types of objects. All of the data manipulations, slicing, dicing, aggregations and joins associated with SQL and SQL tables can be accomplished with pandas methods operating on a pandas Dataframe.
34 |
35 | The difference is that the pandas Dataframe is held in memory within the Python environment. The SQL table can largely be on disk and when you access it, it is the SQLite database engine which is doing the work. This allows you to work with very large tables which your Python environment may not have the memory to hold completely.
36 |
37 | A typical use case for SQLite databases is to hold large datasets: you use SQL commands from Python to slice and dice, and possibly aggregate, the data within the database system to reduce the size to something that Python can comfortably process, and then return the results to a Dataframe.
38 |
39 | ## Accessing data stored in SQLite using Python
40 |
41 | We will illustrate the use of the `sqlite3` module by connecting to an SQLite database using both core Python and also using pandas.
42 |
43 | The database that we will use is SN7577.sqlite. This contains the data from the SN7577 dataset that we have used in other lessons.
44 |
45 | ## Connecting to an SQLite database
46 |
47 | The first thing we need to do is import the `sqlite3` library. We will import pandas at the same time for convenience.
48 |
49 | ```python
50 | import sqlite3
51 | import pandas as pd
52 | ```
53 |
54 | We will start looking at the sqlite3 library by connecting to an existing database and returning the results of running a query.
55 |
56 | Initially we will do this without using Pandas and then we will repeat the exercise so that you can see the difference.
57 |
58 | The first thing we need to do is to make a connection to the database. An SQLite database is just a file. To make a connection to it we only need to use the sqlite3 `connect()` function and specify the database file as the first parameter.
59 |
60 | The connection is assigned to a variable. You could use any variable name, but 'con' is quite commonly used for this purpose.
61 |
62 | ```python
63 | con = sqlite3.connect('SN7577.sqlite')
64 | ```
65 |
66 | The next thing we need to do is to create a `cursor` for the connection and assign it to a variable. We do this using the `cursor` method of the connection object.
67 |
68 | The cursor allows us to pass SQL statements to the database, have them executed and then get the results back.
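Putting the two steps together, the connection we have just made and the cursor created from it look like this (a minimal sketch, assuming the database file is in the current working directory):

```python
import sqlite3

con = sqlite3.connect('SN7577.sqlite')   # connect to the database file
cur = con.cursor()                       # create a cursor on that connection
```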
69 |
70 | To execute an SQL statement we use the `execute()` method of the cursor object.
71 |
72 | The only parameter we need to pass to `execute()` is a string which contains the SQL query we wish to execute.
73 |
74 | In our example we are passing a literal string. It could have been contained in a string variable. The string can contain any valid SQL query. It could also be a valid DDL statement such as a "CREATE TABLE ...". In this lesson, however, we will confine ourselves to querying existing database tables.
75 |
76 | ```python
77 | cur = con.cursor()
78 | cur.execute("SELECT * FROM SN7577")
79 | ```
80 |
81 | ```
82 | <sqlite3.Cursor object at 0x...>
83 | ```
84 |
85 | {: output}
86 |
87 | The `execute()` method doesn't actually return any data; it just indicates that we want the data provided by running the SELECT statement.
88 |
89 | ::::::::::::::::::::::::::::::::::::::: challenge
90 |
91 | ## Exercise
92 |
93 | 1. What happens if you ask for a non-existent table, a non-existent field within a table, or just make any kind of syntax error?
94 |
95 | ::::::::::::::: solution
96 |
97 | ## Solution
98 |
99 | ```python
100 | cur = con.cursor()
101 | # notice the mistyping of 'SELECT'
102 | cur.execute("SELET * FROM SN7577")
103 | ```
104 |
105 | In all cases an error message is returned. The error message is not from Python but from SQLite. It is the same error message that you would have got had you made the same errors in the SQLite plugin.
106 |
107 | :::::::::::::::::::::::::
108 |
109 | ::::::::::::::::::::::::::::::::::::::::::::::::::
110 |
111 | Before we can make use of the results of the query we need to use the `fetchall()` method of the cursor.
112 |
113 | The `fetchall()` method returns a list. Each item in the list is a tuple containing the values from one row of the table. You can iterate through the items in a tuple in the same way as you would for a list.
114 |
115 | ```python
116 | cur = con.cursor()
117 | cur.execute("SELECT * FROM SN7577")
118 | rows = cur.fetchall()
119 | for row in rows:
120 |     print(row)
121 | ```
122 |
123 | ```
124 | (1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3
125 | ...
126 | ```
127 |
128 | {: output}
129 |
130 | The output is the data only; you do not get the column names.
131 |
132 | The column names are available from the 'description' of the cursor.
133 |
134 | ```python
135 | colnames = []
136 | for description in cur.description:
137 |     colnames.append(description[0])
138 |
139 | print(colnames)
140 | ```
141 |
142 | ```
143 | ['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a',
144 | ...
145 | ```
146 |
147 | {: output}
148 |
149 | One reason for using a database is the size of the data involved. Consequently it may not be practical to use `fetchall()`, as this will return the complete result of your query.
150 |
151 | An alternative is to use the `fetchone()` method, which as the name suggests returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to `fetchone()` will return the next record. When there are no more records it will return `None`.
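This lends itself to a common pattern: keep calling `fetchone()` in a loop until it returns `None`, so that only one row is held in memory at a time. A minimal sketch:

```python
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")

# fetch and process one row at a time until fetchone() returns None
while True:
    row = cur.fetchone()
    if row is None:
        break
    print(row)   # for the sketch we just print each row
```

Calling `fetchone()` just twice shows consecutive rows being returned: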
152 |
153 | ```python
154 | cur = con.cursor()
155 | cur.execute("SELECT * FROM SN7577")
156 | row = cur.fetchone()
157 | print(row)
158 | row = cur.fetchone()
159 | print(row)
160 | ```
161 |
162 | ```
163 | (1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
164 | ```
165 |
166 | {: output}
167 |
168 | ::::::::::::::::::::::::::::::::::::::: challenge
169 |
170 | ## Exercise
171 |
172 | Can you write code to return the first 5 records from the SN7577 table in two different ways?
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | ```python
179 | import sqlite3
180 | con = sqlite3.connect('SN7577.sqlite')
181 | cur = con.cursor()
182 |
183 | # we can use the SQLite 'limit' clause to restrict the number of rows returned and then use 'fetchall'
184 | cur.execute("SELECT * FROM SN7577 Limit 5")
185 | rows = cur.fetchall()
186 |
187 | for row in rows:
188 |     print(row)
189 |
190 | # we can use 'fetchone' in a for loop
191 | cur.execute("SELECT * FROM SN7577")
192 | for i in range(1, 6):
193 |     print(cur.fetchone())
194 |
195 | # a third way would be to use the 'fetchmany()' method
196 |
197 | cur.execute("SELECT * FROM SN7577")
198 | rows = cur.fetchmany(5)
199 |
200 | for row in rows:
201 |     print(row)
202 | ```
203 |
204 | :::::::::::::::::::::::::
205 |
206 | ::::::::::::::::::::::::::::::::::::::::::::::::::
207 |
208 | ## Using Pandas to read a database table
209 |
210 | When you use Pandas to read a database table, you connect to the database in the same way as before, using the sqlite3 `connect()` function and providing the filename of the database file.
211 |
212 | Pandas has a method `read_sql_query` to which you provide both the string containing the SQL query you wish to run and also the connection variable.
213 |
214 | The results from running the query are placed in a pandas Dataframe with the table column names automatically added.
215 |
216 | ```python
217 | con = sqlite3.connect('SN7577.sqlite')
218 | df = pd.read_sql_query("SELECT * from SN7577", con)
219 |
220 | # verify that the result of the SQL query is stored in the Dataframe
221 | print(type(df))
222 | print(df.shape)
223 | print(df.head())
224 |
225 | con.close()
226 | ```
227 |
228 | ## Saving a Dataframe as an SQLite table
229 |
230 | There may be occasions when it is convenient to save the data in your pandas Dataframe as an SQLite table for future use or for access by other systems. This can be done using the `to_sql()` method.
231 |
232 | ```python
233 | con = sqlite3.connect('SN7577.sqlite')
234 | df = pd.read_sql_query("SELECT * from SN7577", con)
235 |
236 | # select only the rows where the response to Q1 is 10, meaning an undecided voter
237 | df_undecided = df[df.Q1 == 10]
238 | print(df_undecided.shape)
239 |
240 | # Write the new Dataframe to a new SQLite table
241 | df_undecided.to_sql("Q1_undecided", con)
242 |
243 | # If you want to overwrite an existing SQLite table you can use the 'if_exists' parameter
244 | #df_undecided.to_sql("Q1_undecided", con, if_exists="replace")
245 | con.close()
246 | ```
247 |
248 | ```
249 | (335, 202)
250 | ```
251 |
252 | {: output}
253 |
254 | ## Deleting an SQLite table
255 |
256 | If you have created tables in an SQLite database, you may also want to delete them.
257 | You can do this by using the sqlite3 cursor's `execute()` method.
258 |
259 | ```python
260 | con = sqlite3.connect('SN7577.sqlite')
261 | cur = con.cursor()
262 |
263 | cur.execute('drop table if exists Q1_undecided')
264 |
265 | con.close()
266 | ```
267 |
268 | ::::::::::::::::::::::::::::::::::::::: challenge
269 |
270 | ## Exercise
271 |
272 | The code below creates an SQLite table as we have done in previous examples. Run this code to create the table.
273 |
274 | ```python
275 | con = sqlite3.connect('SN7577.sqlite')
276 | df_undecided = df[df.Q1 == 10]
277 | df_undecided.to_sql("Q1_undecided_v2", con)
278 | # leave the connection open; we will use it again in the next step
279 | ```
280 |
281 | Try using the following pandas code to delete (drop) the table.
282 |
283 | ```python
284 | pd.read_sql_query("drop table Q1_undecided_v2", con)
285 | ```
286 |
287 | 1. What happens?
288 | 2. Run this line of code again. What is different?
289 | 3. Can you explain the difference, and does the table now exist or not?
290 |
291 | ::::::::::::::: solution
292 |
293 | ## Solution
294 |
295 | 1. When the line of code is run the first time you get an error message: 'NoneType' object is not iterable.
296 |
297 | 2. When you run it a second time you get a different error message:
298 | DatabaseError: Execution failed on sql 'drop table Q1\_undecided\_v2': no such table: Q1\_undecided\_v2
299 |
300 | 3. The `read_sql_query()` method is designed to send the SQL containing your query to the SQLite execution engine, which will execute the SQL and return the output to pandas, which will create a Dataframe from the results.
301 |
302 | The SQL statement we sent is valid SQL, but it doesn't return rows from a table; it simply reports success or failure (in dropping the table, in this case). The first time we run it the table is deleted and a response to that effect is returned. The response cannot be converted to a Dataframe, hence the first error message, which is a pandas error.
303 |
304 | When we run it for the second time, the table has already been dropped, so this time the error message is from SQLite saying the table didn't exist. Pandas recognises that this is an SQLite error message and simply passes it on to the user.
305 |
306 | The moral of the story: pandas may be better for getting data returned into a Dataframe, but there are some things best left to the sqlite functions directly.
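For completeness, a sketch of the more robust route using the sqlite3 cursor directly, as we did earlier in the episode ('if exists' makes the statement safe to run whether or not the table is still there):

```python
con = sqlite3.connect('SN7577.sqlite')
cur = con.cursor()
cur.execute('drop table if exists Q1_undecided_v2')
con.close()
```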
307 | 308 | ::::::::::::::::::::::::: 309 | 310 | :::::::::::::::::::::::::::::::::::::::::::::::::: 311 | 312 | :::::::::::::::::::::::::::::::::::::::: keypoints 313 | 314 | - The SQLite database system is directly available from within Python 315 | - A database table and a pandas Dataframe can be considered similar structures 316 | - Using pandas to return all of the results from a query is simpler than using sqlite3 alone 317 | 318 | :::::::::::::::::::::::::::::::::::::::::::::::::: 319 | 320 | 321 | -------------------------------------------------------------------------------- /episodes/data/Newspapers.csv: -------------------------------------------------------------------------------- 1 | Column_name,Title 2 | daily1,THE GLASGOW HERALD 3 | daily2,THE INDEPENDENT 4 | daily3,THE DAILY TELEGRAPH 5 | daily4,THE GUARDIAN 6 | daily5,THE FINANCIAL TIMES 7 | daily6,THE TIMES 8 | daily7,THE SCOTSMAN 9 | daily8,DAILY EXPRESS 10 | daily9,DAILY MAIL 11 | daily10,DAILY RECORD 12 | daily11,THE SUN 13 | daily12,DAILY MIRROR 14 | daily13,DAILY STAR 15 | daily14,WESTERN MAIL (WALES) 16 | daily15,BELFAST TELEGRAPH 17 | daily16,IRISH NEWS 18 | daily17,NEWS LETTER (ULSTER) 19 | daily18,THE METRO 20 | daily19,EVENING STANDARD 21 | daily20,I NEWSPAPER 22 | daily21,BROADSHEET 23 | daily22,MID MARKET 24 | daily23,TABLOID 25 | daily24,NONE OF THESE 26 | daily25,DON'T KNOW 27 | sunday1,SUNDAY MAIL (SCOTLAND) 28 | sunday2,THE MAIL ON SUNDAY 29 | sunday3,SUNDAY POST 30 | sunday4,THE INDEPENDENT ON SUNDAY 31 | sunday5,SUNDAY TIMES 32 | sunday6,SUNDAY TELEGRAPH 33 | sunday7,SUNDAY EXPRESS 34 | sunday8,OBSERVER 35 | sunday9,NEWS OF THE WORLD 36 | sunday10,THE PEOPLE 37 | sunday11,SUNDAY MIRROR 38 | sunday12,SUNDAY SPORT 39 | sunday13,SCOTLAND ON SUNDAY 40 | sunday14,DAILY STAR SUNDAY 41 | sunday15,THE SUN (SUNDAY) 42 | sunday16,BROADSHEET 43 | sunday17,MID-MARKET 44 | sunday18,TABLOIDS 45 | sunday19,NONE OF THESE 46 | sunday20,DON'T KNOW 47 | -------------------------------------------------------------------------------- /episodes/data/Newspapers.txt: -------------------------------------------------------------------------------- 1 | daily1,"THE GLASGOW HERALD" 2 | daily2,"THE INDEPENDENT" 3 | daily3,"THE DAILY TELEGRAPH" 4 | daily4,"THE GUARDIAN" 5 | daily5,"THE FINANCIAL TIMES" 6 | daily6,"THE TIMES" 7 | daily7,"THE SCOTSMAN" 8 | daily8,"DAILY EXPRESS" 9 | daily9,"DAILY MAIL" 10 | daily10,"DAILY RECORD" 11 | daily11,"THE SUN" 12 | daily12,"DAILY MIRROR" 13 | daily13,"DAILY STAR" 14 | daily14,"WESTERN MAIL (WALES)" 15 | daily15,"BELFAST TELEGRAPH" 16 | daily16,"IRISH NEWS" 17 | daily17,"NEWS LETTER (ULSTER)" 18 | daily18,"THE METRO" 19 | daily19,"EVENING STANDARD" 20 | daily20,"I NEWSPAPER" 21 | daily21,"BROADSHEET" 22 | daily22,"MID MARKET" 23 | daily23,"TABLOID" 24 | daily24,"NONE OF THESE" 25 | daily25,"DON'T KNOW" 26 | sunday1,"SUNDAY MAIL (SCOTLAND)" 27 | sunday2,"THE MAIL ON SUNDAY" 28 | sunday3,"SUNDAY POST" 29 | sunday4,"THE INDEPENDENT ON SUNDAY" 30 | sunday5,"SUNDAY TIMES" 31 | sunday6,"SUNDAY TELEGRAPH" 32 | sunday7,"SUNDAY EXPRESS" 33 | sunday8,"OBSERVER" 34 | sunday9,"NEWS OF THE WORLD" 35 | sunday10,"THE PEOPLE" 36 | sunday11,"SUNDAY MIRROR" 37 | sunday12,"SUNDAY SPORT" 38 | sunday13,"SCOTLAND ON SUNDAY" 39 | sunday14,"DAILY STAR SUNDAY" 40 | sunday15,"THE SUN (SUNDAY)" 41 | sunday16,"BROADSHEET" 42 | sunday17,"MID-MARKET" 43 | sunday18,"TABLOIDS" 44 | sunday19,"NONE OF THESE" 45 | sunday20,"DON'T KNOW" 46 | 47 | -------------------------------------------------------------------------------- 
/episodes/data/Q1.txt: -------------------------------------------------------------------------------- 1 | 1,Conservative 2 | 2,Labour 3 | 3,Liberal Democrats (Lib Dem) 4 | 4,Scottish/Welsh Nationalist 5 | 5,Green Party 6 | 6,UK Independence Party 7 | 7,British National Party (BNP) 8 | 8,Other 9 | 9,Would not vote 10 | 10,Undecided 11 | 11,Refused 12 | -------------------------------------------------------------------------------- /episodes/data/Q6.txt: -------------------------------------------------------------------------------- 1 | 1,Very interested 2 | 2,Fairly interested 3 | 3,Not very interested 4 | 4,Not at all interested 5 | 5,Don't know 6 | 7 | -------------------------------------------------------------------------------- /episodes/data/Q7.txt: -------------------------------------------------------------------------------- 1 | 1,A great deal 2 | 2,A fair amount 3 | 3,Not very much 4 | 4,Nothing at all 5 | 5,Don't know 6 | -------------------------------------------------------------------------------- /episodes/data/SAFI_crops.csv: -------------------------------------------------------------------------------- 1 | Farm,plot_no,plot_area,crop_no,crop_name 2 | 01,1,0.5,1,maize 3 | 01,2,0.5,1,maize 4 | 01,1,1.0,1,maize 5 | 01,2,1.5,1,tomatoes 6 | 01,3,1.0,1,vegetable 7 | 03,1,1.0,1,maize 8 | 04,1,1.0,1,maize 9 | 04,2,1.0,1,maize 10 | 04,3,1.0,1,sorghum 11 | 05,1,1.5,1,maize 12 | 05,2,0.5,1,maize 13 | 6,1,1.5,1,maize 14 | 7,1,2.0,1,maize 15 | 7,2,1.0,1,tomatoes 16 | 7,3,1.0,1,beans 17 | 7,4,1.0,1,vegetable 18 | 08,1,1.0,1,maize 19 | 08,2,2.0,1,maize 20 | 9,1,1.5,1,maize 21 | 9,2,1.0,1,tomatoes 22 | 9,3,1.0,1,vegetable 23 | 10,1,3.0,1,maize 24 | 10,2,1.5,1,tomatoes 25 | 11,1,1.0,1,maize 26 | 11,2,1.0,1,maize 27 | 12,1,3.5,1,maize 28 | 12,2,1.5,1,tomatoes 29 | 13,1,12.0,1,maize 30 | 13,2,1.0,1,beans 31 | 13,3,1.0,1,vegetable 32 | 13,4,1.0,1,onion 33 | 14,1,1.0,1,maize 34 | 14,2,1.0,1,maize 35 | 14,3,1.0,1,maize 36 | 15,1,15.0,1,maize 37 | 15,2,1.0,1,beans 38 | 15,3,0.5,1,ngogwe 39 | 16,1,3.5,1,maize 40 | 17,1,1.0,1,maize 41 | 17,2,0.3,1,vegetable 42 | 17,3,0.3,1,other 43 | 18,1,1.0,1,maize 44 | 18,2,0.5,1,pigeonpeas 45 | 19,1,5.5,1,maize 46 | 19,2,1.0,1,beans 47 | 20,1,1.5,1,maize 48 | 20,2,0.5,1,beans 49 | 21,1,1.0,1,maize 50 | 21,2,1.0,1,vegetable 51 | 21,3,1.0,1,tomatoes 52 | 21,4,1.0,1,cabbage 53 | 22,1,3.0,1,maize 54 | 23,1,10.0,1,maize 55 | 23,2,1.0,1,amendoim 56 | 24,1,2.0,1,maize 57 | 24,2,2.0,1,tomatoes 58 | 24,3,0.5,1,peanut 59 | 25,1,6.0,1,maize 60 | 25,2,1.0,1,tomatoes 61 | 25,3,1.0,1,cabbage 62 | 26,1,5.0,1,maize 63 | 26,2,1.0,1,beans 64 | 27,1,4.0,1,maize 65 | 27,2,1.0,1,beans 66 | 28,1,1.0,1,maize 67 | 28,2,0.5,1,tomatoes 68 | 28,3,0.5,1,vegetable 69 | 29,1,1.5,1,maize 70 | 29,2,0.2,1,tomatoes 71 | 29,3,0.5,1,beans 72 | 30,1,5.0,1,maize 73 | 31,1,1.0,1,maize 74 | 32,1,10.0,1,maize 75 | 32,2,3.0,1,sorghum 76 | 32,3,0.5,1,sunflower 77 | 32,4,1.0,1,peanut 78 | 32,5,0.5,1,tomatoes 79 | 32,6,0.3,1,onion 80 | 32,7,0.2,1,cabbage 81 | 32,8,0.5,1,vegetable 82 | 33,1,4.0,1,maize 83 | 33,2,2.0,1,tomatoes 84 | 34,1,1.0,1,tomatoes 85 | 34,2,1.5,1,maize 86 | 35,1,1.0,1,tomatoes 87 | 35,2,4.0,1,maize 88 | 36,1,1.0,1,maize 89 | 36,2,2.0,1,tomatoes 90 | 36,3,1.0,1,vegetable 91 | 37,1,1.0,1,maize 92 | 38,1,0.5,1,maize 93 | 38,2,1.0,1,tomatoes 94 | 39,1,7.0,1,maize 95 | 40,1,4.0,1,maize 96 | 40,2,2.0,1,tomatoes 97 | 41,1,5.0,1,maize 98 | 42,1,0.5,1,maize 99 | 42,2,0.5,1,other 100 | 43,1,5.0,1,maize 101 | 43,2,1.0,1,amendoim 102 | 43,2,1.0,2, 103 | 43,3,1.0,1,vegetable 104 | 
43,4,1.0,1,tomatoes 105 | 43,4,1.0,2,peanut 106 | 44,1,1.5,1,maize 107 | 45,1,1.5,1,maize 108 | 45,2,1.0,1,tomatoes 109 | 45,3,1.0,1,beans 110 | 46,1,2.0,1,maize 111 | 46,2,2.0,1,tomatoes 112 | 46,3,1.0,1,vegetable 113 | 46,4,1.0,1,onion 114 | 46,5,0.5,1,beans 115 | 47,1,2.5,1,maize 116 | 47,2,0.5,1,tomatoes 117 | 48,1,3.5,1,maize 118 | 48,1,3.5,2,sorghum 119 | 49,1,3.0,1,maize 120 | 49,1,3.0,2,sorghum 121 | 50,1,0.5,1,maize 122 | 50,1,0.5,2,tomatoes 123 | 50,1,0.5,3,piri_piri 124 | 51,1,1.0,1,maize 125 | 52,1,2.0,1,maize 126 | 21,1,5.0,1,maize 127 | 21,2,1.0,1,beans 128 | 21,2,1.0,2,tomatoes 129 | 21,2,1.0,3,cabbage 130 | 21,2,1.0,4,vegetable 131 | 21,3,1.0,1,maize 132 | 54,1,1.0,1,maize 133 | 54,1,1.0,2,tomatoes 134 | 55,1,9.0,1,maize 135 | 56,1,5.5,1,maize 136 | 56,2,1.0,1,tomatoes 137 | 57,1,1.0,1,maize 138 | 57,2,0.5,1,other 139 | 58,1,3.0,1,beans 140 | 58,2,0.5,1,other 141 | 58,3,10.0,1,maize 142 | 59,1,0.5,1,maize 143 | 59,1,0.5,2,other 144 | 59,2,1.0,1,maize 145 | 60,1,2.0,1,maize 146 | 60,2,1.0,1,tomatoes 147 | 60,3,1.0,1,peanut 148 | 61,1,2.0,1,tomatoes 149 | 61,1,2.0,2,onion 150 | 61,1,2.0,3,cabbage 151 | 61,1,2.0,4,vegetable 152 | 61,1,2.0,5,piri_piri 153 | 61,2,6.0,1,maize 154 | 62,1,0.5,1,maize 155 | 62,2,0.5,1,sorghum 156 | 63,1,1.5,1,maize 157 | 63,1,1.5,2,sesame 158 | 64,1,2.0,1,maize 159 | 64,2,0.5,1,sesame 160 | 65,1,2.0,1,maize 161 | 65,2,1.0,1,tomatoes 162 | 66,1,8.0,1,beans 163 | 66,1,8.0,2,tomatoes 164 | 66,1,8.0,3,onion 165 | 66,1,8.0,4,vegetable 166 | 67,1,6.0,1,maize 167 | 67,1,6.0,2,tomatoes 168 | 67,1,6.0,3,vegetable 169 | 68,1,1.5,1,maize 170 | 68,2,1.5,1,piri_piri 171 | 68,3,2.0,1,maize 172 | 69,1,3.0,1,maize 173 | 69,1,3.0,2,tomatoes 174 | 70,1,8.0,1,maize 175 | 70,1,8.0,2, 176 | 70,2,14.0,1,maize 177 | 70,3,0.2,1,maize 178 | 71,1,4.0,1,maize 179 | 71,2,4.0,1,tomatoes 180 | 127,1,1.0,1,maize 181 | 133,1,8.0,1,maize 182 | 133,1,8.0,2,vegetable 183 | 133,1,8.0,3,peanut 184 | 152,1,12.0,1,maize 185 | 152,1,12.0,2,sorghum 186 | 152,2,2.0,1,beans 187 | 152,2,2.0,2,tomatoes 188 | 152,2,2.0,3,vegetable 189 | 153,1,1.0,1,maize 190 | 155,1,2.0,1,maize 191 | 155,2,1.0,1,other 192 | 178,1,4.0,1,maize 193 | 178,2,1.0,1,beans 194 | 178,2,1.0,2,vegetable 195 | 178,3,2.0,1,maize 196 | 177,1,2.5,1,tomatoes 197 | 177,1,2.5,2,onion 198 | 177,1,2.5,3,vegetable 199 | 177,2,4.8,1,maize 200 | 180,1,6.0,1,maize 201 | 180,1,6.0,2,sorghum 202 | 180,2,1.0,1,tomatoes 203 | 181,1,3.0,1,maize 204 | 181,1,3.0,2,tomatoes 205 | 181,1,3.0,3,vegetable 206 | 182,1,5.0,1,maize 207 | 182,1,5.0,2,beans 208 | 182,1,5.0,3,bananas 209 | 182,1,5.0,4,tomatoes 210 | 182,1,5.0,5,vegetable 211 | 186,1,2.0,1,maize 212 | 186,1,2.0,2,beans 213 | 186,1,2.0,3,tomatoes 214 | 186,1,2.0,4,vegetable 215 | 187,1,3.0,1,beans 216 | 187,1,3.0,2,tomatoes 217 | 187,1,3.0,3,vegetable 218 | 187,2,2.0,1,maize 219 | 195,1,2.0,1,maize 220 | 195,2,0.5,1,tomatoes 221 | 195,3,1.0,1,beans 222 | 196,1,6.0,1,maize 223 | 196,2,1.5,1,tomatoes 224 | 196,2,1.5,2,vegetable 225 | 197,1,6.5,1,maize 226 | 197,2,3.0,1,beans 227 | 197,2,3.0,2,tomatoes 228 | 197,2,3.0,3,vegetable 229 | 198,1,2.0,1,maize 230 | 198,2,1.0,1,other 231 | 201,1,6.0,1,maize 232 | 201,2,2.0,1,maize 233 | 202,1,3.0,1,maize 234 | 202,2,0.5,1,tomatoes 235 | 72,1,2.0,1,maize 236 | 72,2,3.0,1,maize 237 | 72,3,1.5,1,tomatoes 238 | 73,1,0.5,1,tomatoes 239 | 73,2,3.0,1,maize 240 | 73,3,0.5,1,beans 241 | 76,1,6.0,1,maize 242 | 76,2,6.0,1,sorghum 243 | 76,3,6.0,1,peanut 244 | 76,4,3.0,1,cabbage 245 | 76,5,3.0,1,tomatoes 246 | 76,6,3.0,1,vegetable 247 | 83,1,5.0,1,maize 248 | 
83,2,0.5,1,maize 249 | 83,2,0.5,2,other 250 | 85,1,1.0,1,maize 251 | 85,2,1.5,1,beans 252 | 85,2,1.5,2, 253 | 85,3,3.0,1,tomatoes 254 | 89,1,1.5,1,maize 255 | 89,2,3.0,1,maize 256 | 89,2,3.0,2,beans 257 | 89,2,3.0,3,vegetable 258 | 89,2,3.0,4,other 259 | 101,1,4.0,1,maize 260 | 101,2,3.0,1,beans 261 | 101,2,3.0,2,tomatoes 262 | 103,1,12.5,1,maize 263 | 103,1,12.5,2,sorghum 264 | 103,2,1.0,1,beans 265 | 103,2,1.0,2,bananas 266 | 103,2,1.0,3,vegetable 267 | 103,2,1.0,4,other 268 | 102,1,5.0,1,maize 269 | 102,2,2.0,1,tomatoes 270 | 102,2,2.0,2,onion 271 | 78,1,12.0,1,maize 272 | 78,2,1.0,1,maize 273 | 78,2,1.0,2,beans 274 | 78,2,1.0,3,tomatoes 275 | 78,2,1.0,4,onion 276 | 80,1,3.0,1,maize 277 | 80,1,3.0,2,sorghum 278 | 80,1,3.0,3,tomatoes 279 | 80,1,3.0,4,onion 280 | 80,1,3.0,5,vegetable 281 | 104,1,5.0,1,maize 282 | 104,1,5.0,2,sorghum 283 | 104,2,2.0,1,beans 284 | 104,2,2.0,2,tomatoes 285 | 104,3,0.5,1,beans 286 | 104,3,0.5,2,tomatoes 287 | 104,3,0.5,3,vegetable 288 | 104,4,1.0,1,tomatoes 289 | 104,4,1.0,2,baby_corn 290 | 105,1,3.0,1,maize 291 | 105,2,1.0,1,tomatoes 292 | 105,2,1.0,2,vegetable 293 | 106,1,22.0,1,baby_corn 294 | 106,2,16.0,1,piri_piri 295 | 106,3,4.0,1,maize 296 | 106,4,1.0,1,cabbage 297 | 106,5,2.0,1,tomatoes 298 | 109,1,5.0,1,maize 299 | 109,2,0.5,1,pigeonpeas 300 | 109,3,6.0,1,sorghum 301 | 110,1,6.0,1,maize 302 | 110,2,5.0,1,maize 303 | 110,3,3.0,1,maize 304 | 113,1,1.0,1,maize 305 | 113,2,1.0,1,beans 306 | 113,2,1.0,2,onion 307 | 113,3,0.5,1,other 308 | 118,1,3.0,1,maize 309 | 125,1,1.0,1,maize 310 | 125,2,0.5,1,beans 311 | 125,3,3.5,1,tomatoes 312 | 125,3,3.5,2,onion 313 | 125,3,3.5,3,vegetable 314 | 125,3,3.5,4,cucumber 315 | 119,1,2.0,1,maize 316 | 119,1,2.0,2,potatoes 317 | 119,2,0.5,1,maize 318 | 115,1,8.0,1,maize 319 | 115,2,3.0,1,maize 320 | 108,1,2.0,1,maize 321 | 108,1,2.0,2,sorghum 322 | 108,2,2.0,1,maize 323 | 108,3,1.5,1,vegetable 324 | 116,1,5.0,1,maize 325 | 117,1,12.0,1,maize 326 | 144,1,6.0,1,maize 327 | 144,2,1.0,1,tomatoes 328 | 143,1,4.0,1,maize 329 | 143,2,4.0,1,maize 330 | 143,3,5.5,1,beans 331 | 143,3,5.5,2,tomatoes 332 | 143,3,5.5,3,piri_piri 333 | 150,1,2.0,1,maize 334 | 150,2,0.5,1,other 335 | 159,1,2.5,1,maize 336 | 159,2,0.5,1,sorghum 337 | 160,1,2.0,1,maize 338 | 160,2,2.0,1,beans 339 | 160,3,1.5,1,tomatoes 340 | 165,1,1.0,1,maize 341 | 165,2,0.5,1,beans 342 | 166,1,0.5,1,tomatoes 343 | 166,2,1.0,1,maize 344 | 166,3,2.0,1,maize 345 | 167,1,1.5,1,maize 346 | 167,2,0.5,1,tomatoes 347 | 174,1,7.0,1,maize 348 | 174,2,3.0,1,tomatoes 349 | 174,3,1.0,1,tomatoes 350 | 175,1,2.0,1,maize 351 | 175,2,1.0,1,tomatoes 352 | 175,2,1.0,2,vegetable 353 | 189,1,7.0,1,maize 354 | 189,2,3.0,1,piri_piri 355 | 189,3,3.0,1,baby_corn 356 | 191,1,1.0,1,tomatoes 357 | 191,2,5.0,1,maize 358 | 192,1,0.5,1,tomatoes 359 | 192,2,0.5,1,maize 360 | 192,3,0.5,1,baby_corn 361 | 126,1,4.5,1,maize 362 | 126,1,4.5,2,beans 363 | 126,1,4.5,3,peanut 364 | 126,2,1.0,1,tomatoes 365 | 193,1,3.0,1,maize 366 | 193,2,2.5,1,tomatoes 367 | 193,3,0.5,1,cabbage 368 | 193,4,1.5,1,maize 369 | 194,1,5.0,1,maize 370 | 194,2,0.5,1,tomatoes 371 | 199,1,5.0,1,maize 372 | 199,2,1.0,1,vegetable 373 | 200,1,0.5,1,maize 374 | 200,2,4.5,1,maize 375 | -------------------------------------------------------------------------------- /episodes/data/SAFI_grass_roof_burntbricks.csv: -------------------------------------------------------------------------------- 1 | 
Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID 2 | 4,17/11/2016,5,2017-04-02T15:10:35.000Z,2017-04-02T17:27:35.000Z,Province1,District1,Ward2,Village2,18,no,no,40,yes,no,yes,no,7,grass,burntbricks,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,4,2,7,,10,689,-19.11221722,33.48342524,uuid:2c867811-9696-4966-9866-f35c3e97d02d 3 | 8,16/11/2016,9,2017-04-02T16:23:36.000Z,2017-04-02T16:42:08.000Z,Province1,District1,Ward2,Village3,16,no,no,6,yes,no,yes,no,8,grass,burntbricks,,earth,no,2,1,yes,3,yes,yes,6,yes,no,no,,no,never,,3,no,,more_half,yes,yes,,yes,no,3,3,8,,11,701,-19.11221518,33.48343695,uuid:846103d2-b1db-4055-b502-9cd510bb7b37 4 | 12,21/11/2016,13,2017-04-03T03:58:43.000Z,2017-04-03T04:19:36.000Z,Province1,District1,Ward2,Village2,7,yes,no,8,yes,no,yes,no,6,grass,burntbricks,,earth,no,1,1,no,4,yes,yes,7,yes,no,no,,no,never,,4,no,,more_half,yes,no,,yes,no,3,2,6,,15,706,-19.11236935,33.48355635,uuid:6c00c145-ee3b-409c-8c02-2c8d743b6918 5 | 13,21/11/2016,14,2017-04-03T04:19:57.000Z,2017-04-03T04:50:05.000Z,Province1,District1,Ward2,Village2,20,yes,no,20,yes,yes,no,yes,10,grass,burntbricks,,earth,no,3,3,no,3,no,,,,,,,,,3,,,,,yes,yes,,yes,no,3,3,10,,11,698,-19.11222089,33.4834388,uuid:9b21467f-1116-4340-a3b1-1ab64f13c87d 6 | 19,21/11/2016,20,2017-04-03T14:04:50.000Z,2017-04-03T14:20:04.000Z,Province1,District1,Ward2,Village2,24,yes,no,1,yes,yes,yes,yes,6,grass,burntbricks,,earth,yes,1,1,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,yes,1,2,6,,27,700,-19.11147317,33.47619213,uuid:d1005274-bf52-4e79-8380-3350dd7c2bac 7 | 26,21/11/2016,27,2017-04-05T04:59:42.000Z,2017-04-05T05:14:45.000Z,Province1,District1,Ward2,Village1,36,no,no,36,no,no,no,no,7,grass,burntbricks,,earth,no,3,2,yes,2,no,,,,,,,,,2,,,,,no,no,,yes,no,3,3,7,,14,679,-19.0430007,33.40508367,uuid:3197cded-1fdc-4c0c-9b10-cfcc0bf49c4d 8 | 39,17/11/2016,40,2017-04-06T08:44:51.000Z,2017-04-06T09:03:47.000Z,Province1,District1,Ward2,Village2,23,yes,yes,23,no,yes,yes,yes,9,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,23,yes,no,yes,no,no,never,,2,no,,more_half,no,no,,yes,no,1,3,9,,22.112,0,-19.0433618,33.4046671,uuid:c0b34854-eede-4e81-b183-ef58a45bfc34 9 | 56,16/11/2016,57,2017-04-08T06:26:22.000Z,2017-04-08T06:39:40.000Z,Province1,District1,Ward2,Village3,20,yes,no,27,yes,yes,yes,yes,4,grass,burntbricks,,earth,no,4,1,no,2,yes,no,20,yes,no,no,,no,never,,2,no,,less_half,no,no,,no,no,1,2,4,,10,695,-19.11227947,33.48338576,uuid:a7184e55-0615-492d-9835-8f44f3b03a71 10 | 
59,16/11/2016,60,2017-04-08T09:03:01.000Z,2017-04-08T09:20:18.000Z,Province1,District2,Ward2,Village3,12,yes,no,15,yes,yes,yes,yes,8,grass,burntbricks,,earth,no,3,2,no,3,yes,yes,12,yes,no,no,,no,never,,3,no,,more_half,yes,yes,,yes,no,4,2,8,,11,694,-19.11225763,33.48341208,uuid:85465caf-23e4-4283-bb72-a0ef30e30176 11 | 70,18/11/2016,71,2017-04-09T15:00:19.000Z,2017-04-09T15:19:22.000Z,Province1,District1,Ward2,Village1,12,yes,no,14,yes,yes,yes,yes,6,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,1,yes,no,yes,no,no,more_once,,2,no,,more_half,no,yes,,no,no,3,2,6,,4,696,-19.11220603,33.48332601,uuid:761f9c49-ec93-4932-ba4c-cc7b78dfcef1 12 | 71,16/11/2016,127,2017-04-09T05:16:06.000Z,2017-04-09T05:27:41.000Z,Province1,District1,Ward2,Village3,10,yes,no,18,no,no,no,no,4,grass,burntbricks,,earth,no,3,8,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,1,2,4,,8,676,-19.11221396,33.48341359,uuid:f6d04b41-b539-4e00-868a-0f62b427587d 13 | 73,24/11/2016,152,2017-04-09T05:47:31.000Z,2017-04-09T06:16:11.000Z,Province1,District1,Ward2,Village1,15,yes,no,16,yes,no,yes,no,10,grass,burntbricks,,cement,no,3,1,no,2,yes,yes,16,yes,no,yes,no,no,once,,2,no,,abt_half,no,no,,yes,yes,3,3,10,,11,702,-19.1121034,33.48344669,uuid:59738c17-1cda-49ee-a563-acd76f6bc487 14 | 74,24/11/2016,153,2017-04-09T06:16:49.000Z,2017-04-09T06:28:48.000Z,Province1,District1,Ward2,Village1,41,no,no,41,yes,yes,yes,yes,5,grass,burntbricks,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,2,5,,12,691,-19.11223572,33.48345393,uuid:7e7961ca-fa1c-4567-9bfa-a02f876e4e03 15 | 75,24/11/2016,155,2017-04-09T06:35:16.000Z,2017-04-09T06:48:01.000Z,Province1,District1,Ward2,Village2,5,no,no,4,no,no,no,no,4,grass,burntbricks,,earth,no,2,1,no,2,no,,,,,,,,,2,,,,,yes,no,,yes,no,1,2,4,,13,713,-19.11220708,33.48342364,uuid:77b3021b-a9d6-4276-aaeb-5bfcfd413852 16 | 83,28/11/2016,195,2017-04-09T16:13:19.000Z,2017-04-09T16:35:24.000Z,Province1,District1,Ward2,Village2,7,yes,no,48,yes,yes,no,no,5,grass,burntbricks,,earth,no,3,1,no,3,yes,yes,5,yes,no,no,,no,never,,3,no,,more_half,yes,no,,no,no,3,2,5,,9,706,-19.11220559,33.48342104,uuid:2c132929-9c8f-450a-81ff-367360ce2c19 17 | 86,28/11/2016,198,2017-04-09T19:15:21.000Z,2017-04-09T19:27:56.000Z,Province1,District1,Ward2,Village2,11,no,no,49,no,no,no,no,3,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,11,yes,no,no,,no,never,,2,no,,less_half,yes,no,,yes,no,1,3,3,,7,716,-19.11212338,33.48338774,uuid:28c64954-739c-444c-a6e0-355878e471c8 18 | 111,11/05/2017,116,2017-05-11T06:09:56.000Z,2017-05-11T06:22:19.000Z,Province1,District1,Ward2,Village1a,21,yes,no,25,no,no,no,no,5,grass,burntbricks,,cement,no,2,3,yes,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,3,3,5,,20,0,-19.1114691,33.4761047,uuid:cfee6297-2c0e-4f8a-94cc-9aaee0bd64cb 19 | 114,18/05/2017,143,2017-05-18T05:55:04.000Z,2017-05-18T06:37:10.000Z,Province1,District1,Ward2,Village1,24,yes,no,24,yes,no,no,no,10,grass,burntbricks,,earth,no,3,2,yes,3,yes,yes,12,yes,no,no,,no,frequently,,3,no,,more_half,yes,no,,yes,no,3,3,10,,1911,0,-19.1124845,33.4763322,uuid:9a096a12-b335-468c-b3cc-1191180d62de 20 | 118,03/06/2017,165,2017-06-03T05:32:33.000Z,2017-06-03T05:51:49.000Z,Province1,District1,Ward2,Village1,14,no,no,14,no,no,no,no,9,grass,burntbricks,,earth,no,1,1,yes,2,yes,yes,10,yes,no,no,,no,never,,2,no,,less_half,no,no,,yes,no,3,3,9,,9,708,-19.11217437,33.48346513,uuid:62f3f7af-f0f3-4f88-b9e0-acf8baa49ae4 21 | 
121,03/06/2017,174,2017-06-03T06:50:47.000Z,2017-06-03T07:20:21.000Z,Province1,District1,Ward2,Village1,13,no,no,25,yes,yes,yes,yes,12,grass,burntbricks,,cement,no,2,2,yes,3,yes,yes,13,yes,yes,no,,no,never,,3,no,,more_half,yes,no,,yes,no,3,3,12,,3,703,-19.11216751,33.48340539,uuid:43ec6132-478c-4f87-878d-fb3c0c4d0c74 22 | 125,03/06/2017,192,2017-06-03T16:17:55.000Z,2017-06-03T17:16:39.000Z,Province1,District1,Ward2,Village3,15,yes,no,20,yes,yes,yes,yes,9,grass,burntbricks,,cement,yes,4,1,no,3,yes,yes,5,yes,no,no,,no,once,,3,no,,more_half,yes,yes,,yes,no,1,3,9,,4,705,-19.11215404,33.48335905,uuid:f94409a6-e461-4e4c-a6fb-0072d3d58b00 23 | 126,18/05/2017,126,2017-05-18T04:13:37.000Z,2017-05-18T04:35:47.000Z,Province1,District1,Ward2,Village1,5,yes,no,7,yes,yes,yes,yes,3,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,4,yes,no,no,,no,more_once,,2,no,,less_half,yes,yes,,yes,no,3,3,3,,7,700,-19.11219355,33.48337856,uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965 24 | -------------------------------------------------------------------------------- /episodes/data/SAFI_grass_roof_muddaub.csv: -------------------------------------------------------------------------------- 1 | Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID 2 | 0,17/11/2016,1,2017-03-23T09:49:57.000Z,2017-04-02T17:29:08.000Z,Province1,District1,Ward2,Village2,11,no,no,4,no,yes,no,yes,3,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,2,3,,14,698,-19.11225943,33.48345609,uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef 3 | 1,17/11/2016,1,2017-04-02T09:48:16.000Z,2017-04-02T17:26:19.000Z,Province1,District1,Ward2,Village2,2,yes,no,9,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,no,3,yes,yes,2,yes,no,yes,no,no,once,,3,no,,more_half,yes,no,,yes,no,3,2,7,,19,690,-19.11247712,33.48341568,uuid:099de9c9-3e5e-427b-8452-26250e840d6e 4 | 5,17/11/2016,6,2017-04-02T15:27:25.000Z,2017-04-02T17:28:02.000Z,Province1,District1,Ward2,Village2,3,no,no,3,no,no,no,no,3,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,2,3,,12,692,-19.1121959,33.48339187,uuid:daa56c91-c8e3-44c3-a663-af6a49a2ca70 5 | 6,17/11/2016,7,2017-04-02T15:38:01.000Z,2017-04-02T17:28:19.000Z,Province1,District1,Ward2,Village2,20,no,no,38,yes,no,yes,no,6,grass,muddaub,,earth,no,1,1,yes,4,yes,yes,10,yes,no,no,,no,never,,4,no,,more_half,no,no,,no,no,1,3,6,,11,709,-19.11221904,33.48336498,uuid:ae20a58d-56f4-43d7-bafa-e7963d850844 6 | 15,24/11/2016,16,2017-04-03T05:29:24.000Z,2017-04-03T05:40:53.000Z,Province1,District1,Ward2,Village2,24,yes,no,47,yes,yes,yes,yes,6,grass,muddaub,,earth,yes,2,1,yes,1,no,,,,,,,,,1,,,,,yes,yes,,no,no,4,3,6,,9,709,-19.11210678,33.48344397,uuid:d17db52f-4b87-4768-b534-ea8f9704c565 7 | 
17,21/11/2016,18,2017-04-03T12:27:04.000Z,2017-04-03T12:39:48.000Z,Province1,District1,Ward2,Village2,6,no,no,20,yes,yes,no,yes,4,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,no,3,2,4,,17,685,-19.11133515,33.47630848,uuid:7ffe7bd1-a15c-420c-a137-e1f006c317a3 8 | 21,21/11/2016,22,2017-04-03T16:28:52.000Z,2017-04-03T16:40:47.000Z,Province1,District1,Ward2,Village2,14,no,no,20,yes,yes,yes,yes,4,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,2,4,,9,722,-19.11212691,33.48350764,uuid:a51c3006-8847-46ff-9d4e-d29919b8ecf9 9 | 27,21/11/2016,28,2017-04-05T05:14:49.000Z,2017-04-05T05:36:18.000Z,Province1,District1,Ward2,Village1,2,no,no,2,no,no,no,no,2,grass,muddaub,,earth,no,1,1,yes,3,yes,yes,2,yes,no,no,,no,more_once,,3,no,,more_half,no,yes,,yes,no,1,3,2,,7,721,-19.04290893,33.40506932,uuid:1de53318-a8cf-4736-99b1-8239f8822473 10 | 29,21/11/2016,30,2017-04-05T06:05:58.000Z,2017-04-05T06:20:39.000Z,Province1,District1,Ward2,Village1,22,yes,yes,22,no,no,no,no,7,grass,muddaub,,earth,no,1,2,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,2,7,,5,669,-19.04300478,33.40505449,uuid:59341ead-92be-45a9-8545-6edf9f94fdc6 11 | 30,21/11/2016,31,2017-04-05T06:21:20.000Z,2017-04-05T06:38:26.000Z,Province1,District1,Ward2,Village1,15,yes,no,2,yes,yes,yes,yes,3,grass,muddaub,,earth,no,7,1,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,1,3,3,,4,704,-19.04302176,33.40509382,uuid:cb06eb49-dd39-4150-8bbe-a599e074afe8 12 | 32,21/11/2016,33,2017-04-05T08:08:19.000Z,2017-04-05T08:25:48.000Z,Province1,District1,Ward2,Village1,20,yes,no,34,yes,yes,yes,yes,8,grass,muddaub,,cement,no,2,1,no,2,yes,yes,20,yes,no,no,,no,more_once,,2,no,,more_half,no,no,,yes,no,2,2,8,,5,695,-19.04414887,33.40383602,uuid:0fbd2df1-2640-4550-9fbd-7317feaa4758 13 | 34,17/11/2016,35,2017-04-05T16:22:13.000Z,2017-04-05T16:50:25.000Z,Province1,District1,Ward2,Village3,45,yes,no,45,no,no,no,no,5,grass,muddaub,,earth,no,3,1,no,2,yes,yes,20,yes,no,yes,no,no,more_once,,2,no,,more_half,no,yes,,yes,no,2,3,5,,11,733,-19.11211362,33.48342515,uuid:ff7496e7-984a-47d3-a8a1-13618b5683ce 14 | 37,17/11/2016,38,2017-04-05T17:28:12.000Z,2017-04-05T17:50:57.000Z,Province1,District1,Ward2,Village2,19,yes,yes,19,no,no,no,no,10,grass,muddaub,,earth,no,3,1,no,2,yes,yes,9,yes,no,yes,no,no,never,,2,no,,more_half,no,yes,,yes,yes,3,3,10,,9,696,-19.11222939,33.48337467,uuid:81309594-ff58-4dc1-83a7-72af5952ee08 15 | 40,17/11/2016,41,2017-04-06T09:03:50.000Z,2017-04-06T09:14:05.000Z,Province1,District1,Ward2,Village2,22,yes,no,22,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,2,3,7,,13,679,-19.04339398,33.40485363,uuid:b3ba34d8-eea1-453d-bc73-c141bcbbc5e5 16 | 42,17/11/2016,43,2017-04-06T09:31:56.000Z,2017-04-06T09:53:53.000Z,Province1,District1,Ward2,Village3,3,no,no,29,no,no,no,no,7,grass,muddaub,,earth,no,2,1,no,4,yes,no,3,yes,no,no,,no,never,,4,no,,abt_half,yes,no,,no,no,2,2,7,,30,605,-19.04303063,33.40472726,uuid:b4dff49f-ef27-40e5-a9d1-acf287b47358 17 | 43,17/11/2016,44,2017-04-06T14:44:32.000Z,2017-04-06T14:53:01.000Z,Province1,District1,Ward2,Village3,3,no,no,6,no,no,no,no,2,grass,muddaub,,earth,no,2,1,no,1,no,,,,,,,,,1,,,,,yes,no,,yes,no,3,2,2,,11,716,-19.04315107,33.40458039,uuid:f9fadf44-d040-4fca-86c1-2835f79c4952 18 | 44,17/11/2016,45,2017-04-06T14:53:04.000Z,2017-04-06T15:11:57.000Z,Province1,District1,Ward2,Village3,25,yes,no,7,no,no,no,no,9,grass,muddaub,,earth,no,2,1,no,3,yes,yes,20,yes,no,no,,no,never,,3,yes,,more_half,yes,no,,no,no,4,3,9,,28,703,-19.04312371,33.40466493,uuid:e3554d22-35b1-4fb9-b386-dd5866ad5792 19 
| 46,17/11/2016,47,2017-04-07T14:05:25.000Z,2017-04-07T14:19:45.000Z,Province1,District1,Ward2,Village3,2,yes,no,2,yes,yes,yes,yes,2,grass,muddaub,,earth,no,1,1,yes,2,yes,yes,2,yes,no,yes,no,no,once,,2,no,,more_half,no,no,,no,no,1,3,2,,5,689,-19.11226093,33.48339791,uuid:2d0b1936-4f82-4ec3-a3b5-7c3c8cd6cc2b 20 | 47,16/11/2016,48,2017-04-07T14:19:49.000Z,2017-04-07T14:40:23.000Z,Province1,District1,Ward2,Village3,48,yes,no,58,yes,no,yes,no,7,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,no,,yes,no,3,3,7,,12,689,-19.11222978,33.48353345,uuid:e180899c-7614-49eb-a97c-40ed013a38a2 21 | 49,16/11/2016,50,2017-04-07T14:56:01.000Z,2017-04-07T15:26:23.000Z,Province1,District1,Ward2,Village3,6,yes,no,7,no,no,no,no,6,grass,muddaub,,earth,no,1,1,no,1,yes,yes,1,no,no,yes,no,no,never,,1,no,,abt_half,yes,no,,yes,no,1,2,6,,12,718,-19.11220496,33.48344521,uuid:4267c33c-53a7-46d9-8bd6-b96f58a4f92c 22 | 50,16/11/2016,51,2017-04-07T15:27:45.000Z,2017-04-07T15:39:10.000Z,Province1,District1,Ward2,Village3,11,yes,no,30,yes,no,yes,no,5,grass,muddaub,,cement,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,3,5,,12,709,-19.11221446,33.48338443,uuid:18ac8e77-bdaf-47ab-85a2-e4c947c9d3ce 23 | 53,16/11/2016,54,2017-04-08T05:36:55.000Z,2017-04-08T05:52:15.000Z,Province1,District1,Ward2,Village3,10,no,yes,15,yes,no,yes,no,7,grass,muddaub,,earth,no,3,1,no,1,yes,yes,10,yes,no,no,,no,never,,1,no,,more_half,no,no,,yes,no,1,2,7,,9,681,-19.11220247,33.48335527,uuid:273ab27f-9be3-4f3b-83c9-d3e1592de919 24 | 54,16/11/2016,55,2017-04-08T05:52:32.000Z,2017-04-08T06:05:41.000Z,Province1,District1,Ward2,Village3,23,yes,no,23,yes,no,yes,no,9,grass,muddaub,,earth,no,2,2,no,1,no,,,,,,,,,1,,,,,no,no,,yes,no,1,2,9,,11,702,-19.11228974,33.48333393,uuid:883c0433-9891-4121-bc63-744f082c1fa0 25 | 58,16/11/2016,59,2017-04-08T08:52:05.000Z,2017-04-08T09:02:34.000Z,Province1,District1,Ward2,Village3,60,no,no,60,yes,yes,yes,yes,2,grass,muddaub,,earth,no,1,3,yes,2,no,,,,,,,,,2,,,,,no,no,,no,no,3,2,2,,13,683,-19.1123395,33.48333251,uuid:1936db62-5732-45dc-98ff-9b3ac7a22518 26 | 60,16/11/2016,61,2017-04-08T10:47:11.000Z,2017-04-08T11:14:09.000Z,Province1,District1,Ward2,Village3,14,yes,no,14,no,yes,yes,yes,10,grass,muddaub,,earth,no,4,1,no,2,yes,yes,13,yes,no,yes,yes,no,more_once,,2,no,,more_half,yes,no,,yes,no,3,3,10,,13,712,-19.11218035,33.48341635,uuid:2401cf50-8859-44d9-bd14-1bf9128766f2 27 | 61,16/11/2016,62,2017-04-08T13:27:58.000Z,2017-04-08T13:41:21.000Z,Province1,District1,Ward2,Village3,5,no,no,5,yes,no,no,no,5,grass,muddaub,,earth,no,3,1,no,2,no,,,,,,,,,2,,,,,no,yes,,no,no,1,3,5,,18,719,-19.11216869,33.48339699,uuid:c6597ecc-cc2a-4c35-a6dc-e62c71b345d6 28 | 62,16/11/2016,63,2017-04-08T13:41:39.000Z,2017-04-08T13:52:07.000Z,Province1,District1,Ward2,Village3,1,yes,no,10,yes,yes,no,yes,4,grass,muddaub,,earth,no,1,1,yes,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,3,4,,25,702,-19.11220024,33.4833903,uuid:86ed4328-7688-462f-aac7-d6518414526a 29 | 63,16/11/2016,64,2017-04-08T13:52:30.000Z,2017-04-08T14:02:24.000Z,Province1,District1,Ward2,Village3,1,no,no,1,no,no,no,no,6,grass,muddaub,,earth,no,2,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,3,6,,9,704,-19.1121962,33.48339576,uuid:28cfd718-bf62-4d90-8100-55fafbe45d06 30 | 68,16/11/2016,69,2017-04-09T22:08:07.000Z,2017-04-09T22:21:08.000Z,Province1,District1,Ward2,Village3,12,yes,no,12,no,no,no,no,4,grass,muddaub,,earth,no,2,1,no,1,yes,yes,12,yes,no,no,,no,more_once,,1,no,,more_half,no,no,,yes,no,1,3,4,,8,708,-19.11216693,33.48340465,uuid:f86933a5-12b8-4427-b821-43c5b039401d 31 | 
78,25/11/2016,180,2017-04-09T08:23:05.000Z,2017-04-09T08:42:02.000Z,Province1,District1,Ward2,Village1,4,no,no,50,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,yes,2,yes,yes,4,yes,no,no,,no,never,,2,no,,abt_half,no,yes,,yes,no,3,3,7,,4,701,-19.11204921,33.48346659,uuid:ece89122-ea99-4378-b67e-a170127ec4e6 32 | 81,28/11/2016,186,2017-04-09T15:20:26.000Z,2017-04-09T15:46:14.000Z,Province1,District1,Ward1,Village2,24,no,no,24,yes,no,yes,no,7,grass,muddaub,,earth,no,1,1,no,1,yes,yes,21,yes,no,no,,no,more_once,,1,no,,more_half,no,no,,no,no,2,3,7,,10,690,-19.11232028,33.48346266,uuid:268bfd97-991c-473f-bd51-bc80676c65c6 33 | 82,28/11/2016,187,2017-04-09T15:48:14.000Z,2017-04-09T16:12:46.000Z,Province1,District1,Ward2,Village2,1,yes,no,43,yes,no,no,yes,5,grass,muddaub,,earth,no,3,2,yes,2,yes,yes,24,yes,no,yes,no,no,more_once,,2,no,,more_half,no,no,,yes,no,4,3,5,,13,691,-19.1122345,33.48349248,uuid:0a42c9ee-a840-4dda-8123-15c1bede5dfc 34 | 87,21/11/2016,201,2017-04-09T19:31:47.000Z,2017-04-09T19:45:38.000Z,Province1,District1,Ward2,Village2,6,yes,no,6,yes,yes,yes,yes,4,grass,muddaub,,earth,no,1,2,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,no,2,2,4,,10,685,-19.11217226,33.48338504,uuid:9e79a31c-3ea5-44f0-80f9-a32db49422e3 35 | 89,26/04/2017,72,2017-04-26T15:46:24.000Z,2017-04-26T16:13:33.000Z,Province1,District1,Ward2,Village1,24,yes,no,24,yes,yes,yes,yes,6,grass,muddaub,,earth,no,2,1,no,3,yes,yes,4,yes,no,yes,yes,no,more_once,,3,no,,more_half,no,yes,,yes,no,3,2,6,,5,716,-19.11221489,33.48347358,uuid:c4a2c982-244e-45a5-aa4b-71fa53f99e18 36 | 95,27/04/2017,101,2017-04-27T16:42:02.000Z,2017-04-27T18:11:54.000Z,Province1,District1,Ward2,Village2,10,no,no,4,yes,yes,no,no,3,grass,muddaub,,earth,no,1,1,no,2,yes,yes,10,yes,no,no,,no,never,,2,no,,abt_half,no,yes,,no,yes,1,3,3,,11,702,-19.11223501,33.48341767,uuid:3c174acd-e431-4523-9ad6-eb14cddca805 37 | 106,04/05/2017,118,2017-05-04T10:26:35.000Z,2017-05-04T10:46:35.000Z,Province1,District1,Ward2,Village1,6,no,no,25,yes,yes,yes,no,5,grass,muddaub,,earth,no,2,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,3,5,,9,713,-19.11220573,33.48344178,uuid:77335b2e-8812-4a35-b1e5-ca9ab626dfea 38 | 108,04/05/2017,119,2017-05-04T11:16:57.000Z,2017-05-04T11:38:38.000Z,Province1,District1,Ward2,Village1a,14,no,yes,14,yes,no,no,no,3,grass,muddaub,,earth,no,4,1,no,2,yes,no,3,yes,no,no,,no,never,,2,no,,more_half,no,no,,no,no,4,3,3,,5,706,-19.11225658,33.48334839,uuid:fa201fce-4e94-44b8-b435-c558c2e1ed55 39 | 112,11/05/2017,117,2017-05-11T06:28:02.000Z,2017-05-11T06:55:35.000Z,Province1,District1,Ward2,Village1a,1,no,no,28,yes,yes,yes,no,10,grass,muddaub,,cement,no,1,4,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,3,10,,20,0,-19.1114691,33.4761047,uuid:3fe626b3-c794-48e1-a80f-5bfe440c507b 40 | 115,18/05/2017,150,2017-05-18T10:37:37.000Z,2017-05-18T10:56:00.000Z,Province1,District1,Ward2,Village1,8,no,no,8,no,yes,no,yes,7,grass,muddaub,,earth,no,2,1,no,2,yes,yes,8,yes,no,no,,no,never,,2,no,,less_half,no,yes,,no,yes,1,3,7,,17,709,-19.11147739,33.47618369,uuid:92613d0d-e7b1-4d62-8ea4-451d7cd0a982 41 | 119,03/06/2017,166,2017-06-03T05:53:28.000Z,2017-06-03T06:25:06.000Z,Province1,District1,Ward2,Village1,16,no,no,16,yes,yes,yes,yes,11,grass,muddaub,,earth,no,2,1,no,3,yes,yes,2,yes,no,no,,no,never,,3,no,,less_half,no,yes,,yes,no,1,2,11,,1799.999,0,-19.1138589,33.4826653,uuid:40aac732-94df-496c-97ba-5b67f59bcc7a 42 | 
120,03/06/2017,167,2017-06-03T06:25:09.000Z,2017-06-03T06:45:06.000Z,Province1,District1,Ward2,Village1,16,no,no,24,yes,no,yes,no,8,grass,muddaub,,earth,no,5,1,no,2,yes,yes,16,yes,no,no,,no,never,,2,no,,less_half,no,yes,,yes,yes,3,2,8,,2000,0,-19.1149887,33.4882685,uuid:a9d1a013-043b-475d-a71b-77ed80abe970 43 | 128,04/06/2017,194,2017-06-04T10:13:36.000Z,2017-06-04T10:32:06.000Z,Province1,District1,Ward2,Village1,5,no,no,5,no,no,yes,no,4,grass,muddaub,,earth,no,2,1,no,2,yes,yes,2,yes,no,no,,no,more_once,,2,no,,more_half,no,yes,,no,no,1,3,4,,10,719,-19.11227133,33.48347111,uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf 44 | -------------------------------------------------------------------------------- /episodes/data/SN7577.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/data/SN7577.sqlite -------------------------------------------------------------------------------- /episodes/data/SN7577i_a.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3,Q4 2 | 1,1,-1,1,8 3 | 2,3,-1,1,4 4 | 3,10,3,2,6 5 | 4,9,-1,10,10 6 | 5,10,2,6,1 7 | 6,1,-1,1,1 8 | 7,1,-1,1,8 9 | 8,1,-1,1,1 10 | 9,9,-1,10,10 11 | 10,2,-1,1,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_aa.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3 2 | 1,1,-1,1 3 | 2,3,-1,1 4 | 3,10,3,2 5 | 4,9,-1,10 6 | 5,10,2,6 7 | 6,1,-1,1 8 | 7,1,-1,1 9 | 8,1,-1,1 10 | 9,9,-1,10 11 | 10,2,-1,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_b.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3,Q4 2 | 1277,10,10,4,6 3 | 1278,2,-1,5,4 4 | 1279,2,-1,4,5 5 | 1280,1,-1,2,3 6 | 1281,10,2,3,4 7 | 1282,2,-1,3,6 8 | 1283,10,10,2,10 9 | 1284,9,-1,8,9 10 | 1285,11,11,1,2 11 | 1286,10,6,6,6 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_bb.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q4 2 | 1277,10,10,6 3 | 1278,2,-1,4 4 | 1279,2,-1,5 5 | 1280,1,-1,3 6 | 1281,10,2,4 7 | 1282,2,-1,6 8 | 1283,10,10,10 9 | 1284,9,-1,9 10 | 1285,11,11,2 11 | 1286,10,6,6 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_c.csv: -------------------------------------------------------------------------------- 1 | Id,maritl,numhhd 2 | 1,6,3 3 | 2,4,3 4 | 3,6,2 5 | 4,4,1 6 | 5,4,1 7 | 6,2,2 8 | 7,2,2 9 | 8,2,2 10 | 9,6,2 11 | 10,6,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_d.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2 2 | 1,1,-1 3 | 2,3,-1 4 | 3,10,3 5 | 4,9,-1 6 | 5,10,2 7 | 6,1,-1 8 | 7,1,-1 9 | 8,1,-1 10 | 9,9,-1 11 | 10,2,-1 12 | -------------------------------------------------------------------------------- /episodes/fig/Python_date_format_01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_date_format_01.png -------------------------------------------------------------------------------- 
/episodes/fig/Python_function_parameters_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_function_parameters_9.png -------------------------------------------------------------------------------- /episodes/fig/Python_install_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_install_1.png -------------------------------------------------------------------------------- /episodes/fig/Python_install_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_install_2.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_6.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_7.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_8.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyterl_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyterl_6.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_3.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_4.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_5.png -------------------------------------------------------------------------------- /episodes/fig/Python_repll_3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repll_3.png -------------------------------------------------------------------------------- /episodes/fig/barplot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/barplot1.png -------------------------------------------------------------------------------- /episodes/fig/boxplot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot1.png -------------------------------------------------------------------------------- /episodes/fig/boxplot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot2.png -------------------------------------------------------------------------------- /episodes/fig/boxplot3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot3.png -------------------------------------------------------------------------------- /episodes/fig/functionAnatomy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/functionAnatomy.png -------------------------------------------------------------------------------- /episodes/fig/histogram1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/histogram1.png -------------------------------------------------------------------------------- /episodes/fig/histogram3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/histogram3.png -------------------------------------------------------------------------------- /episodes/fig/lm1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/lm1.png -------------------------------------------------------------------------------- /episodes/fig/pandas_join_types.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/pandas_join_types.png -------------------------------------------------------------------------------- /episodes/fig/scatter1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/scatter1.png -------------------------------------------------------------------------------- /episodes/fig/scatter2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/scatter2.png -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | maintainers: 3 | - Stephen Childs 4 | - Geoffrey Boushey 5 | - Annajiat Alim Rasel 6 | permalink: index.html 7 | site: sandpaper::sandpaper_site 8 | --- 9 | 10 | **Lesson Maintainers:** {{ page.maintainers | join: ', ' }} 11 | 12 | Python is a general-purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. 13 | 14 | This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about Python syntax and the Jupyter notebook interface, then move through importing CSV files, using the pandas package to work with data frames, calculating summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python. 15 | 16 | :::::::::::::::::::::::::::::::::::::::::: prereq 17 | 18 | ## Getting Started 19 | 20 | Data Carpentry's teaching is hands-on, so participants are encouraged to use 21 | their own computers to ensure the proper setup of tools for an efficient 22 | workflow. 23 | 24 | **These lessons assume no prior knowledge of the skills or tools.** 25 | 26 | To get started, follow the directions in the "[Setup](learners/setup.md)" tab to 27 | download data to your computer and follow any installation instructions. 28 | 29 | 30 | :::::::::::::::::::::::::::::::::::::::::::::::::: 31 | 32 | :::::::::::::::::::::::::::::::::::::::::: prereq 33 | 34 | ## Prerequisites 35 | 36 | This lesson requires a working copy of **Python**. 37 | 38 | To most effectively use these materials, please make sure to install 39 | everything *before* working through this lesson and download the data files mentioned in the [Setup](learners/setup.md) tab. 40 | 41 | 42 | :::::::::::::::::::::::::::::::::::::::::::::::::: 43 | 44 | :::::::::::::::::::::::::::::::::::::::::: prereq 45 | 46 | ## For Instructors 47 | 48 | If you are teaching this lesson in a workshop, please see the 49 | [Instructor notes](instructors/instructor-notes.md). 50 | 51 | 52 | :::::::::::::::::::::::::::::::::::::::::::::::::: 53 | 54 | 55 | -------------------------------------------------------------------------------- /instructors/instructor-notes.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Instructor Notes 3 | --- 4 | 5 | ## Setup 6 | 7 | The setup instructions for installing the Anaconda distribution of Python are in the [setup](../learners/setup.md) file. 8 | The Anaconda distribution contains all of the third-party libraries used. 9 | 10 | pip is referred to in the text, but it should not need to be used. 11 | 12 | It is assumed that Jupyter notebooks will be used for all of the coding. (The shell is used when explaining the REPL.) 13 | 14 | How to start Jupyter is included in the setup instructions. 15 | 16 | ## The datasets used 17 | 18 | All of the datasets used have been placed in the data folder. 19 | 20 | They should be downloaded to the local machine before use.
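A quick way to confirm that a notebook can see the data files before teaching starts (a minimal sketch; `SAFI_results.csv` stands in for any of the lesson's data files):

```python
import os

# The lesson's code examples assume the data files sit next to the
# notebook, i.e. in the current working directory.
print(os.getcwd())
print(os.path.exists("SAFI_results.csv"))
```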
21 | 22 | The code examples are written on the assumption that the data files are in the same folder as the notebook. 23 | 24 | ## The Lessons 25 | 26 | It is assumed that all of the work will be done in Jupyter notebooks. There are no specific instructions on when to start a notebook; starting a new one at each episode would be appropriate. 27 | 28 | If a new notebook is created or a new Jupyter session is started, any modules that need to be used will have to be imported again. 29 | 30 | Each section of code within an episode will not necessarily re-import modules that are needed for that section. 31 | 32 | [Introduction to Python](link) 33 | 34 | An introduction to the benefits of the Python language. 35 | 36 | Comparison between interpreted and compiled languages. 37 | 38 | An explanation and example of using the REPL. 39 | 40 | An introduction to using Jupyter notebooks. 41 | 42 | [Python basics](link) 43 | 44 | Using Jupyter - creating new cells, hiding output, changing cell type. 45 | 46 | Python variables and data types. 47 | 48 | Using `print` and built-in functions. 49 | 50 | Python help and Internet help for Python. 51 | 52 | Strings and string functions. 53 | 54 | Lists and the `range` function (dictionaries kept for later, when needed). 55 | 56 | [Python control structures](link) 57 | 58 | Explanation of typical program structure and the need for control structures. 59 | 60 | Explanation and examples of the `if`, `while` and `for` constructs. 61 | 62 | [Creating re-usable code](link) 63 | 64 | Explanation of functions and why we use them. 65 | 66 | Creating functions. 67 | 68 | Using parameters in functions. 69 | 70 | Python libraries and how to install and use them. 71 | 72 | [Processing data from a file](link) 73 | 74 | This episode starts to use some of the control structures and files to create complete small programs with 75 | a proper 'Input - Processing - Output' structure. 76 | 77 | Different approaches to reading files, i.e. one record at a time vs. reading the whole file. 78 | 79 | Opening and closing files and file handles. 80 | 81 | Description of a CSV file as a list of strings. 82 | 83 | Iterating over complete files. 84 | 85 | Use of the `split` function to parse a line from a CSV file and extract specific elements. 86 | 87 | Writing extracted information to a file. 88 | 89 | The Python dictionary: explanation and examples. 90 | 91 | Creating and populating a dictionary on the fly from data read from a file. 92 | 93 | [Date and Time](link) 94 | 95 | Need to import the datetime module. 96 | 97 | Explanation of format strings to provide a representation of the dates and times. 98 | 99 | Need to convert date/time strings to a datetime structure before using them. 100 | 101 | The ability to extract individual components from, and do arithmetic with, datetime structures. 102 | 103 | [Processing JSON data](link) 104 | 105 | This episode is programmatically the most complex. However, the examples are built up as gently as possible. 106 | The solution to the second exercise needs to be understood to make it worthwhile continuing with the coding. 107 | 108 | The episode starts with further details of the dictionary object and how it can be used to represent nested data structures. 109 | 110 | The JSON data format is described and compared to a Python dictionary. 111 | 112 | An example of using the json module to read a file in JSON format is explained. 113 | 114 | How to make the printed JSON more presentable using `json.dumps` parameters is demonstrated.
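As a minimal sketch of the pattern these notes describe - reading a JSON file and pretty-printing it with `json.dumps` - assuming `SAFI.json` is in the same folder as the notebook and holds a list of interview records:

```python
import json

# Read the whole file into a Python data structure.
with open("SAFI.json") as f:
    data = json.load(f)

# The indent and sort_keys parameters of json.dumps make the printed
# output far easier to read than printing the raw structure.
print(json.dumps(data[0], indent=4, sort_keys=True))
```

Showing learners the unformatted `print(data[0])` first makes the value of the formatting parameters obvious.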
115 | 116 | Systematically extracting single items from a dictionary is covered, followed by extracting all such entries 117 | from a file of JSON. 118 | 119 | The overall aim is to demonstrate extracting a set of specific fields from a JSON formatted file and writing them to a flat structured CSV file, making subsequent processing more straightforward. 120 | 121 | [Pandas](link) 122 | 123 | This episode introduces the pandas module and explains the advantages of using it for data analysis. 124 | 125 | The two key pandas data structures are introduced. 126 | 127 | Examples of reading a CSV file into a dataframe, optionally selecting only specific columns. 128 | 129 | Various methods for extracting information about the dataframe are covered. 130 | 131 | [Extracting rows and columns](link) 132 | 133 | Basic pandas methods for extracting rows and columns from a dataframe are covered. 134 | 135 | For row selection the emphasis is on specifying criteria, although random selection is also covered. 136 | 137 | [Aggregations\_and\_missing\_data](link) 138 | 139 | Explain why we want to summarise data. 140 | 141 | Introduce the basic aggregation methods for a dataframe and individual columns. 142 | 143 | The aggregation examples will turn up 'NaN' values. Missing data and its representations are discussed. 144 | 145 | The effects of missing data in summarisation are discussed. 146 | 147 | Ways of recognising and dealing with missing data are discussed and demonstrated. 148 | 149 | Summarising categorical variables using the `groupby` method is discussed and demonstrated. 150 | 151 | [Joins](link) 152 | 153 | Explain why we want to join dataframes and the necessary conditions for doing so. 154 | 155 | Simple concatenation of rows using `concat` and the effect of the columns not being the same. 156 | 157 | The downside of using `concat` to join by columns. 158 | 159 | The use of `merge` and its similarity with the SQL JOIN clause. 160 | 161 | The different types of joins available with `merge`. 162 | 163 | Discussion and examples of what different join types tell you about the data. 164 | 165 | [Long\_and\_wide\_data\_formats](link) 166 | 167 | The difference between wide and long formats. 168 | 169 | Use and examples of the `melt` and `pivot` methods to convert between them. 170 | 171 | Because the SN7577 dataset has no 'key' column, the setup for the examples is more complex, but it makes use of previously used techniques as well as introducing different ways of selecting columns from a dataframe. 172 | 173 | The final exercise is quite long, but brings together the concepts learned in this and the previous two episodes. 174 | 175 | [Data visualisation using Matplotlib](link) 176 | 177 | The basic aim of this episode is to illustrate how simple graphs can be produced using matplotlib. 178 | 179 | This episode uses entirely random data produced by numpy methods. 180 | 181 | Four commonly used graph types are demonstrated: bar charts, histograms, scatter plots and boxplots. 182 | 183 | The tight integration between pandas and matplotlib is discussed and illustrated. 184 | 185 | How graphs can be built up from individual components is covered. 186 | 187 | Saving the produced graph to a file is explained and demonstrated. 188 | 189 | [Accessing SQLite Databases](link) 190 | 191 | A comparison between a relational database table and a pandas dataframe is made. 192 | 193 | The possible advantages of using data from a database (SQLite in this case) rather than having it in a dataframe are explained.
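A minimal sketch of the access pattern covered in these notes (connection, cursor, `execute`, and the pandas equivalent), assuming `SN7577.sqlite` is in the same folder as the notebook; the table name `SN7577` is an assumption based on the dataset's name and should be checked against the actual file:

```python
import sqlite3
import pandas as pd

# The connection string for SQLite is just the path to the database file.
con = sqlite3.connect("SN7577.sqlite")

# The cursor runs SQL statements and holds their results.
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
rows = cur.fetchall()   # a list of tuples, one per row
print(len(rows))

# Pandas can run the same query in a single call, returning a dataframe.
df = pd.read_sql_query("SELECT * FROM SN7577", con)
print(df.shape)

con.close()
```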
194 | 195 | The construction of an SQLite database as a single file is covered. 196 | 197 | The sqlite3 module is introduced and connection strings are explained. 198 | 199 | The use of the connection and the cursor is explained. 200 | 201 | The use of the `execute` method to run an SQL query on the database system is demonstrated. 202 | 203 | Various ways of retrieving the results of the query are covered. 204 | 205 | Pandas is used to run a similar query, and its greater simplicity is explained. 206 | 207 | The problem with running DML statements with pandas is illustrated in the final exercise. 208 | 209 | 210 | -------------------------------------------------------------------------------- /learners/discuss.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Discussion 3 | --- 4 | 5 | FIXME 6 | 7 | 8 | -------------------------------------------------------------------------------- /learners/reference.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Glossary' 3 | --- 4 | 5 | ## Glossary 6 | 7 | 0-based indexing 8 | : is a way of assigning indices to elements in a sequential, ordered data structure 9 | starting from 0, i.e. where the first element of the sequence has index 0. 10 | 11 | attribute 12 | : a property of an object that can be viewed; accessed with a `.` but no `()`, e.g. `df.dtypes` 13 | 14 | boolean 15 | : a data type that can be `True` or `False` 16 | 17 | cast 18 | : the process of changing the type of a variable; in Python the data type names operate as functions for casting, for example `int(3.5)` 19 | 20 | CSV (file) 21 | : is an acronym which stands for Comma-Separated Values file. CSV files store 22 | tabular data, either numbers, strings, or a combination of the two, in plain 23 | text with columns separated by a comma and rows by the newline character. 24 | 25 | database 26 | : is an organized collection of data. 27 | 28 | DataFrame 29 | : is a two-dimensional labeled data structure with columns of (potentially) 30 | different types. 31 | 32 | data structure 33 | : is a particular way of organizing data in memory. 34 | 35 | data type 36 | : is a particular kind of item that can be assigned to a variable, defined by 37 | the values it can take, the programming language in use and the operations 38 | that can be performed on it. examples: `int` (integer), `str` (string), float, boolean, list 39 | 40 | docstring 41 | : is a recommended documentation string to describe what a Python function does. 42 | 43 | faceting 44 | : is the act of plotting relationships between set variables in multiple subsets 45 | of the data with the results appearing as different panels in the same figure. 46 | 47 | float 48 | : is a Python data type designed to store positive and negative decimal numbers 49 | by means of a floating point representation. 50 | 51 | function 52 | : is a group of related statements that perform a specific task. 53 | 54 | integer 55 | : is a Python data type designed to store positive and negative integer numbers. 56 | 57 | interactive mode 58 | : is an online mode of operation in which the user writes the commands directly 59 | on the command line one-by-one and executes them immediately by pressing a key 60 | on the keyboard, usually Enter. 61 | 62 | join key 63 | : is a variable or an array representing the column names over which pandas.DataFrame.join() 64 | merges together columns of different data sets.
65 | 66 | library 67 | : is a set of functions and methods grouped together to perform some specific 68 | sort of tasks. 69 | 70 | list 71 | : is a Python data structure designed to contain sequences of integers, floats, 72 | strings and any combination of the previous. The sequence is ordered and indexed 73 | by integers, starting from 0. Elements of a list can be accessed by their index 74 | and can be modified. 75 | 76 | loop 77 | : is a sequence of instructions that is continually repeated until a condition 78 | is satisfied. 79 | 80 | method 81 | : a function that is specific to a type of data, accessed via `.`; requires `()` to run, for example `df.sum()` 82 | 83 | NaN 84 | : is an acronym for Not-a-Number and represents that either a value is missing or 85 | the calculation cannot output any meaningful result. 86 | 87 | None 88 | : is an object that represents no value. 89 | 90 | scripting mode 91 | : is an offline mode of operation in which the user writes the commands to be 92 | executed in a text file (with .py extension for Python) which is then compiled 93 | or interpreted to run the program. Note that Python interprets the script at 94 | run-time, compiling a bytecode version of the program to speed up execution. 95 | 96 | sequential (data structure) 97 | : is an ordered group of objects stored in memory which can be accessed by specifying 98 | their index, i.e. their position, in the structure. 99 | 100 | string 101 | : is a Python data type designed to store sequences of characters. 102 | 103 | tuple 104 | : is a Python data structure designed to contain sequences of integers, floats, 105 | strings and any combination of the previous. The sequence is ordered and indexed 106 | by integers, starting from 0. Elements of a tuple can be accessed by their index 107 | but cannot be modified. 108 | 109 | variable 110 | : a named quantity that can store a value; a variable can store any type, but always one type for a given value. 111 | 112 | ## Jupyter Notebook Hints 113 | 114 | `Esc` will take you into command mode, where you can navigate around your notebook with the arrow keys. 115 | 116 | ### While in command mode: 117 | 118 | - A to insert a new cell above the current cell. 119 | - B to insert a new cell below. 120 | - M to change the current cell to Markdown. 121 | - Y to change it back to code. 122 | - D + D (press the key twice) to delete the current cell. 123 | - Enter will take you from command mode back into edit mode for the given cell. 124 | 125 | ### While in edit mode: 126 | 127 | - Ctrl + Shift + \- will split the current cell into two from where your cursor is. 128 | - Shift + Enter will run the current cell. 129 | 130 | ### Full Shortcut Listing 131 | 132 | Cmd + Shift + P (or Ctrl + Shift + P on Linux and Windows) opens the command palette, which lists all available shortcuts. 133 | 134 | 135 | -------------------------------------------------------------------------------- /learners/setup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Setup 3 | --- 4 | 5 | :::::::::::::::::::::::::::::::::::::::::: prereq 6 | 7 | ## Data 8 | 9 | Data for this lesson is from [The SAFI Teaching Database](https://datacarpentry.org/socialsci-workshop/data) and the [Audit of Political Engagement 11, 2013](https://doi.org/10.5255/UKDA-SN-7577-1). 10 | 11 | We will use the files listed below as the data for this lesson. You can download the files by clicking on the links.
12 | 13 | **Note: make sure to place the data files in the same folder as your notebook.** 14 | 15 | - [SAFI\_clean.csv](/data/SAFI_clean.csv) 16 | - [SAFI\_results.csv](/data/SAFI_results.csv) 17 | - [SAFI\_full\_shortname.csv](/data/SAFI_full_shortname.csv) 18 | - [SAFI.json](/data/SAFI.json) 19 | - [SN7577.tab](/data/SN7577.tab) 20 | - [SN7577i\_a.csv](/data/SN7577i_a.csv) 21 | - [SN7577i\_aa.csv](/data/SN7577i_aa.csv) 22 | - [SN7577i\_b.csv](/data/SN7577i_b.csv) 23 | - [SN7577i\_bb.csv](/data/SN7577i_bb.csv) 24 | - [SN7577i\_c.csv](/data/SN7577i_c.csv) 25 | - [SN7577i\_d.csv](/data/SN7577i_d.csv) 26 | - [SN7577.sqlite](/data/SN7577.sqlite) 27 | - [Newspapers.csv](/data/Newspapers.csv) 28 | 29 | 30 | :::::::::::::::::::::::::::::::::::::::::::::::::: 31 | 32 | :::::::::::::::::::::::::::::::::::::::::: prereq 33 | 34 | ## Software 35 | 36 | [Python](https://python.org) is a popular language for 37 | scientific computing, and great for general-purpose programming as 38 | well. Installing all of its scientific packages individually can be 39 | a bit difficult, so we recommend an all-in-one installer. 40 | 41 | For this workshop we use Python version 3.x. 42 | 43 | ### Required Python Packages for this workshop 44 | 45 | - [Pandas](https://pandas.pydata.org/) 46 | - [Jupyter notebook](https://jupyter.org/) 47 | - [Numpy](https://www.numpy.org/) 48 | - [Matplotlib](https://matplotlib.org/) 49 | - [Seaborn](https://seaborn.pydata.org) 50 | 51 | 52 | :::::::::::::::::::::::::::::::::::::::::::::::::: 53 | 54 | ## Setup instructions for Python 55 | 56 | In order to complete the materials for the Python lesson, you will need Python to be installed on your machine. As many of the examples and exercises use Jupyter notebooks, you will need Jupyter to be installed as well. 57 | 58 | The [Anaconda](https://www.anaconda.com/download/) distribution of Python will allow you to install both Python and Jupyter notebooks as a single install. Anaconda will also install many other commonly used Python packages. 59 | 60 | ### How to install the Anaconda distribution of Python 61 | 62 | 1. Follow the Anaconda link above to the Anaconda website. There are versions of Anaconda available for Windows, macOS, and Linux. The website will detect your operating system and provide a link to the appropriate download. 63 | 2. There will be two options, one for Python 2.x and another for Python 3.x. We will take the Python 3.x option. Python 2.x will eventually be phased out but is still provided for backward compatibility with some older optional Python modules. The majority of popular modules have been converted to work with Python 3.x. The actual value of x will vary depending on when you download. At the time of writing I am being offered Python 3.6 or Python 2.7. 64 | 3. For Windows and Linux there is the option of either a 64 bit (default) download or a 32 bit download. Unless you know that you have an old 32 bit PC, you should choose the 64 bit installer. 65 | 4. Run the downloaded installer program. Accept the default settings until you are given the option to add Anaconda to your PATH environment variable. Despite the recommendation not to and the subsequent warning, you should select this option. This will make it easier later on to start Jupyter notebooks from any location. 66 | 5. The installation can take a few minutes. When finished you should be able to open a cmd prompt (type `cmd` from the Windows start menu) and type `python` into the cmd window. You should get a display similar to that below.
67 | ![](/fig/Python_install_1.png){alt='Python Install'} 68 | 6. The `>>>` prompt tells you that you are in the Python environment. You can exit Python with the `exit()` command. 69 | 70 | ### Running Jupyter Notebooks in Windows 71 | 72 | 1. From File Explorer, navigate to and select the folder which will contain your Jupyter notebooks (it can be empty initially). 73 | 2. Hold down the `shift` key and right-click the mouse. 74 | 3. The pop-up menu items will include an option to start a cmd window or, in the latest Windows releases, a ‘PowerShell' window. Select whichever appears. 75 | 4. When the window opens, type the command `jupyter notebook`. 76 | 5. Several messages will appear in the command window. In addition, your default web browser will open and display the Jupyter notebook home page. The main part of this is a file browser window starting at the folder you selected in step 1. 77 | 6. There may be existing notebooks which you can select and open in a new tab in your browser, or there is a menu option to create a new notebook. 78 | ![](/fig/Python_install_2.png){alt='Python Install'} 79 | 7. The Jupyter package creates a small web service and opens your browser pointing at it. If your browser does not open, you can open it manually and specify ‘localhost:8888' as the URL. 80 | 8. Port 8888 is the default port used by the Jupyter web service, but if it is already in use Jupyter will increment the port number automatically. Either way, the port number it uses is given in a message in the cmd/powershell window. 81 | 9. Once running, the cmd/powershell window will display additional messages, e.g. about saving notebooks, but there is no need to interact with it directly. The window can be minimized and ignored. 82 | 10. To shut Jupyter down, select the cmd/powershell window, press Ctrl\+c twice and then close the window. 83 | 84 | 85 | -------------------------------------------------------------------------------- /profiles/learner-profiles.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: FIXME 3 | --- 4 | 5 | This is a placeholder file. Please add content here. 6 | -------------------------------------------------------------------------------- /site/README.md: -------------------------------------------------------------------------------- 1 | This directory contains rendered lesson materials. Please do not edit files 2 | here. 3 | --------------------------------------------------------------------------------