├── .editorconfig ├── .github └── workflows │ ├── README.md │ ├── pr-close-signal.yaml │ ├── pr-comment.yaml │ ├── pr-post-remove-branch.yaml │ ├── pr-preflight.yaml │ ├── pr-receive.yaml │ ├── sandpaper-main.yaml │ ├── sandpaper-version.txt │ ├── update-cache.yaml │ └── update-workflows.yaml ├── .gitignore ├── .zenodo.json ├── AUTHORS.md ├── CITATION ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── config.yaml ├── episodes ├── 01-introduction.md ├── 02-basics.md ├── 03-control-structures.md ├── 04-reusable.md ├── 05-processing-data-from-file.md ├── 06-date-and-time.md ├── 07-json.md ├── 08-Pandas.md ├── 09-extracting-data.md ├── 10-aggregations.md ├── 11-joins.md ├── 12-long-and-wide.md ├── 13-matplotlib.md ├── 14-sqlite.md ├── data │ ├── Newspapers.csv │ ├── Newspapers.txt │ ├── Q1.txt │ ├── Q6.txt │ ├── Q7.txt │ ├── SAFI.json │ ├── SAFI_clean.csv │ ├── SAFI_crops.csv │ ├── SAFI_full_shortname.csv │ ├── SAFI_grass_roof.csv │ ├── SAFI_grass_roof_burntbricks.csv │ ├── SAFI_grass_roof_muddaub.csv │ ├── SAFI_results.csv │ ├── SAFI_results_anon.csv │ ├── SN7577.sqlite │ ├── SN7577.tab │ ├── SN7577i_a.csv │ ├── SN7577i_aa.csv │ ├── SN7577i_b.csv │ ├── SN7577i_bb.csv │ ├── SN7577i_c.csv │ └── SN7577i_d.csv └── fig │ ├── Python_date_format_01.png │ ├── Python_function_parameters_9.png │ ├── Python_install_1.png │ ├── Python_install_2.png │ ├── Python_jupyter_6.png │ ├── Python_jupyter_7.png │ ├── Python_jupyter_8.png │ ├── Python_jupyterl_6.png │ ├── Python_repl_3.png │ ├── Python_repl_4.png │ ├── Python_repl_5.png │ ├── Python_repll_3.png │ ├── barplot1.png │ ├── boxplot1.png │ ├── boxplot2.png │ ├── boxplot3.png │ ├── functionAnatomy.png │ ├── histogram1.png │ ├── histogram3.png │ ├── lm1.png │ ├── pandas_join_types.png │ ├── scatter1.png │ └── scatter2.png ├── index.md ├── instructors └── instructor-notes.md ├── learners ├── discuss.md ├── reference.md └── setup.md ├── profiles └── learner-profiles.md └── site └── README.md /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | charset = utf-8 5 | insert_final_newline = true 6 | trim_trailing_whitespace = true 7 | 8 | [*.md] 9 | indent_size = 2 10 | indent_style = space 11 | max_line_length = 100 # Please keep this in sync with bin/lesson_check.py! 12 | trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (
<br/>) 13 | 14 | [*.r] 15 | max_line_length = 80 16 | 17 | [*.py] 18 | indent_size = 4 19 | indent_style = space 20 | max_line_length = 79 21 | 22 | [*.sh] 23 | end_of_line = lf 24 | 25 | [Makefile] 26 | indent_style = tab 27 | -------------------------------------------------------------------------------- /.github/workflows/README.md: -------------------------------------------------------------------------------- 1 | # Carpentries Workflows 2 | 3 | This directory contains workflows to be used for Lessons using the {sandpaper} 4 | lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml` 5 | and `pr-receive.yaml`) and the rest are bots to handle pull request management. 6 | 7 | These workflows will likely change as {sandpaper} evolves, so it is important to 8 | keep them up-to-date. To do this in your lesson you can do the following in your 9 | R console: 10 | 11 | ```r 12 | # Install/Update sandpaper 13 | options(repos = c(carpentries = "https://carpentries.r-universe.dev/", 14 | CRAN = "https://cloud.r-project.org")) 15 | install.packages("sandpaper") 16 | 17 | # update the workflows in your lesson 18 | library("sandpaper") 19 | update_github_workflows() 20 | ``` 21 | 22 | Inside this folder, you will find a file called `sandpaper-version.txt`, which 23 | will contain a version number for sandpaper. This will be used in the future to 24 | alert you if a workflow update is needed. 25 | 26 | What follows are the descriptions of the workflow files: 27 | 28 | ## Deployment 29 | 30 | ### 01 Build and Deploy (sandpaper-main.yaml) 31 | 32 | This is the main driver that will only act on the main branch of the repository. 33 | This workflow does the following: 34 | 35 | 1. checks out the lesson 36 | 2. provisions the following resources 37 | - R 38 | - pandoc 39 | - lesson infrastructure (stored in a cache) 40 | - lesson dependencies if needed (stored in a cache) 41 | 3. builds the lesson via `sandpaper:::ci_deploy()` 42 | 43 | #### Caching 44 | 45 | This workflow has two caches; one cache is for the lesson infrastructure and 46 | the other is for the lesson dependencies if the lesson contains rendered 47 | content. These caches are invalidated by new versions of the infrastructure and 48 | the `renv.lock` file, respectively. If there is a problem with the cache, 49 | manual invalidation is necessary. You will need maintainer access to the repository, 50 | and you can either go to the actions tab and [click on the caches button to find 51 | and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/) 52 | or set the `CACHE_VERSION` secret to the current date (which will 53 | invalidate all of the caches). 54 | 55 | ## Updates 56 | 57 | ### Setup Information 58 | 59 | These workflows run on a schedule and at the maintainer's request. Because they 60 | create pull requests that update workflows/require the downstream actions to run, 61 | they need a special repository/organization secret token called 62 | `SANDPAPER_WORKFLOW` and it must have the `public_repo` and `workflow` scopes. 63 | 64 | This can be an individual user token, OR it can be a trusted bot account. If you 65 | have a repository in one of the official Carpentries accounts, then you do not 66 | need to worry about this token being present because the Carpentries Core Team 67 | will take care of supplying this token. 68 | 69 | If you want to use your personal account: you can go to 70 | <https://github.com/settings/tokens> 71 | to create a token.
Once you have created your token, you should copy it to your 72 | clipboard and then go to your repository's settings > secrets > actions and 73 | create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token. 74 | 75 | If you do not specify your token correctly, the runs will not fail; instead, they will 76 | give you instructions on how to provide the token for your repository. 77 | 78 | ### 02 Maintain: Update Workflow Files (update-workflows.yaml) 79 | 80 | The {sandpaper} repository was designed to do as much as possible to separate 81 | the tools from the content. For local builds, this is absolutely true, but 82 | there is a minor issue when it comes to workflow files: they must live inside 83 | the repository. 84 | 85 | This workflow ensures that the workflow files are up-to-date. The way it works is 86 | to download the update-workflows.sh script from GitHub and run it. The script 87 | will do the following: 88 | 89 | 1. check the recorded version of sandpaper against the current version on GitHub 90 | 2. update the files if there is a difference in versions 91 | 92 | After the files are updated, if there are any changes, they are pushed to a 93 | branch called `update/workflows` and a pull request is created. Maintainers are 94 | encouraged to review the changes and accept the pull request if the outputs 95 | are okay. 96 | 97 | This update is run weekly or on demand. 98 | 99 | ### 03 Maintain: Update Package Cache (update-cache.yaml) 100 | 101 | For lessons that have generated content, we use {renv} to ensure that the output 102 | is stable. This is controlled by a single lockfile which documents the packages 103 | needed for the lesson and the version numbers. This workflow is skipped in 104 | lessons that do not have generated content. 105 | 106 | Because the lessons need to remain current with the package ecosystem, it's a 107 | good idea to make sure these packages can be updated periodically. The 108 | update cache workflow will do this by checking for updates, applying them in a 109 | branch called `update/packages` and creating a pull request with _only the 110 | lockfile changed_. 111 | 112 | From here, the markdown documents will be rebuilt and you can inspect what has 113 | changed based on how the packages have updated. 114 | 115 | ## Pull Request and Review Management 116 | 117 | Because our lessons execute code, pull requests are a security risk for any 118 | lesson and thus have security measures associated with them. **Do not merge any 119 | pull requests that do not pass checks or that do not have bot comments on them.** 120 | 121 | These workflows all go together and are described in the following 122 | diagram and the sections below: 123 | 124 | ![Graph representation of a pull request](https://carpentries.github.io/sandpaper/articles/img/pr-flow.dot.svg) 125 | 126 | ### Pre Flight Pull Request Validation (pr-preflight.yaml) 127 | 128 | This workflow runs every time a pull request is created or updated, and its purpose is to 129 | validate that the pull request is okay to run. This means checking the following things: 130 | 131 | 1. The pull request does not contain modified workflow files 132 | 2. If the pull request contains modified workflow files, it does not contain 133 | modified content files (such as a situation where @carpentries-bot will 134 | make an automated pull request) 135 | 3. The pull request does not contain an invalid commit hash (e.g. from a fork 136 | that was made before a lesson was transitioned from styles to use the 137 | workbench).
138 | 139 | Once the checks are finished, a comment is issued to the pull request, which 140 | will allow maintainers to determine if it is safe to run the 141 | "Receive Pull Request" workflow from new contributors. 142 | 143 | ### Receive Pull Request (pr-receive.yaml) 144 | 145 | **Note of caution:** This workflow runs arbitrary code submitted by anyone who creates a 146 | pull request. GitHub has safeguarded the token used in this workflow to have no 147 | privileges in the repository, but we have taken precautions to protect against 148 | spoofing. 149 | 150 | This workflow is triggered with every push to a pull request. If this workflow 151 | is already running and a new push is sent to the pull request, the workflow 152 | running from the previous push will be cancelled and a new workflow run will be 153 | started. 154 | 155 | The first step of this workflow is to check if it is valid (e.g. that no 156 | workflow files have been modified). If there are workflow files that have been 157 | modified, a comment is made indicating that the workflow will not run. If 158 | both a workflow file and lesson content are modified, an error will occur. 159 | 160 | The second step (if valid) is to build the generated content from the pull 161 | request. This builds the content and uploads three artifacts: 162 | 163 | 1. The pull request number (pr) 164 | 2. A summary of changes after the rendering process (diff) 165 | 3. The rendered files (built) 166 | 167 | Because this workflow builds generated content, it follows the same general 168 | process as the `sandpaper-main` workflow with the same caching mechanisms. 169 | 170 | The artifacts produced are used by the next workflow. 171 | 172 | ### Comment on Pull Request (pr-comment.yaml) 173 | 174 | This workflow is triggered if the `pr-receive.yaml` workflow is successful. 175 | The steps in this workflow are: 176 | 177 | 1. Test if the workflow is valid and comment the result of the validation on the 178 | pull request. 179 | 2. If it is valid: create an orphan branch with two commits: the current state 180 | of the repository and the proposed changes. 181 | 3. If it is valid: update the pull request comment with the summary of changes. 182 | 183 | Importantly: if the pull request is invalid, the branch is not created, so any 184 | malicious code is not published. 185 | 186 | From here, the maintainer can request changes from the author and eventually 187 | either merge or reject the PR. When this happens, if the PR was valid, the 188 | preview branch needs to be deleted. 189 | 190 | ### Send Close PR Signal (pr-close-signal.yaml) 191 | 192 | Triggered any time a pull request is closed. This emits an artifact that is the 193 | pull request number for the next action. 194 | 195 | ### Remove Pull Request Branch (pr-post-remove-branch.yaml) 196 | 197 | Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with 198 | the pull request (if it was created).
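For maintainers who prefer the command line, both of the secrets described above can also be set with the [GitHub CLI](https://cli.github.com/). The following is a minimal sketch rather than official setup instructions; it assumes `gh` is installed and authenticated against your repository:

```bash
# Store a previously generated personal access token as the
# SANDPAPER_WORKFLOW Actions secret (gh prompts for the value).
gh secret set SANDPAPER_WORKFLOW --app actions

# Invalidate all workflow caches by setting CACHE_VERSION to today's date.
gh secret set CACHE_VERSION --app actions --body "$(date +%Y-%m-%d)"
```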
199 | -------------------------------------------------------------------------------- /.github/workflows/pr-close-signal.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Send Close Pull Request Signal" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [closed] 7 | 8 | jobs: 9 | send-close-signal: 10 | name: "Send closing signal" 11 | runs-on: ubuntu-22.04 12 | if: ${{ github.event.action == 'closed' }} 13 | steps: 14 | - name: "Create PRtifact" 15 | run: | 16 | mkdir -p ./pr 17 | printf ${{ github.event.number }} > ./pr/NUM 18 | - name: Upload Diff 19 | uses: actions/upload-artifact@v4 20 | with: 21 | name: pr 22 | path: ./pr 23 | -------------------------------------------------------------------------------- /.github/workflows/pr-comment.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Comment on the Pull Request" 2 | 3 | # read-write repo token 4 | # access to secrets 5 | on: 6 | workflow_run: 7 | workflows: ["Receive Pull Request"] 8 | types: 9 | - completed 10 | 11 | concurrency: 12 | group: pr-${{ github.event.workflow_run.pull_requests[0].number }} 13 | cancel-in-progress: true 14 | 15 | 16 | jobs: 17 | # Pull requests are valid if: 18 | # - they match the sha of the workflow run head commit 19 | # - they are open 20 | # - no .github files were committed 21 | test-pr: 22 | name: "Test if pull request is valid" 23 | runs-on: ubuntu-22.04 24 | if: > 25 | github.event.workflow_run.event == 'pull_request' && 26 | github.event.workflow_run.conclusion == 'success' 27 | outputs: 28 | is_valid: ${{ steps.check-pr.outputs.VALID }} 29 | payload: ${{ steps.check-pr.outputs.payload }} 30 | number: ${{ steps.get-pr.outputs.NUM }} 31 | msg: ${{ steps.check-pr.outputs.MSG }} 32 | steps: 33 | - name: 'Download PR artifact' 34 | id: dl 35 | uses: carpentries/actions/download-workflow-artifact@main 36 | with: 37 | run: ${{ github.event.workflow_run.id }} 38 | name: 'pr' 39 | 40 | - name: "Get PR Number" 41 | if: ${{ steps.dl.outputs.success == 'true' }} 42 | id: get-pr 43 | run: | 44 | unzip pr.zip 45 | echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT 46 | 47 | - name: "Fail if PR number was not present" 48 | id: bad-pr 49 | if: ${{ steps.dl.outputs.success != 'true' }} 50 | run: | 51 | echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.' 
52 | exit 1 53 | - name: "Get Invalid Hashes File" 54 | id: hash 55 | run: | 56 | echo "json<<EOF 57 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 58 | EOF" >> $GITHUB_OUTPUT 59 | - name: "Check PR" 60 | id: check-pr 61 | if: ${{ steps.dl.outputs.success == 'true' }} 62 | uses: carpentries/actions/check-valid-pr@main 63 | with: 64 | pr: ${{ steps.get-pr.outputs.NUM }} 65 | sha: ${{ github.event.workflow_run.head_sha }} 66 | headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire 67 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 68 | fail_on_error: true 69 | 70 | # Create an orphan branch on this repository with two commits 71 | # - the current HEAD of the md-outputs branch 72 | # - the output from running the current HEAD of the pull request through 73 | # the md generator 74 | create-branch: 75 | name: "Create Git Branch" 76 | needs: test-pr 77 | runs-on: ubuntu-22.04 78 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 79 | env: 80 | NR: ${{ needs.test-pr.outputs.number }} 81 | permissions: 82 | contents: write 83 | steps: 84 | - name: 'Checkout md outputs' 85 | uses: actions/checkout@v4 86 | with: 87 | ref: md-outputs 88 | path: built 89 | fetch-depth: 1 90 | 91 | - name: 'Download built markdown' 92 | id: dl 93 | uses: carpentries/actions/download-workflow-artifact@main 94 | with: 95 | run: ${{ github.event.workflow_run.id }} 96 | name: 'built' 97 | 98 | - if: ${{ steps.dl.outputs.success == 'true' }} 99 | run: unzip built.zip 100 | 101 | - name: "Create orphan and push" 102 | if: ${{ steps.dl.outputs.success == 'true' }} 103 | run: | 104 | cd built/ 105 | git config --local user.email "actions@github.com" 106 | git config --local user.name "GitHub Actions" 107 | CURR_HEAD=$(git rev-parse HEAD) 108 | git checkout --orphan md-outputs-PR-${NR} 109 | git add -A 110 | git commit -m "source commit: ${CURR_HEAD}" 111 | ls -A | grep -v '^.git$' | xargs -I _ rm -r '_' 112 | cd ..
113 | unzip -o -d built built.zip 114 | cd built 115 | git add -A 116 | git commit --allow-empty -m "differences for PR #${NR}" 117 | git push -u --force --set-upstream origin md-outputs-PR-${NR} 118 | 119 | # Comment on the Pull Request with a link to the branch and the diff 120 | comment-pr: 121 | name: "Comment on Pull Request" 122 | needs: [test-pr, create-branch] 123 | runs-on: ubuntu-22.04 124 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 125 | env: 126 | NR: ${{ needs.test-pr.outputs.number }} 127 | permissions: 128 | pull-requests: write 129 | steps: 130 | - name: 'Download comment artifact' 131 | id: dl 132 | uses: carpentries/actions/download-workflow-artifact@main 133 | with: 134 | run: ${{ github.event.workflow_run.id }} 135 | name: 'diff' 136 | 137 | - if: ${{ steps.dl.outputs.success == 'true' }} 138 | run: unzip ${{ github.workspace }}/diff.zip 139 | 140 | - name: "Comment on PR" 141 | id: comment-diff 142 | if: ${{ steps.dl.outputs.success == 'true' }} 143 | uses: carpentries/actions/comment-diff@main 144 | with: 145 | pr: ${{ env.NR }} 146 | path: ${{ github.workspace }}/diff.md 147 | 148 | # Comment if the PR is open and matches the SHA, but the workflow files have 149 | # changed 150 | comment-changed-workflow: 151 | name: "Comment if workflow files have changed" 152 | needs: test-pr 153 | runs-on: ubuntu-22.04 154 | if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }} 155 | env: 156 | NR: ${{ github.event.workflow_run.pull_requests[0].number }} 157 | body: ${{ needs.test-pr.outputs.msg }} 158 | permissions: 159 | pull-requests: write 160 | steps: 161 | - name: 'Check for spoofing' 162 | id: dl 163 | uses: carpentries/actions/download-workflow-artifact@main 164 | with: 165 | run: ${{ github.event.workflow_run.id }} 166 | name: 'built' 167 | 168 | - name: 'Alert if spoofed' 169 | id: spoof 170 | if: ${{ steps.dl.outputs.success == 'true' }} 171 | run: | 172 | echo 'body<<EOF' >> $GITHUB_ENV 173 | echo '' >> $GITHUB_ENV 174 | echo '## :x: DANGER :x:' >> $GITHUB_ENV 175 | echo 'This pull request has modified workflows that created output. Close this now.'
>> $GITHUB_ENV 176 | echo '' >> $GITHUB_ENV 177 | echo 'EOF' >> $GITHUB_ENV 178 | 179 | - name: "Comment on PR" 180 | id: comment-diff 181 | uses: carpentries/actions/comment-diff@main 182 | with: 183 | pr: ${{ env.NR }} 184 | body: ${{ env.body }} 185 | -------------------------------------------------------------------------------- /.github/workflows/pr-post-remove-branch.yaml: -------------------------------------------------------------------------------- 1 | name: "Bot: Remove Temporary PR Branch" 2 | 3 | on: 4 | workflow_run: 5 | workflows: ["Bot: Send Close Pull Request Signal"] 6 | types: 7 | - completed 8 | 9 | jobs: 10 | delete: 11 | name: "Delete branch from Pull Request" 12 | runs-on: ubuntu-22.04 13 | if: > 14 | github.event.workflow_run.event == 'pull_request' && 15 | github.event.workflow_run.conclusion == 'success' 16 | permissions: 17 | contents: write 18 | steps: 19 | - name: 'Download artifact' 20 | uses: carpentries/actions/download-workflow-artifact@main 21 | with: 22 | run: ${{ github.event.workflow_run.id }} 23 | name: pr 24 | - name: "Get PR Number" 25 | id: get-pr 26 | run: | 27 | unzip pr.zip 28 | echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT 29 | - name: 'Remove branch' 30 | uses: carpentries/actions/remove-branch@main 31 | with: 32 | pr: ${{ steps.get-pr.outputs.NUM }} 33 | -------------------------------------------------------------------------------- /.github/workflows/pr-preflight.yaml: -------------------------------------------------------------------------------- 1 | name: "Pull Request Preflight Check" 2 | 3 | on: 4 | pull_request_target: 5 | branches: 6 | ["main"] 7 | types: 8 | ["opened", "synchronize", "reopened"] 9 | 10 | jobs: 11 | test-pr: 12 | name: "Test if pull request is valid" 13 | if: ${{ github.event.action != 'closed' }} 14 | runs-on: ubuntu-22.04 15 | outputs: 16 | is_valid: ${{ steps.check-pr.outputs.VALID }} 17 | permissions: 18 | pull-requests: write 19 | steps: 20 | - name: "Get Invalid Hashes File" 21 | id: hash 22 | run: | 23 | echo "json<<EOF 24 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 25 | EOF" >> $GITHUB_OUTPUT 26 | - name: "Check PR" 27 | id: check-pr 28 | uses: carpentries/actions/check-valid-pr@main 29 | with: 30 | pr: ${{ github.event.number }} 31 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 32 | fail_on_error: true 33 | - name: "Comment result of validation" 34 | id: comment-diff 35 | if: ${{ always() }} 36 | uses: carpentries/actions/comment-diff@main 37 | with: 38 | pr: ${{ github.event.number }} 39 | body: ${{ steps.check-pr.outputs.MSG }} 40 | -------------------------------------------------------------------------------- /.github/workflows/pr-receive.yaml: -------------------------------------------------------------------------------- 1 | name: "Receive Pull Request" 2 | 3 | on: 4 | pull_request: 5 | types: 6 | [opened, synchronize, reopened] 7 | 8 | concurrency: 9 | group: ${{ github.ref }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | test-pr: 14 | name: "Record PR number" 15 | if: ${{ github.event.action != 'closed' }} 16 | runs-on: ubuntu-22.04 17 | outputs: 18 | is_valid: ${{ steps.check-pr.outputs.VALID }} 19 | steps: 20 | - name: "Record PR number" 21 | id: record 22 | if: ${{ always() }} 23 | run: | 24 | echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR 25 | - name: "Upload PR number" 26 | id: upload 27 | if: ${{ always() }} 28 | uses: actions/upload-artifact@v4 29 | with: 30 | name: pr 31 | path: ${{ github.workspace }}/NR 32 | - name: "Get Invalid Hashes File" 33 | id: hash 34 | run: |
35 | echo "json<<EOF 36 | $(curl -sL https://files.carpentries.org/invalid-hashes.json) 37 | EOF" >> $GITHUB_OUTPUT 38 | - name: "echo output" 39 | run: | 40 | echo "${{ steps.hash.outputs.json }}" 41 | - name: "Check PR" 42 | id: check-pr 43 | uses: carpentries/actions/check-valid-pr@main 44 | with: 45 | pr: ${{ github.event.number }} 46 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }} 47 | 48 | build-md-source: 49 | name: "Build markdown source files if valid" 50 | needs: test-pr 51 | runs-on: ubuntu-22.04 52 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }} 53 | env: 54 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 55 | RENV_PATHS_ROOT: ~/.local/share/renv/ 56 | CHIVE: ${{ github.workspace }}/site/chive 57 | PR: ${{ github.workspace }}/site/pr 58 | MD: ${{ github.workspace }}/site/built 59 | steps: 60 | - name: "Check Out Main Branch" 61 | uses: actions/checkout@v4 62 | 63 | - name: "Check Out Staging Branch" 64 | uses: actions/checkout@v4 65 | with: 66 | ref: md-outputs 67 | path: ${{ env.MD }} 68 | 69 | - name: "Set up R" 70 | uses: r-lib/actions/setup-r@v2 71 | with: 72 | use-public-rspm: true 73 | install-r: false 74 | 75 | - name: "Set up Pandoc" 76 | uses: r-lib/actions/setup-pandoc@v2 77 | 78 | - name: "Setup Lesson Engine" 79 | uses: carpentries/actions/setup-sandpaper@main 80 | with: 81 | cache-version: ${{ secrets.CACHE_VERSION }} 82 | 83 | - name: "Setup Package Cache" 84 | uses: carpentries/actions/setup-lesson-deps@main 85 | with: 86 | cache-version: ${{ secrets.CACHE_VERSION }} 87 | 88 | - name: "Validate and Build Markdown" 89 | id: build-site 90 | run: | 91 | sandpaper::package_cache_trigger(TRUE) 92 | sandpaper::validate_lesson(path = '${{ github.workspace }}') 93 | sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE) 94 | shell: Rscript {0} 95 | 96 | - name: "Generate Artifacts" 97 | id: generate-artifacts 98 | run: | 99 | sandpaper:::ci_bundle_pr_artifacts( 100 | repo = '${{ github.repository }}', 101 | pr_number = '${{ github.event.number }}', 102 | path_md = '${{ env.MD }}', 103 | path_pr = '${{ env.PR }}', 104 | path_archive = '${{ env.CHIVE }}', 105 | branch = 'md-outputs' 106 | ) 107 | shell: Rscript {0} 108 | 109 | - name: "Upload PR" 110 | uses: actions/upload-artifact@v4 111 | with: 112 | name: pr 113 | path: ${{ env.PR }} 114 | overwrite: true 115 | 116 | - name: "Upload Diff" 117 | uses: actions/upload-artifact@v4 118 | with: 119 | name: diff 120 | path: ${{ env.CHIVE }} 121 | retention-days: 1 122 | 123 | - name: "Upload Build" 124 | uses: actions/upload-artifact@v4 125 | with: 126 | name: built 127 | path: ${{ env.MD }} 128 | retention-days: 1 129 | 130 | - name: "Teardown" 131 | run: sandpaper::reset_site() 132 | shell: Rscript {0} 133 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-main.yaml: -------------------------------------------------------------------------------- 1 | name: "01 Build and Deploy Site" 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | - master 8 | schedule: 9 | - cron: '0 0 * * 2' 10 | workflow_dispatch: 11 | inputs: 12 | name: 13 | description: 'Who triggered this build?'
14 | required: true 15 | default: 'Maintainer (via GitHub)' 16 | reset: 17 | description: 'Reset cached markdown files' 18 | required: false 19 | default: false 20 | type: boolean 21 | jobs: 22 | full-build: 23 | name: "Build Full Site" 24 | 25 | # 2024-10-01: ubuntu-latest is now 24.04 and R is not installed by default in the runner image 26 | # pin to 22.04 for now 27 | runs-on: ubuntu-22.04 28 | permissions: 29 | checks: write 30 | contents: write 31 | pages: write 32 | env: 33 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 34 | RENV_PATHS_ROOT: ~/.local/share/renv/ 35 | steps: 36 | 37 | - name: "Checkout Lesson" 38 | uses: actions/checkout@v4 39 | 40 | - name: "Set up R" 41 | uses: r-lib/actions/setup-r@v2 42 | with: 43 | use-public-rspm: true 44 | install-r: false 45 | 46 | - name: "Set up Pandoc" 47 | uses: r-lib/actions/setup-pandoc@v2 48 | 49 | - name: "Setup Lesson Engine" 50 | uses: carpentries/actions/setup-sandpaper@main 51 | with: 52 | cache-version: ${{ secrets.CACHE_VERSION }} 53 | 54 | - name: "Setup Package Cache" 55 | uses: carpentries/actions/setup-lesson-deps@main 56 | with: 57 | cache-version: ${{ secrets.CACHE_VERSION }} 58 | 59 | - name: "Deploy Site" 60 | run: | 61 | reset <- "${{ github.event.inputs.reset }}" == "true" 62 | sandpaper::package_cache_trigger(TRUE) 63 | sandpaper:::ci_deploy(reset = reset) 64 | shell: Rscript {0} 65 | -------------------------------------------------------------------------------- /.github/workflows/sandpaper-version.txt: -------------------------------------------------------------------------------- 1 | 0.16.11 2 | -------------------------------------------------------------------------------- /.github/workflows/update-cache.yaml: -------------------------------------------------------------------------------- 1 | name: "03 Maintain: Update Package Cache" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 
8 | required: true 9 | default: 'monthly run' 10 | schedule: 11 | # Run every tuesday 12 | - cron: '0 0 * * 2' 13 | 14 | jobs: 15 | preflight: 16 | name: "Preflight Check" 17 | runs-on: ubuntu-22.04 18 | outputs: 19 | ok: ${{ steps.check.outputs.ok }} 20 | steps: 21 | - id: check 22 | run: | 23 | if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then 24 | echo "ok=true" >> $GITHUB_OUTPUT 25 | echo "Running on request" 26 | # using single brackets here to avoid 08 being interpreted as octal 27 | # https://github.com/carpentries/sandpaper/issues/250 28 | elif [ `date +%d` -le 7 ]; then 29 | # If the Tuesday lands in the first week of the month, run it 30 | echo "ok=true" >> $GITHUB_OUTPUT 31 | echo "Running on schedule" 32 | else 33 | echo "ok=false" >> $GITHUB_OUTPUT 34 | echo "Not Running Today" 35 | fi 36 | 37 | check_renv: 38 | name: "Check if We Need {renv}" 39 | runs-on: ubuntu-22.04 40 | needs: preflight 41 | if: ${{ needs.preflight.outputs.ok == 'true'}} 42 | outputs: 43 | needed: ${{ steps.renv.outputs.exists }} 44 | steps: 45 | - name: "Checkout Lesson" 46 | uses: actions/checkout@v4 47 | - id: renv 48 | run: | 49 | if [[ -d renv ]]; then 50 | echo "exists=true" >> $GITHUB_OUTPUT 51 | fi 52 | 53 | check_token: 54 | name: "Check SANDPAPER_WORKFLOW token" 55 | runs-on: ubuntu-22.04 56 | needs: check_renv 57 | if: ${{ needs.check_renv.outputs.needed == 'true' }} 58 | outputs: 59 | workflow: ${{ steps.validate.outputs.wf }} 60 | repo: ${{ steps.validate.outputs.repo }} 61 | steps: 62 | - name: "validate token" 63 | id: validate 64 | uses: carpentries/actions/check-valid-credentials@main 65 | with: 66 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 67 | 68 | update_cache: 69 | name: "Update Package Cache" 70 | needs: check_token 71 | if: ${{ needs.check_token.outputs.repo== 'true' }} 72 | runs-on: ubuntu-22.04 73 | env: 74 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 75 | RENV_PATHS_ROOT: ~/.local/share/renv/ 76 | steps: 77 | 78 | - name: "Checkout Lesson" 79 | uses: actions/checkout@v4 80 | 81 | - name: "Set up R" 82 | uses: r-lib/actions/setup-r@v2 83 | with: 84 | use-public-rspm: true 85 | install-r: false 86 | 87 | - name: "Update {renv} deps and determine if a PR is needed" 88 | id: update 89 | uses: carpentries/actions/update-lockfile@main 90 | with: 91 | cache-version: ${{ secrets.CACHE_VERSION }} 92 | 93 | - name: Create Pull Request 94 | id: cpr 95 | if: ${{ steps.update.outputs.n > 0 }} 96 | uses: carpentries/create-pull-request@main 97 | with: 98 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 99 | delete-branch: true 100 | branch: "update/packages" 101 | commit-message: "[actions] update ${{ steps.update.outputs.n }} packages" 102 | title: "Update ${{ steps.update.outputs.n }} packages" 103 | body: | 104 | :robot: This is an automated build 105 | 106 | This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions: 107 | 108 | ``` 109 | ${{ steps.update.outputs.report }} 110 | ``` 111 | 112 | :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates. 
113 | 114 | If you want to inspect these changes locally, you can use the following code to check out a new branch: 115 | 116 | ```bash 117 | git fetch origin update/packages 118 | git checkout update/packages 119 | ``` 120 | 121 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 122 | 123 | [1]: https://github.com/carpentries/create-pull-request/tree/main 124 | labels: "type: package cache" 125 | draft: false 126 | -------------------------------------------------------------------------------- /.github/workflows/update-workflows.yaml: -------------------------------------------------------------------------------- 1 | name: "02 Maintain: Update Workflow Files" 2 | 3 | on: 4 | workflow_dispatch: 5 | inputs: 6 | name: 7 | description: 'Who triggered this build (enter github username to tag yourself)?' 8 | required: true 9 | default: 'weekly run' 10 | clean: 11 | description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)' 12 | required: false 13 | default: '.yaml' 14 | schedule: 15 | # Run every Tuesday 16 | - cron: '0 0 * * 2' 17 | 18 | jobs: 19 | check_token: 20 | name: "Check SANDPAPER_WORKFLOW token" 21 | runs-on: ubuntu-22.04 22 | outputs: 23 | workflow: ${{ steps.validate.outputs.wf }} 24 | repo: ${{ steps.validate.outputs.repo }} 25 | steps: 26 | - name: "validate token" 27 | id: validate 28 | uses: carpentries/actions/check-valid-credentials@main 29 | with: 30 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 31 | 32 | update_workflow: 33 | name: "Update Workflow" 34 | runs-on: ubuntu-22.04 35 | needs: check_token 36 | if: ${{ needs.check_token.outputs.workflow == 'true' }} 37 | steps: 38 | - name: "Checkout Repository" 39 | uses: actions/checkout@v4 40 | 41 | - name: Update Workflows 42 | id: update 43 | uses: carpentries/actions/update-workflows@main 44 | with: 45 | clean: ${{ github.event.inputs.clean }} 46 | 47 | - name: Create Pull Request 48 | id: cpr 49 | if: "${{ steps.update.outputs.new }}" 50 | uses: carpentries/create-pull-request@main 51 | with: 52 | token: ${{ secrets.SANDPAPER_WORKFLOW }} 53 | delete-branch: true 54 | branch: "update/workflows" 55 | commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}" 56 | title: "Update Workflows to Version ${{ steps.update.outputs.new }}" 57 | body: | 58 | :robot: This is an automated build 59 | 60 | Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }} 61 | 62 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }} 63 | 64 | [1]: https://github.com/carpentries/create-pull-request/tree/main 65 | labels: "type: template and tools" 66 | draft: false 67 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # sandpaper files 2 | episodes/*html 3 | site/* 4 | !site/README.md 5 | 6 | # History files 7 | .Rhistory 8 | .Rapp.history 9 | # Session Data files 10 | .RData 11 | # User-specific files 12 | .Ruserdata 13 | # Example code in package build process 14 | *-Ex.R 15 | # Output files from R CMD build 16 | /*.tar.gz 17 | # Output files from R CMD check 18 | /*.Rcheck/ 19 | # RStudio files 20 | .Rproj.user/ 21 | # produced vignettes 22 | vignettes/*.html 23 | vignettes/*.pdf 24 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3 25 | .httr-oauth 26 | # knitr and R markdown default cache directories 27 | *_cache/ 28 | 
/cache/ 29 | # Temporary files created by R markdown 30 | *.utf8.md 31 | *.knit.md 32 | # R Environment Variables 33 | .Renviron 34 | # pkgdown site 35 | docs/ 36 | # translation temp files 37 | po/*~ 38 | # renv detritus 39 | renv/sandbox/ 40 | *.pyc 41 | *~ 42 | .DS_Store 43 | .ipynb_checkpoints 44 | .sass-cache 45 | .jekyll-cache/ 46 | .jekyll-metadata 47 | __pycache__ 48 | _site 49 | .Rproj.user 50 | .bundle/ 51 | .vendor/ 52 | vendor/ 53 | .docker-vendor/ 54 | Gemfile.lock 55 | .*history 56 | -------------------------------------------------------------------------------- /.zenodo.json: -------------------------------------------------------------------------------- 1 | { 2 | "contributors": [ 3 | { 4 | "type": "Editor", 5 | "name": "Annajiat Alim Rasel", 6 | "orcid": "0000-0003-0198-3734" 7 | } 8 | ], 9 | "creators": [ 10 | { 11 | "name": "Annajiat Alim Rasel", 12 | "orcid": "0000-0003-0198-3734" 13 | }, 14 | { 15 | "name": "Peter Smyth" 16 | }, 17 | { 18 | "name": "Sarah M Brown", 19 | "orcid": "0000-0001-5728-0822" 20 | }, 21 | { 22 | "name": "Stephen Edward Childs", 23 | "orcid": "0000-0002-4450-4281" 24 | }, 25 | { 26 | "name": "Vini Salazar", 27 | "orcid": "0000-0002-8362-3195" 28 | }, 29 | { 30 | "name": "Scott Carl Peterson", 31 | "orcid": "0000-0002-1920-616X" 32 | }, 33 | { 34 | "name": "Geoffrey Boushey" 35 | }, 36 | { 37 | "name": "Christopher Erdmann", 38 | "orcid": "0000-0003-2554-180X" 39 | }, 40 | { 41 | "name": "Katrin Tirok", 42 | "orcid": "0000-0002-5040-9838" 43 | }, 44 | { 45 | "name": "Katrin Leinweber", 46 | "orcid": "0000-0001-5135-5758" 47 | }, 48 | { 49 | "name": "joelostblom" 50 | }, 51 | { 52 | "name": "tg340" 53 | }, 54 | { 55 | "name": "Tejaswinee Kelkar", 56 | "orcid": "0000-0002-2324-6850" 57 | }, 58 | { 59 | "name": "Yee Mey Seah", 60 | "orcid": "0000-0002-5616-021X" 61 | }, 62 | { 63 | "name": "Benjamin Tovar", 64 | "orcid": "0000-0002-5294-2281" 65 | }, 66 | { 67 | "name": "Tadiwanashe Gutsa", 68 | "orcid": "0000-0002-6871-0899" 69 | }, 70 | { 71 | "name": "crahal" 72 | }, 73 | { 74 | "name": "Jacob Deppen" 75 | }, 76 | { 77 | "name": "Karen Word", 78 | "orcid": "0000-0002-7294-7231" 79 | }, 80 | { 81 | "name": "Katrin Tirok" 82 | }, 83 | { 84 | "name": "Kevan Swanberg" 85 | }, 86 | { 87 | "name": "Tim Young" 88 | }, 89 | { 90 | "name": "Steve Haddock" 91 | }, 92 | { 93 | "name": "sadkate" 94 | }, 95 | { 96 | "name": "Sanjay Fuloria", 97 | "orcid": "0000-0002-4185-0541" 98 | } 99 | ], 100 | "license": { 101 | "id": "CC-BY-4.0" 102 | } 103 | } -------------------------------------------------------------------------------- /AUTHORS.md: -------------------------------------------------------------------------------- 1 | * FIXME: list authors' names and email addresses. 2 | * FIXME: https://github.com/datacarpentry/python-socialsci/blob/gh-pages/.github/workflows/generate_author_md_yml 3 | * LOCALSOURCE: git log | grep Author: | sort | uniq 4 | * DATED: 2021020402 5 | # STATIC LIST: 6 | * Abigail Cabunoc 7 | * Abigail Cabunoc 8 | * Allen Lee 9 | * Andrew Sanchez 10 | * Andy Boughton 11 | * Annajiat Alim Rasel 12 | * Belinda Weaver 13 | * beroe 14 | * Bill Mills 15 | * Brandon Curtis 16 | * Chris Erdmann 17 | * David Mawdsley 18 | * David Perez Suarez 19 | * EDUB <35309993+eemdub@users.noreply.github.com> 20 | * Erin Becker 21 | * ErinBecker 22 | * evanwill 23 | * Francois Michonneau 24 | * Francois Michonneau 25 | * François Michonneau 26 | * Gabriel A. 
Devenyi 27 | * Geoffrey Boushey 28 | * Greg Wilson 29 | * Greg Wilson 30 | * Ian Carroll 31 | * Ian Lee 32 | * Jacob Deppen 33 | * James Allen 34 | * jcoliver 35 | * Joel Nothman 36 | * joelostblom 37 | * Jon Pipitone 38 | * Jonah Duckles 39 | * Joseph Stachelek 40 | * jsta 41 | * karenword 42 | * Katrin Leinweber <9948149+katrinleinweber@users.noreply.github.com> 43 | * Katrin Leinweber 44 | * Katrin Tirok <11316788+katrintirok@users.noreply.github.com> 45 | * Katrin Tirok 46 | * Katrin Tirok 47 | * Katrin Tirok 48 | * Maxim Belkin 49 | * Maxim Belkin 50 | * Michael Hansen 51 | * Michael R. Crusoe <1330696+mr-c@users.noreply.github.com> 52 | * Michael R. Crusoe 53 | * mzc9 54 | * naught101 55 | * Nick Young 56 | * Nick Young 57 | * PeterSmyth12 58 | * Piotr Banaszkiewicz 59 | * Raniere Silva 60 | * Raniere Silva 61 | * Raniere Silva 62 | * Rémi Emonet 63 | * Rémi Emonet 64 | * Remi Rampin 65 | * sadkate <63014805+sadkate@users.noreply.github.com> 66 | * Sarah Brown 67 | * Sarah Brown 68 | * Scott Peterson 69 | * Stephen Childs 70 | * Tadigutsah <43744878+Tadigutsah@users.noreply.github.com> 71 | * Tejaswinee K 72 | * tg340 <33851795+tg340@users.noreply.github.com> 73 | * Tim Young 74 | * Timothée Poisot 75 | * Toby Hodges 76 | * Tracy Teal 77 | * Tracy Teal 78 | * trk 79 | * W. Trevor King 80 | * William L. Close 81 | * William L. Close 82 | * Yee Mey 83 | * ymseah 84 | -------------------------------------------------------------------------------- /CITATION: -------------------------------------------------------------------------------- 1 | FIXME: describe how to cite this lesson. 2 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Contributor Code of Conduct" 3 | --- 4 | 5 | As contributors and maintainers of this project, 6 | we pledge to follow the [The Carpentries Code of Conduct][coc]. 7 | 8 | Instances of abusive, harassing, or otherwise unacceptable behavior 9 | may be reported by following our [reporting guidelines][coc-reporting]. 10 | 11 | 12 | [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html 13 | [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html 14 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing 2 | 3 | [The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data 4 | Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source 5 | projects, and we welcome contributions of all kinds: new lessons, fixes to 6 | existing material, bug reports, and reviews of proposed changes are all 7 | welcome. 8 | 9 | ### Contributor Agreement 10 | 11 | By contributing, you agree that we may redistribute your work under [our 12 | license](LICENSE.md). In exchange, we will address your issues and/or assess 13 | your change proposal as promptly as we can, and help you become a member of our 14 | community. Everyone involved in [The Carpentries][cp-site] agrees to abide by 15 | our [code of conduct](CODE_OF_CONDUCT.md). 16 | 17 | ### How to Contribute 18 | 19 | The easiest way to get started is to file an issue to tell us about a spelling 20 | mistake, some awkward wording, or a factual error. This is a good way to 21 | introduce yourself and to meet some of our community members. 22 | 23 | 1. 
If you do not have a [GitHub][github] account, you can [send us comments by 24 | email][contact]. However, we will be able to respond more quickly if you use 25 | one of the other methods described below. 26 | 27 | 2. If you have a [GitHub][github] account, or are willing to [create 28 | one][github-join], but do not know how to use Git, you can report problems 29 | or suggest improvements by [creating an issue][issues]. This allows us to 30 | assign the item to someone and to respond to it in a threaded discussion. 31 | 32 | 3. If you are comfortable with Git, and would like to add or change material, 33 | you can submit a pull request (PR). Instructions for doing this are 34 | [included below](#using-github). 35 | 36 | Note: if you want to build the website locally, please refer to [The Workbench 37 | documentation][template-doc]. 38 | 39 | ### Where to Contribute 40 | 41 | 1. If you wish to change this lesson, add issues and pull requests here. 42 | 2. If you wish to change the template used for workshop websites, please refer 43 | to [The Workbench documentation][template-doc]. 44 | 45 | 46 | ### What to Contribute 47 | 48 | There are many ways to contribute, from writing new exercises and improving 49 | existing ones to updating or filling in the documentation and submitting [bug 50 | reports][issues] about things that do not work, are not clear, or are missing. 51 | If you are looking for ideas, please see [the list of issues for this 52 | repository][repo], or the issues for [Data Carpentry][dc-issues], [Library 53 | Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects. 54 | 55 | Comments on issues and reviews of pull requests are just as welcome: we are 56 | smarter together than we are on our own. **Reviews from novices and newcomers 57 | are particularly valuable**: it's easy for people who have been using these 58 | lessons for a while to forget how impenetrable some of this material can be, so 59 | fresh eyes are always welcome. 60 | 61 | ### What *Not* to Contribute 62 | 63 | Our lessons already contain more material than we can cover in a typical 64 | workshop, so we are usually *not* looking for more concepts or tools to add to 65 | them. As a rule, if you want to introduce a new idea, you must (a) estimate how 66 | long it will take to teach and (b) explain what you would take out to make room 67 | for it. The first encourages contributors to be honest about requirements; the 68 | second, to think hard about priorities. 69 | 70 | We are also not looking for exercises or other material that only run on one 71 | platform. Our workshops typically contain a mixture of Windows, macOS, and 72 | Linux users; in order to be usable, our lessons must run equally well on all 73 | three. 74 | 75 | ### Using GitHub 76 | 77 | If you choose to contribute via GitHub, you may want to look at [How to 78 | Contribute to an Open Source Project on GitHub][how-contribute]. In brief, we 79 | use [GitHub flow][github-flow] to manage changes: 80 | 81 | 1. Create a new branch in your desktop copy of this repository for each 82 | significant change. 83 | 2. Commit the change in that branch. 84 | 3. Push that branch to your fork of this repository on GitHub. 85 | 4. Submit a pull request from that branch to the [upstream repository][repo]. 86 | 5. If you receive feedback, make changes on your desktop and push to your 87 | branch on GitHub: the pull request will update automatically. 88 | 89 | NB: The published copy of the lesson is usually in the `main` branch. 
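As a concrete illustration, the flow described above might look like the following in a terminal; the branch name, file, and commit message here are made-up examples:

```bash
# 1. create a new branch for the change
git checkout -b improve-introduction-wording

# 2. commit the change in that branch
git add episodes/01-introduction.md
git commit -m "Clarify wording in the introduction episode"

# 3. push that branch to your fork on GitHub
git push origin improve-introduction-wording

# 4. open a pull request from that branch to the upstream repository,
#    for example via the "Compare & pull request" button on GitHub
```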
90 | 91 | Each lesson has a team of maintainers who review issues and pull requests or 92 | encourage others to do so. The maintainers are community volunteers, and have 93 | final say over what gets merged into the lesson. 94 | 95 | ### Other Resources 96 | 97 | The Carpentries is a global organisation with volunteers and learners all over 98 | the world. We share values of inclusivity and a passion for sharing knowledge, 99 | teaching and learning. There are several ways to connect with The Carpentries 100 | community listed at including via social 101 | media, slack, newsletters, and email lists. You can also [reach us by 102 | email][contact]. 103 | 104 | [repo]: https://example.com/FIXME 105 | [contact]: mailto:team@carpentries.org 106 | [cp-site]: https://carpentries.org/ 107 | [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry 108 | [dc-lessons]: https://datacarpentry.org/lessons/ 109 | [dc-site]: https://datacarpentry.org/ 110 | [discuss-list]: https://lists.software-carpentry.org/listinfo/discuss 111 | [github]: https://github.com 112 | [github-flow]: https://guides.github.com/introduction/flow/ 113 | [github-join]: https://github.com/join 114 | [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github 115 | [issues]: https://carpentries.org/help-wanted-issues/ 116 | [lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry 117 | [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry 118 | [swc-lessons]: https://software-carpentry.org/lessons/ 119 | [swc-site]: https://software-carpentry.org/ 120 | [lc-site]: https://librarycarpentry.org/ 121 | [template-doc]: https://carpentries.github.io/workbench/ 122 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Licenses" 3 | --- 4 | 5 | ## Instructional Material 6 | 7 | All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) 8 | instructional material is made available under the [Creative Commons 9 | Attribution license][cc-by-human]. The following is a human-readable summary of 10 | (and not a substitute for) the [full legal text of the CC BY 4.0 11 | license][cc-by-legal]. 12 | 13 | You are free: 14 | 15 | - to **Share**---copy and redistribute the material in any medium or format 16 | - to **Adapt**---remix, transform, and build upon the material 17 | 18 | for any purpose, even commercially. 19 | 20 | The licensor cannot revoke these freedoms as long as you follow the license 21 | terms. 22 | 23 | Under the following terms: 24 | 25 | - **Attribution**---You must give appropriate credit (mentioning that your work 26 | is derived from work that is Copyright (c) The Carpentries and, where 27 | practical, linking to ), provide a [link to the 28 | license][cc-by-human], and indicate if changes were made. You may do so in 29 | any reasonable manner, but not in any way that suggests the licensor endorses 30 | you or your use. 31 | 32 | - **No additional restrictions**---You may not apply legal terms or 33 | technological measures that legally restrict others from doing anything the 34 | license permits. With the understanding that: 35 | 36 | Notices: 37 | 38 | * You do not have to comply with the license for elements of the material in 39 | the public domain or where your use is permitted by an applicable exception 40 | or limitation. 41 | * No warranties are given. 
The license may not give you all of the permissions 42 | necessary for your intended use. For example, other rights such as publicity, 43 | privacy, or moral rights may limit how you use the material. 44 | 45 | ## Software 46 | 47 | Except where otherwise noted, the example programs and other software provided 48 | by The Carpentries are made available under the [OSI][osi]-approved [MIT 49 | license][mit-license]. 50 | 51 | Permission is hereby granted, free of charge, to any person obtaining a copy of 52 | this software and associated documentation files (the "Software"), to deal in 53 | the Software without restriction, including without limitation the rights to 54 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 55 | of the Software, and to permit persons to whom the Software is furnished to do 56 | so, subject to the following conditions: 57 | 58 | The above copyright notice and this permission notice shall be included in all 59 | copies or substantial portions of the Software. 60 | 61 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 62 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 63 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 64 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 65 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 66 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 67 | SOFTWARE. 68 | 69 | ## Trademark 70 | 71 | "The Carpentries", "Software Carpentry", "Data Carpentry", and "Library 72 | Carpentry" and their respective logos are registered trademarks of 73 | [The Carpentries, Inc.][carpentries]. 74 | 75 | [cc-by-human]: https://creativecommons.org/licenses/by/4.0/ 76 | [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode 77 | [mit-license]: https://opensource.org/licenses/mit-license.html 78 | [carpentries]: https://carpentries.org 79 | [osi]: https://opensource.org 80 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Create a Slack Account with us](https://img.shields.io/badge/Create_Slack_Account-The_Carpentries-071159.svg)](https://slack-invite.carpentries.org/) 2 | [![Slack Status](https://img.shields.io/badge/Slack_Channel-dc--socsci--py-E01563.svg)](https://carpentries.slack.com/messages/C9WJEBW01) 3 | 4 | # Data Carpentry Python Lesson with Social Science Data 5 | 6 | Data Carpentry Lesson on Python for social scientists and others based on the data sources mentioned below. Please see our [contribution guidelines](CONTRIBUTING.md) for information on how to contribute updates, bug fixes, or other corrections. 7 | 8 | ### Data 9 | 10 | - [The SAFI Teaching Database](https://datacarpentry.org/socialsci-workshop/data/) 11 | - Hansard Society, Parliament and Government Programme. (2014). Audit of Political Engagement 11, 2013. [data collection]. UK Data Service. SN: 7577, [http://doi.org/10.5255/UKDA-SN-7577-1](https://doi.org/10.5255/UKDA-SN-7577-1). 
Contains public sector information licensed under the [Open Government Licence v2.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/2.0) 12 | 13 | ### Maintainers 14 | 15 | - Stephen Childs ([@sechilds](https://github.com/sechilds)) 16 | - Geoffrey Boushey ([@gboushey](https://github.com/gboushey)) 17 | - Annajiat Alim Rasel ([@annajiat](https://github.com/annajiat)) 18 | 19 | 20 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | #------------------------------------------------------------ 2 | # Values for this lesson. 3 | #------------------------------------------------------------ 4 | 5 | # Which carpentry is this (swc, dc, lc, or cp)? 6 | # swc: Software Carpentry 7 | # dc: Data Carpentry 8 | # lc: Library Carpentry 9 | # cp: Carpentries (to use for instructor training for instance) 10 | # incubator: The Carpentries Incubator 11 | carpentry: 'dc' 12 | 13 | # Overall title for pages. 14 | title: 'Data Analysis and Visualization with Python for Social Scientists *alpha*' 15 | 16 | # Date the lesson was created (YYYY-MM-DD, this is empty by default) 17 | created: '2017-05-25' 18 | 19 | # Comma-separated list of keywords for the lesson 20 | keywords: 'software, data, lesson, The Carpentries' 21 | 22 | # Life cycle stage of the lesson 23 | # possible values: pre-alpha, alpha, beta, stable 24 | life_cycle: 'alpha' 25 | 26 | # License of the lesson materials (recommended CC-BY 4.0) 27 | license: 'CC-BY 4.0' 28 | 29 | # Link to the source repository for this lesson 30 | source: 'https://github.com/datacarpentry/python-socialsci' 31 | 32 | # Default branch of your lesson 33 | branch: 'main' 34 | 35 | # Who to contact if there are any issues 36 | contact: 'team@carpentries.org' 37 | 38 | # Navigation ------------------------------------------------ 39 | # 40 | # Use the following menu items to specify the order of 41 | # individual pages in each dropdown section. Leave blank to 42 | # include all pages in the folder. 43 | # 44 | # Example ------------- 45 | # 46 | # episodes: 47 | # - introduction.md 48 | # - first-steps.md 49 | # 50 | # learners: 51 | # - setup.md 52 | # 53 | # instructors: 54 | # - instructor-notes.md 55 | # 56 | # profiles: 57 | # - one-learner.md 58 | # - another-learner.md 59 | 60 | # Order of episodes in your lesson 61 | episodes: 62 | - 01-introduction.md 63 | - 02-basics.md 64 | - 03-control-structures.md 65 | - 04-reusable.md 66 | - 05-processing-data-from-file.md 67 | - 06-date-and-time.md 68 | - 07-json.md 69 | - 08-Pandas.md 70 | - 09-extracting-data.md 71 | - 10-aggregations.md 72 | - 11-joins.md 73 | - 12-long-and-wide.md 74 | - 13-matplotlib.md 75 | - 14-sqlite.md 76 | 77 | # Information for Learners 78 | learners: 79 | 80 | # Information for Instructors 81 | instructors: 82 | 83 | # Learner Profiles 84 | profiles: 85 | 86 | # Customisation --------------------------------------------- 87 | # 88 | # This space below is where custom yaml items (e.g. 
pinning 89 | # sandpaper and varnish versions) should live 90 | 91 | 92 | url: 'https://datacarpentry.github.io/python-socialsci' 93 | analytics: carpentries 94 | lang: en 95 | -------------------------------------------------------------------------------- /episodes/01-introduction.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Introduction to Python 3 | teaching: 15 4 | exercises: 0 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Examine the Python interpreter 10 | - Recognize the advantage of using the Python programming language 11 | - Understand the concept and benefits of using notebooks for coding 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - Why learn Python? 18 | - What are Jupyter notebooks? 19 | 20 | :::::::::::::::::::::::::::::::::::::::::::::::::: 21 | 22 | ## Introducing the Python programming language 23 | 24 | Python is a general purpose programming language. It is an interpreted language, 25 | which makes it suitable for rapid development and prototyping of programming segments or complete 26 | small programs. 27 | 28 | Python's main advantages: 29 | 30 | - Open source software, supported by [Python Software 31 | Foundation](https://www.python.org/psf/) 32 | - Available on all major platforms (Windows, macOS, Linux) 33 | - It is a good language for new programmers to learn due to its straightforward, 34 | object-oriented style 35 | - It is well-structured, which aids readability 36 | - It is extensible (i.e. modifiable) and is supported by a large community who 37 | provide a comprehensive range of 3rd party packages 38 | 39 | ## Interpreted vs. compiled languages 40 | 41 | In any programming language, the code must be translated into "machine code" 42 | before running it. It is the machine code which is executed and produces 43 | results. In a language like C++ your code is translated into machine code and 44 | stored in a separate file, in a process referred to as **compiling** the code. 45 | You then execute the machine code from the file as a separate step. This is 46 | efficient if you intend to run the same machine code many times as you only have 47 | to compile it once and it is very fast to run the compiled machine code. 48 | 49 | On the other hand, if you are experimenting, then your 50 | code will change often and you would have to compile it again every time before 51 | the machine can execute it. This is where **interpreted** languages have the 52 | advantage. You don't need a complete compiled program to "run" what has been 53 | written so far and see the results. This rapid prototyping is helped further by 54 | use of a system called REPL. 55 | 56 | ## REPL 57 | 58 | **REPL** is an acronym which stands for Read, Evaluate, Print and Loop. 59 | 60 | REPL allows you to write single statements of code, have them executed, and if 61 | there are any results to show, they are displayed and then the interpreter loops 62 | back to the beginning and waits for the next program statement. 63 | 64 | ![](fig/Python_repl_3.png){alt='Python\_Repl'} 65 | 66 | In the example above, two variables `a` and `b` have been created, assigned to values 67 | `2` and `3`, and then multiplied together. 68 | 69 | Every time you press Return, the line is interpreted. The assignment statements don't produce any 70 | output so you get only the standard `>>>` prompt. 
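Typed into the interpreter, the session shown in the screenshot above looks like this (the `>>>` prompt is printed by Python, not typed by you):

```python
>>> a = 2
>>> b = 3
>>> a * b
6
```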
71 | 72 | For the `a*b` statement (it is more of an expression than a statement), because 73 | the result is not being assigned to a variable, the REPL displays the result of 74 | the calculation on screen and then waits for the next input. 75 | 76 | The REPL system makes it very easy to try out small chunks of code. 77 | 78 | You are not restricted to single line statements. If the Python interpreter 79 | decides that what you have written on a line cannot be a complete statement it 80 | will give you a continuation prompt of `...` until you complete the statement. 81 | 82 | ## Introducing Jupyter notebooks 83 | 84 | [**Jupyter**](https://jupyter.org/) originates from IPython, an effort to make Python 85 | development more interactive. Since its inception, the scope of the project 86 | has expanded to include **Ju**lia, **Pyt**hon, and **R**, so the name was changed to "Jupyter" 87 | as a reference to these core languages. Today, Jupyter supports even more 88 | languages, but we will be using it only for Python code. Specifically, we will 89 | be using **Jupyter notebooks**, which allows us to easily take notes about 90 | our analysis and view plots within the same document where we code. This 91 | facilitates sharing and reproducibility of analyses, and the notebook interface 92 | is easily accessible through any web browser. Jupyter notebooks are started 93 | from the terminal using 94 | 95 | ```bash 96 | $ jupyter notebook 97 | ``` 98 | 99 | Your browser should start automatically and look 100 | something like this: 101 | 102 | ![](fig/Python_jupyter_6.png){alt='Jupyter\_notebook\_list'} 103 | 104 | When you create a notebook from the *New* option, the new notebook will be displayed in a new 105 | browser tab and look like this. 106 | 107 | ![](fig/Python_jupyter_7.png){alt='Jupyter\_notebook'} 108 | 109 | Initially the notebook has no name other than 'Untitled'. If you click on 'Untitled' you will be 110 | given the option of changing the name to whatever you want. 111 | 112 | The notebook is divided into **cells**. Initially there will be a single input cell marked by `In [ ]:`. 113 | 114 | You can type Python code directly into the cell. You can split the code across 115 | several lines as needed. Unlike the REPL we looked at before, the code is not 116 | interpreted line by line. To interpret the code in a cell, you can click the 117 | *Run* button in the toolbar or from the *Cell* menu option, or use keyboard 118 | shortcuts (e.g., Shift\+Return). All of the code in that cell will then be 119 | executed. 120 | 121 | The results are shown in a separate `Out [1]:` cell immediately below. A new input 122 | cell (`In [ ]:`) is created for you automatically. 123 | 124 | ![](fig/Python_jupyter_8.png){alt='Jupyter\_notebook\_cell'} 125 | 126 | When a cell is run, it is given a number along with the corresponding output 127 | cell. If you have a notebook with many cells in it you can run the cells in any 128 | order and also run the same cell many times. The number on the left hand side of 129 | the input cells increments, so you can always tell the order in which they were 130 | run. For example, a cell marked `In [5]:` was the fifth cell run in the sequence. 131 | 132 | Although there is an option to do so on the toolbar, you do not have to manually 133 | save the notebook. This is done automatically by the Jupyter system. 134 | 135 | Not only are the contents of the `In [ ]:` cells saved, but so are the `Out [ ]:` cells. 
This allows you to create complete documents with both your code and the output
137 | of the code in a single place. You can also change the cell type from
138 | Python code to Markdown using the *Cell > Cell Type* option. [**Markdown**](https://en.wikipedia.org/wiki/Markdown) is
139 | a simple formatting system which allows you to create documentation for your
140 | code, again all within the same notebook structure.
141 | 
142 | The notebook itself is stored as a specially formatted text file with an `.ipynb`
143 | extension. These files can be opened and run by others with Jupyter installed. This allows you to
144 | share your code inputs, outputs, and
145 | Markdown documentation with others. You can also export the notebook to HTML, PDF, and
146 | many other formats to make sharing even easier.
147 | 
148 | :::::::::::::::::::::::::::::::::::::::: keypoints
149 | 
150 | - Python is an interpreted language
151 | - The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
152 | - Jupyter notebooks build on the REPL concept and allow code results and documentation to be maintained together and shared
153 | - The Jupyter notebook is a complete IDE (Integrated Development Environment)
154 | 
155 | ::::::::::::::::::::::::::::::::::::::::::::::::::
156 | 
157 | 
158 | -------------------------------------------------------------------------------- /episodes/03-control-structures.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Python control structures
3 | teaching: 20
4 | exercises: 25
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Change program flow using available language constructs
10 | - Demonstrate how to execute a section of code a fixed number of times
11 | - Demonstrate how to conditionally execute a section of code
12 | - Demonstrate how to execute a section of code on a list of items
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - What constructs are available for changing the flow of a program?
19 | - How can I repeat an action many times?
20 | - How can I perform the same task(s) on a set of items?
21 | 
22 | ::::::::::::::::::::::::::::::::::::::::::::::::::
23 | 
24 | ## Programs are rarely linear
25 | 
26 | Most programs do not work by executing a simple sequential set of statements. The code is constructed so that decisions and different paths through the program can be taken based on changes in variable values.
27 | 
28 | To make this possible, all programming languages have a set of control structures which allow this to happen.
29 | 
30 | In this episode we are going to look at how we can create loops and branches in our Python code.
31 | Specifically we will look at three control structures, namely:
32 | 
33 | - if..else..
34 | - while...
35 | - for ...
36 | 
37 | ## The `if` statement and variants
38 | 
39 | The simple `if` statement allows the program to branch based on the evaluation of an expression.
40 | 
41 | The basic format of the `if` statement is:
42 | 
43 | ```python
44 | if expression :
45 |     statement 1
46 |     statement 2
47 |     ...
48 |     statement n
49 | 
50 | statement always executed
51 | ```
52 | 
53 | If the expression evaluates to `True` then the statements 1 to n will be executed followed by `statement always executed`. If the expression is `False`, only `statement always executed` is executed.
Python knows which lines of code are related to the `if` statement by the indentation, no extra syntax is necessary. 54 | 55 | Below are some examples: 56 | 57 | ```python 58 | print("\nExample 1\n") 59 | 60 | value = 5 61 | threshold= 4 62 | print("value is", value, "threshold is ",threshold) 63 | if value > threshold : 64 | print(value, "is bigger than ", threshold) 65 | 66 | print("\nExample 2\n") 67 | 68 | 69 | high_threshold = 6 70 | print("value is", value, "new threshold is ",high_threshold) 71 | if value > high_threshold : 72 | print(value , "is above ", high_threshold, "threshold") 73 | 74 | print("\nExample 3\n") 75 | 76 | 77 | mid_threshold = 5 78 | print("value is", value, "final threshold is ",mid_threshold) 79 | if value == mid_threshold : 80 | print("value, ", value, " and threshold,", mid_threshold, ", are equal") 81 | ``` 82 | 83 | ```output 84 | Example 1 85 | 86 | value is 5 threshold is 4 87 | 5 is bigger than 4 88 | 89 | Example 2 90 | 91 | value is 5 new threshold is 6 92 | 93 | Example 3 94 | 95 | value is 5 final threshold is 5 96 | value, 5, and threshold, 5, are equal 97 | ``` 98 | 99 | In the examples above there are three things to notice: 100 | 101 | 1. The colon `:` at the end of the `if` line. Leaving this out is a common error. 102 | 2. The indentation of the print statement. If you remembered the `:` on the line before, Jupyter (or any other Python IDE) will automatically do the indentation for you. All of the statements indented at this level are considered to be part of the `if` statement. This is a feature fairly unique to Python, that it cares about the indentation. If there is too much, or too little indentation, you will get an error. 103 | 3. The `if` statement is ended by removing the indent. There is no explicit end to the `if` statement as there is in many other programming languages 104 | 105 | In the last example, notice that in Python the operator used to check equality is `==`. 106 | 107 | ::::::::::::::::::::::::::::::::::::::: challenge 108 | 109 | ## Exercise 110 | 111 | Add another if statement to example 2 that will check if b is greater than or equal to a 112 | 113 | ::::::::::::::: solution 114 | 115 | ## Solution 116 | 117 | ```python 118 | print("\nExample 2a\n") 119 | 120 | a= 3 121 | b= 4 122 | print("a is", a, "b is",b) 123 | if a > b : 124 | print(a, "is bigger than ", b) 125 | if a <= b : 126 | print(b, "is bigger than or equal to ", a) 127 | ``` 128 | 129 | ::::::::::::::::::::::::: 130 | 131 | :::::::::::::::::::::::::::::::::::::::::::::::::: 132 | 133 | Instead of using two separate `if` statements to decide which is larger we can use the `if ... else ...` construct 134 | 135 | ```python 136 | # if ... else ... 137 | 138 | value = 4 139 | threshold = 5 140 | print("value = ", value, "and threshold = ", threshold) 141 | 142 | if value > threshold : 143 | print("above threshold") 144 | else : 145 | print("below threshold") 146 | ``` 147 | 148 | ```output 149 | value = 4 and threshold = 5 150 | below threshold 151 | ``` 152 | 153 | ::::::::::::::::::::::::::::::::::::::: challenge 154 | 155 | ## Exercise 156 | 157 | Repeat above with different operators '\<' , '==' 158 | 159 | 160 | :::::::::::::::::::::::::::::::::::::::::::::::::: 161 | 162 | A further extension of the `if` statement is the `if ... elif ...else` version. 163 | 164 | The example below allows you to be more specific about the comparison of a and b. 165 | 166 | ```python 167 | # if ... elif ... else ... 
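# (there is no explicit 'endif' in Python; as with the plain if statement, the block ends when the indentation ends)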
168 | 
169 | a = 5
170 | b = 4
171 | print("a = ", a, "and b = ", b)
172 | 
173 | if a > b :
174 |     print(a, " is greater than ", b)
175 | elif a == b :
176 |     print(a, " equals ", b)
177 | else :
178 |     print(a, " is less than ", b)
179 | ```
180 | 
181 | ```output
182 | a = 5 and b = 4
183 | 5 is greater than 4
184 | ```
185 | 
186 | The overall structure is similar to the `if ... else` statement. There are three additional things to notice:
187 | 
188 | 1. Each `elif` clause has its own test expression.
189 | 2. You can have as many `elif` clauses as you need
190 | 3. Execution of the whole statement stops after an `elif` expression is found to be True. Therefore the ordering of the `elif` clauses can be significant.
191 | 
192 | ## The `while` loop
193 | 
194 | The while loop is used to repeatedly execute lines of code until some condition becomes False.
195 | 
196 | For the loop to terminate, there has to be something in the code which will potentially change the condition.
197 | 
198 | ```python
199 | # while loop
200 | n = 10
201 | cur_sum = 0
202 | # sum of n numbers
203 | i = 1
204 | while i <= n :
205 |     cur_sum = cur_sum + i
206 |     i = i + 1
207 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
208 | ```
209 | 
210 | ```output
211 | The sum of the numbers from 1 to 10 is 55
212 | ```
213 | 
214 | Points to note:
215 | 
216 | 1. The condition clause (i \<= n) in the while statement can be anything which when evaluated would return a Boolean value of either True or False. Initially i has been set to 1 (before the start of the loop) and therefore the condition is `True`.
217 | 2. The clause can be made more complex by use of parentheses and `and` and `or` operators amongst others
218 | 3. The statements after the while clause are only executed if the condition evaluates as True.
219 | 4. Within the statements after the while clause there should be something which potentially will make the condition evaluate as `False` next time around. If not the loop will never end.
220 | 5. In this case the last statement in the loop changes the value of i which is part of the condition clause, so hopefully the loop will end.
221 | 6. We called our variable `cur_sum` and not `sum` because `sum` is a builtin function (try typing it in, notice the editor
222 |    changes it to green). If we define `sum = 0` now we can't use the function `sum` in this Python session.
223 | 
224 | ::::::::::::::::::::::::::::::::::::::: challenge
225 | 
226 | ## Exercise - Things that can go wrong with while loops
227 | 
228 | In the examples below, without running them, try to decide why we will not get the required answer.
229 | Run each, one at a time, and then correct them. Remember that when the input marker next to a notebook cell
230 | is [\*], your Python interpreter is still working.
231 | 
232 | ```python
233 | # while loop - summing the numbers 1 to 10
234 | n = 10
235 | cur_sum = 0
236 | # sum of n numbers
237 | i = 0
238 | while i <= n :
239 |     i = i + 1
240 |     cur_sum = cur_sum + i
241 | 
242 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
243 | ```
244 | 
245 | ```python
246 | # while loop - summing the numbers 1 to 10
247 | n = 10
248 | cur_sum = 0
249 | boolvalue = False
250 | # sum of n numbers
251 | i = 0
252 | while i <= n and boolvalue:
253 |     cur_sum = cur_sum + i
254 |     i = i + 1
255 | 
256 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
257 | ```
258 | 
259 | ```python
260 | # while loop - summing the numbers 1 to 10
261 | n = 10
262 | cur_sum = 0
263 | # sum of n numbers
264 | i = 0
265 | while i != n :
266 |     cur_sum = cur_sum + i
267 |     i = i + 1
268 | 
269 | print("The sum of the numbers from 1 to", n, "is ", cur_sum)
270 | ```
271 | 
272 | ```python
273 | # while loop - summing the numbers 1.1 to 9.9 in steps of 1.1
274 | n = 9.9
275 | cur_sum = 0
276 | # sum of n numbers
277 | i = 0
278 | while i != n :
279 |     cur_sum = cur_sum + i
280 |     i = i + 1.1
281 |     print(i)
282 | 
283 | print("The sum of the numbers from 1.1 to", n, "is ", cur_sum)
284 | ```
285 | 
286 | ::::::::::::::: solution
287 | 
288 | ## Solution
289 | 
290 | 1. Because i is incremented before the sum, you are summing 1 to 11.
291 | 2. Because the Boolean value is set to False, the loop will never be executed.
292 | 3. When i does equal 10 the expression is False and the loop does not execute, so we have only summed 1 to 9
293 | 4. Because you cannot guarantee the internal representation of a float, you should never try to compare floats for equality. In this particular case i never 'equals' n and so the loop never ends. If you did try running this, you can stop it using Ctrl\+c in a terminal or going to the kernel menu of a notebook and choosing interrupt.
294 | 
295 | 
296 | 
297 | :::::::::::::::::::::::::
298 | 
299 | ::::::::::::::::::::::::::::::::::::::::::::::::::
300 | 
301 | ## The `for` loop
302 | 
303 | The for loop, like the while loop, repeatedly executes a set of statements. The difference is that in the for loop we know at the outset how often the statements in the loop will be executed. We don't have to rely on a variable being changed within the looping statements.
304 | 
305 | The basic format of the `for` statement is
306 | 
307 | ```python
308 | for variable_name in some_sequence :
309 |     statement1
310 |     statement2
311 |     ...
312 |     statementn
313 | ```
314 | 
315 | The key part of this is the `some_sequence`. The phrase used in the documentation is that it must be 'iterable'. That means, you can count through the sequence, starting at the beginning and stopping at the end.
316 | 
317 | There are many examples of things which are iterable, some of which we have already come across.
318 | 
319 | - Lists are iterable - they don't have to contain numbers, you iterate over the elements in the list.
320 | - The `range()` function 321 | - The characters in a string 322 | 323 | ```python 324 | print("\nExample 1\n") 325 | for i in [1,2,3] : 326 | print(i) 327 | 328 | print("\nExample 2\n") 329 | for name in ["Tom", "Dick", "Harry"] : 330 | print(name) 331 | 332 | print("\nExample 3\n") 333 | for name in ["Tom", 42, 3.142] : 334 | print(name) 335 | 336 | print("\nExample 4\n") 337 | for i in range(3) : 338 | print(i) 339 | 340 | print("\nExample 5\n") 341 | for i in range(1,4) : 342 | print(i) 343 | 344 | print("\nExample 6\n") 345 | for i in range(2, 11, 2) : 346 | print(i) 347 | 348 | print("\nExample 7\n") 349 | for i in "ABCDE" : 350 | print(i) 351 | 352 | print("\nExample 8\n") 353 | longString = "The quick brown fox jumped over the lazy sleeping dog" 354 | for word in longString.split() : 355 | print(word) 356 | ``` 357 | 358 | ```output 359 | Example 1 360 | 361 | 1 362 | 2 363 | 3 364 | 365 | Example 2 366 | 367 | Tom 368 | Dick 369 | Harry 370 | 371 | Example 3 372 | 373 | Tom 374 | 42 375 | 3.142 376 | 377 | Example 4 378 | 379 | 0 380 | 1 381 | 2 382 | 383 | Example 5 384 | 385 | 1 386 | 2 387 | 3 388 | 389 | Example 6 390 | 391 | 2 392 | 4 393 | 6 394 | 8 395 | 10 396 | 397 | Example 7 398 | 399 | A 400 | B 401 | C 402 | D 403 | E 404 | 405 | Example 8 406 | 407 | The 408 | quick 409 | brown 410 | fox 411 | jumped 412 | over 413 | the 414 | lazy 415 | sleeping 416 | dog 417 | ``` 418 | 419 | ::::::::::::::::::::::::::::::::::::::: challenge 420 | 421 | ## Exercise 422 | 423 | Suppose that we have a string containing a set of 4 different types of values separated by `,` like this: 424 | 425 | ```python 426 | variablelist = "01/01/2010,34.5,Yellow,True" 427 | ``` 428 | 429 | Research the `split()` method and then rewrite example 8 from the `for` loop section above so that it prints the 4 components of `variablelist` 430 | 431 | ::::::::::::::: solution 432 | 433 | ## Solution 434 | 435 | ```python 436 | # From the for loop section above 437 | variablelist = "01/01/2010,34.5,Yellow,True" 438 | for word in variablelist.split(",") : 439 | print(word) 440 | ``` 441 | 442 | The format of `variablelist` is very much like that of a record in a csv file. In later episodes we will see how we can extract these values and assign them to variables for further processing rather than printing them out. 443 | 444 | ::::::::::::::::::::::::: 445 | 446 | :::::::::::::::::::::::::::::::::::::::::::::::::: 447 | 448 | :::::::::::::::::::::::::::::::::::::::: keypoints 449 | 450 | - Most programs will require 'Loops' and 'Branching' constructs. 451 | - The `if`, `elif`, `else` statements allow for branching in code. 452 | - The `for` and `while` statements allow for looping through sections of code 453 | - The programmer must provide a condition to end a `while` loop. 454 | 455 | :::::::::::::::::::::::::::::::::::::::::::::::::: 456 | 457 | 458 | -------------------------------------------------------------------------------- /episodes/04-reusable.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Creating re-usable code 3 | teaching: 25 4 | exercises: 15 5 | --- 6 | 7 | ::::::::::::::::::::::::::::::::::::::: objectives 8 | 9 | - Describe the syntax for a user defined function 10 | - Create and use simple functions 11 | - Explain the advantages of using functions 12 | 13 | :::::::::::::::::::::::::::::::::::::::::::::::::: 14 | 15 | :::::::::::::::::::::::::::::::::::::::: questions 16 | 17 | - What are user defined functions? 
18 | - How can I automate my code for re-use?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Defining a function
23 | 
24 | We have already made use of several Python builtin functions like `print`, `list` and `range`.
25 | 
26 | In addition to the functions provided by Python, you can write your own functions.
27 | 
28 | Functions are used when a section of code needs to be repeated at various different points in a program. It saves you re-writing it all. In reality you rarely need to repeat the exact same code. Usually there will be some variation in variable values needed. Because of this, when you create a function you are allowed to specify a set of `parameters` which represent variables in the function.
29 | 
30 | In our use of the `print` function, we have provided whatever we want to `print` as a `parameter`. Typically whenever we use the `print` function, we pass a different `parameter` value.
31 | 
32 | The ability to specify parameters makes functions very flexible.
33 | 
34 | ```python
35 | def get_item_count(items_str,sep):
36 |     '''
37 |     This function takes a string with a list of items and the character that they're separated by and returns the number of items
38 |     '''
39 |     items_list = items_str.split(sep)
40 |     num_items = len(items_list)
41 |     return num_items
42 | 
43 | items_owned = "bicycle;television;solar_panel;table"
44 | print(get_item_count(items_owned,';'))
45 | ```
46 | 
47 | ```output
48 | 4
49 | ```
50 | 
51 | ![](fig/functionAnatomy.png){alt='Function\_anatomy'}
52 | 
53 | Points to note:
54 | 
55 | 1. The definition of a function (or procedure) starts with the `def` keyword and is followed by the name of the function with any parameters used by the function in parentheses.
56 | 2. The definition clause is terminated with a `:` which causes indentation on the next and subsequent lines. All of these lines form the statements which make up the function. The function ends after the indentation is removed.
57 | 3. Within the function, the parameters behave as variables whose initial values will be those that they were given when the function was called.
58 | 4. Functions have a `return` statement which specifies the value to be returned. This is the value assigned to any variable on the left-hand side of the call to the function (in the example above, the returned value of `num_items` is passed straight to `print`).
59 | 5. You call (run the code of) a function simply by providing its name and values for its parameters, the same way you would for any builtin function.
60 | 6. Once the definition of the function has been executed, it becomes part of Python for the current session and can be used anywhere.
61 | 7. Like any other builtin function, you can use `shift` + `tab` in Jupyter to see the parameters.
62 | 8. At the beginning of the function code we have a multiline `comment` denoted by the `'''` at the beginning and end. This kind of comment is known as a `docstring` and can be used anywhere in Python code as a documentation aid. It is particularly common, and indeed best practice, to use them to give a brief description of the function at the beginning of a function definition in this way. This is because this description will be displayed along with the parameters when you use the help() function or `shift` + `tab` in Jupyter.
63 | 9. Variables defined within the function, such as `items_list` and `num_items`, only exist within the function; they cannot be used outside in the main program.
64 | 
65 | In our `get_item_count` function we have two parameters which must be provided every time the function is used.
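For example, both of the following calls return 4; the first passes the parameters by position, the second names them explicitly:

```python
print(get_item_count(items_owned, ';'))
print(get_item_count(items_str=items_owned, sep=';'))
```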
You need to provide the parameters in the right order, or explicitly name the parameter you are referring to and use the `=` sign to give it a value.
66 | 
67 | For many functions we want to provide default values for parameters so that the user doesn't have to. We can do this in the following way:
68 | 
69 | ```python
70 | def get_item_count(items_str,sep=';'):
71 |     '''
72 |     This function takes a string with a list of items and the character that they're separated by and returns the number of items
73 |     '''
74 |     items_list = items_str.split(sep)
75 |     num_items = len(items_list)
76 |     return num_items
77 | 
78 | 
79 | print(get_item_count(items_owned))
80 | ```
81 | 
82 | ```output
83 | 4
84 | ```
85 | 
86 | The only change we have made is to provide a default value for the `sep` parameter. Now if the user does not provide a value, the default value of `;` will be used. Because `items_str` is the first parameter we can specify its value by position. We could, however, have explicitly named the parameters we were referring to.
87 | 
88 | ```python
89 | print(get_item_count(items_owned, sep = ','))
90 | print(get_item_count(items_str = items_owned, sep=';'))
91 | ```
92 | 
93 | ```output
94 | 1
95 | 4
96 | ```
97 | 
98 | ::::::::::::::::::::::::::::::::::::::: challenge
99 | 
100 | ## Volume of a cuboid
101 | 
102 | 1. Write a function definition to calculate the volume of a cuboid. The function will use three parameters `h`, `w`
103 |    and `l` and return the volume.
104 | 
105 | 2. Supposing that in addition to the volume I also wanted to calculate the surface area and the sum of all of the edges. Would I (or should I) have three separate functions or could I write a single function to provide all three values together?
106 | 
107 | ::::::::::::::: solution
108 | 
109 | ## Solution
110 | 
111 | - A function to calculate the volume of a cuboid could be:
112 | 
113 | ```python
114 | def calculate_vol_cuboid(h, w, len):
115 |     """
116 |     Calculates the volume of a cuboid.
117 |     Takes in h, w, len, that represent height, width, and length of the cuboid.
118 |     Returns the volume.
119 |     """
120 |     volume = h * w * len
121 |     return volume
122 | ```
123 | 
124 | - It depends. As a rule-of-thumb, we want our function to **do one thing and one thing only, and to do it well.**
125 |   If we always have to calculate these three pieces of information, the 'one thing' could be
126 |   'calculate the volume, surface area, and sum of all edges of a cuboid'. Our function would look like this:
127 | 
128 | ```python
129 | # Method 1 - single function
130 | def calculate_cuboid(h, w, len):
131 |     """
132 |     Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
133 | 
134 |     Returns the volume, surface area, and sum of edges of the cuboid.
135 |     """
136 |     volume = h * w * len
137 |     surface_area = 2 * (h * w + h * len + len * w)
138 |     edges = 4 * (h + w + len)
139 |     return volume, surface_area, edges
140 | ```
141 | 
142 | It may be better, however, to break down our function into separate ones - one for each piece of information we are
143 | calculating. Our functions would look like this:
144 | 
145 | ```python
146 | # Method 2 - separate functions
147 | def calc_volume_of_cuboid(h, w, len):
148 |     """
149 |     Calculates the volume of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
150 |     """
151 |     volume = h * w * len
152 |     return volume
153 | 
154 | 
155 | def calc_surface_area_of_cuboid(h, w, len):
156 |     """
157 |     Calculates the surface area of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
158 |     """
159 |     surface_area = 2 * (h * w + h * len + len * w)
160 |     return surface_area
161 | 
162 | 
163 | def calc_sum_of_edges_of_cuboid(h, w, len):
164 |     """
165 |     Calculates the sum of edges of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
166 |     """
167 |     sum_of_edges = 4 * (h + w + len)
168 |     return sum_of_edges
169 | ```
170 | 
171 | We could then rewrite our first solution:
172 | 
173 | ```python
174 | def calculate_cuboid(h, w, len):
175 |     """
176 |     Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
177 | 
178 |     Returns the volume, surface area, and sum of edges of the cuboid.
179 |     """
180 |     volume = calc_volume_of_cuboid(h, w, len)
181 |     surface_area = calc_surface_area_of_cuboid(h, w, len)
182 |     edges = calc_sum_of_edges_of_cuboid(h, w, len)
183 | 
184 |     return volume, surface_area, edges
185 | ```
186 | 
187 | :::::::::::::::::::::::::
188 | 
189 | ::::::::::::::::::::::::::::::::::::::::::::::::::
190 | 
191 | ## Using libraries
192 | 
193 | The functions we have created above only exist for the duration of the session in which they have been defined. If you start a new Jupyter notebook you will have to run the code to define them again.
194 | 
195 | If all of your code is in a single file or notebook this isn't really a problem.
196 | 
197 | There are, however, many thousands of useful functions which other people have written and made available to all Python users by creating libraries (also referred to as packages or modules) of functions.
198 | 
199 | You can find out what all of these libraries are and their contents by visiting the main [python.org](https://www.python.org) site.
200 | 
201 | We need to go through a 2-step process before we can use them in our own programs.
202 | 
203 | Step 1. Use the `pip` command from the command line. `pip` is installed as part of the Python install and is used to fetch the package from the Internet and install it in your Python configuration.
204 | 
205 | ```bash
206 | $ pip install <package-name>
207 | ```
208 | 
209 | pip is a recursive acronym for 'Pip Installs Packages' and is a command line program. Because we are using the Anaconda distribution of Python, all of the packages that we will be using in this lesson are already installed for us, so we can move straight on to step 2.
210 | 
211 | Step 2. In your Python code include an `import package-name` statement. Once this is done, you can use all of the functions contained within the package.
212 | 
213 | As all of these packages are produced by 3rd parties independently of each other, there is the strong possibility that there may be clashes in function names. To allow for this, when you are calling a function from a package that you have imported, you do so by prefixing the function name with the package name. This can make for long-winded function names, so the `import` statement allows you to specify an `alias` for the package name which you must then use instead of the package name.
214 | 
215 | In future episodes, we will be importing the `csv`, `json`, `pandas`, `numpy` and `matplotlib` modules. We will describe their use as we use them.
216 | 
217 | The code that we will use is shown below:
218 | 
219 | ```python
220 | import csv
221 | import json
222 | import pandas as pd
223 | import numpy as np
224 | import matplotlib.pyplot as plt
225 | ```
226 | 
227 | The first two we don't alias as they have short names. The last three we do. Matplotlib is a very large library broken up into what can be thought of as sub-libraries. As we will only be using the functions contained in the `pyplot` sub-library, we can specify that explicitly when we import. This saves time and space. It does not affect how we call the functions in our code.
228 | 
229 | The `alias` we use (specified after the `as` keyword) is entirely up to us. However, those shown here for `pandas`, `numpy` and `matplotlib` are nearly universally adopted conventions used for these popular libraries. If you search for code examples for these libraries on the Internet, you will see these aliases used most of the time.
230 | 
231 | :::::::::::::::::::::::::::::::::::::::: keypoints
232 | 
233 | - Functions are used to create re-usable sections of code
234 | - Using parameters with functions makes them more flexible
235 | - You can use functions written by others by importing the libraries containing them into your code
236 | 
237 | ::::::::::::::::::::::::::::::::::::::::::::::::::
238 | 
239 | 
240 | -------------------------------------------------------------------------------- /episodes/06-date-and-time.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Dates and Time
3 | teaching: 15
4 | exercises: 10
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Describe some of the datetime functions available in Python
10 | - Describe the use of format strings to describe the layout of a date and/or time string
11 | - Make use of date arithmetic
12 | 
13 | ::::::::::::::::::::::::::::::::::::::::::::::::::
14 | 
15 | :::::::::::::::::::::::::::::::::::::::: questions
16 | 
17 | - How are dates and times represented in Python?
18 | - How can I manipulate dates and times?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Dates and Times in Python
23 | 
24 | Python can be very flexible in how it interprets 'strings' which you want to be considered as a date, time, or date and time, but you have to tell Python how the various parts of the date and/or time are represented in your 'string'. You can do this by creating a `format`. In a `format`, different case-sensitive characters preceded by the `%` character act as placeholders for parts of the date/time, for example `%Y` represents the year formatted as a 4-digit number such as 2014.
25 | 
26 | A full list of the characters used and what they represent can be found towards the end of the [datetime](https://docs.python.org/3/library/datetime.html) section of the official Python documentation.
27 | 
28 | There is a `today()` method which allows you to get the current date and time.
29 | By default it is displayed in a format similar to the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) standard format.
30 | 
31 | To use the date and time functions you need to import the `datetime` module.
32 | 
33 | ```python
34 | from datetime import datetime
35 | 
36 | today = datetime.today()
37 | print('ISO :', today)
38 | ```
39 | 
40 | ```output
41 | ISO : 2018-04-12 16:19:17.177441
42 | ```
43 | 
44 | We can use our own formatting instead. For example, if we wanted words instead of numbers and the 4-digit year at the end, we could use the following.
45 | 
46 | ```python
47 | format = "%a %b %d %H:%M:%S %Y"
48 | 
49 | today_str = today.strftime(format)
50 | print('strftime:', today_str)
51 | print(type(today_str))
52 | 
53 | today_date = datetime.strptime(today_str, format)
54 | print('strptime:', today_date.strftime(format))
55 | print(type(today_date))
56 | ```
57 | 
58 | ```output
59 | strftime: Thu Apr 12 16:19:17 2018
60 | <class 'str'>
61 | strptime: Thu Apr 12 16:19:17 2018
62 | <class 'datetime.datetime'>
63 | ```
64 | 
65 | `strftime` converts a datetime object to a string and `strptime` creates a datetime object from a string.
66 | When you print them using the same format string, they look the same.
67 | 
68 | The format of the date fields in the SAFI\_results.csv file has been generated automatically to conform to the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) standard.
69 | 
70 | When we read the file and extract the date fields, they are of type string. Before we can use them as dates, we need to convert them into Python date objects.
71 | 
72 | In the format string we use below, the `-`, `:`, `T` and `Z` characters are just that, characters in the string representing the date/time.
73 | Only the characters preceded by `%` have special meanings.
74 | 
75 | Having converted the strings to datetime objects, there are a variety of methods that we can use to extract different components of the date/time.
76 | 
77 | ```python
78 | from datetime import datetime
79 | 
80 | 
81 | format = "%Y-%m-%dT%H:%M:%S.%fZ"
82 | f = open('SAFI_results.csv', 'r')
83 | 
84 | # skip the header line
85 | line = f.readline()
86 | 
87 | # next line has data
88 | line = f.readline()
89 | 
90 | strdate_start = line.split(',')[3] # A04_start
91 | strdate_end = line.split(',')[4] # A05_end
92 | 
93 | print(type(strdate_start), strdate_start)
94 | print(type(strdate_end), strdate_end)
95 | 
96 | 
97 | # the full date and time
98 | datetime_start = datetime.strptime(strdate_start, format)
99 | print(type(datetime_start))
100 | datetime_end = datetime.strptime(strdate_end, format)
101 | 
102 | print('formatted date and time', datetime_start)
103 | print('formatted date and time', datetime_end)
104 | 
105 | 
106 | # the date component
107 | date_start = datetime.strptime(strdate_start, format).date()
108 | print(type(date_start))
109 | date_end = datetime.strptime(strdate_end, format).date()
110 | 
111 | print('formatted start date', date_start)
112 | print('formatted end date', date_end)
113 | 
114 | # the time component
115 | time_start = datetime.strptime(strdate_start, format).time()
116 | print(type(time_start))
117 | time_end = datetime.strptime(strdate_end, format).time()
118 | 
119 | print('formatted start time', time_start)
120 | print('formatted end time', time_end)
121 | 
122 | 
123 | f.close()
124 | ```
125 | 
126 | ```output
127 | <class 'str'> 2017-03-23T09:49:57.000Z
128 | <class 'str'> 2017-04-02T17:29:08.000Z
129 | <class 'datetime.datetime'>
130 | formatted date and time 2017-03-23 09:49:57
131 | formatted date and time 2017-04-02 17:29:08
132 | <class 'datetime.date'>
133 | formatted start date 2017-03-23
134 | formatted end date 2017-04-02
135 | <class 'datetime.time'>
136 | formatted start time 09:49:57
137 | formatted end time 17:29:08
138 | ```
139 | 
140 | ## Components of dates and times
141 | 
142 | For a date or time we can also extract its individual components.
143 | They are held internally in the datetime data structure.
144 | 
145 | ```python
146 | # date parts.
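# year, month and day are available as integer attributes of the date object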
147 | print('formatted end date', date_end)
148 | print(' end date year', date_end.year)
149 | print(' end date month', date_end.month)
150 | print(' end date day', date_end.day)
151 | print(type(date_end.day))
152 | 
153 | # time parts.
154 | 
155 | print('formatted end time', time_end)
156 | print(' end time hour', time_end.hour)
157 | print(' end time minutes', time_end.minute)
158 | print(' end time seconds', time_end.second)
159 | print(type(time_end.second))
160 | ```
161 | 
162 | ```output
163 | formatted end date 2017-04-02
164 |  end date year 2017
165 |  end date month 4
166 |  end date day 2
167 | <class 'int'>
168 | formatted end time 17:29:08
169 |  end time hour 17
170 |  end time minutes 29
171 |  end time seconds 8
172 | <class 'int'>
173 | ```
174 | 
175 | ## Date arithmetic
176 | 
177 | We can also do arithmetic with the dates.
178 | 
179 | ```python
180 | date_diff = datetime_end - datetime_start
181 | date_diff
182 | print(type(datetime_start))
183 | print(type(date_diff))
184 | print(date_diff)
185 | 
186 | date_diff = datetime_start - datetime_end
187 | print(type(date_diff))
188 | print(date_diff)
189 | ```
190 | 
191 | ```output
192 | <class 'datetime.datetime'>
193 | <class 'datetime.timedelta'>
194 | 10 days, 7:39:11
195 | <class 'datetime.timedelta'>
196 | -11 days, 16:20:49
197 | ```
198 | 
199 | ::::::::::::::::::::::::::::::::::::::: challenge
200 | 
201 | ## Exercise
202 | 
203 | How do you interpret the last result?
204 | 
205 | 
206 | ::::::::::::::::::::::::::::::::::::::::::::::::::
207 | 
208 | The code below calculates the time difference between supposedly starting the survey and ending the survey (for each respondent).
209 | 
210 | ```python
211 | from datetime import datetime
212 | 
213 | format = "%Y-%m-%dT%H:%M:%S.%fZ"
214 | 
215 | f = open('SAFI_results.csv', 'r')
216 | 
217 | line = f.readline()
218 | 
219 | for line in f:
220 |     #print(line)
221 |     strdate_start = line.split(',')[3]
222 |     strdate_end = line.split(',')[4]
223 | 
224 |     datetime_start = datetime.strptime(strdate_start, format)
225 |     datetime_end = datetime.strptime(strdate_end, format)
226 |     date_diff = datetime_end - datetime_start
227 |     print(datetime_start, datetime_end, date_diff)
228 | 
229 | 
230 | f.close()
231 | ```
232 | 
233 | ```output
234 | 2017-03-23 09:49:57 2017-04-02 17:29:08 10 days, 7:39:11
235 | 2017-04-02 09:48:16 2017-04-02 17:26:19 7:38:03
236 | 2017-04-02 14:35:26 2017-04-02 17:26:53 2:51:27
237 | 2017-04-02 14:55:18 2017-04-02 17:27:16 2:31:58
238 | 2017-04-02 15:10:35 2017-04-02 17:27:35 2:17:00
239 | 2017-04-02 15:27:25 2017-04-02 17:28:02 2:00:37
240 | 2017-04-02 15:38:01 2017-04-02 17:28:19 1:50:18
241 | 2017-04-02 15:59:52 2017-04-02 17:28:39 1:28:47
242 | 2017-04-02 16:23:36 2017-04-02 16:42:08 0:18:32
243 | ...
244 | ```
245 | 
246 | ::::::::::::::::::::::::::::::::::::::: challenge
247 | 
248 | ## Exercise
249 | 
250 | 1. In the `SAFI_results.csv` file the `A01_interview_date` field (index 1) contains a date in the form of 'dd/mm/yyyy'. Read the file and calculate the differences in days (because the interview date is only given to the day) between the `A01_interview_date` values and the `A04_start` values. You will need to create a format string for the `A01_interview_date` field.
251 | 
252 | 2. Looking at the results here and from the previous section of code, do you think the smartphone data entry system for the survey was being used in real time?
253 | 
254 | ::::::::::::::: solution
255 | 
256 | ## Solution
257 | 
258 | ```python
259 | from datetime import datetime
260 | 
261 | format1 = "%Y-%m-%dT%H:%M:%S.%fZ"
262 | format2 = "%d/%m/%Y"
263 | 
264 | f = open('SAFI_results.csv', 'r')
265 | 
266 | line = f.readline()
267 | 
268 | for line in f:
269 |     A01 = line.split(',')[1]
270 |     A04 = line.split(',')[3]
271 | 
272 |     datetime_A04 = datetime.strptime(A04, format1)
273 |     datetime_A01 = datetime.strptime(A01, format2)
274 |     date_diff = datetime_A04 - datetime_A01
275 |     print(datetime_A04, datetime_A01, date_diff.days)
276 | 
277 | f.close()
278 | ```
279 | 
280 | :::::::::::::::::::::::::
281 | 
282 | ::::::::::::::::::::::::::::::::::::::::::::::::::
283 | 
284 | :::::::::::::::::::::::::::::::::::::::: keypoints
285 | 
286 | - Date and time functions in Python come from the datetime library, which needs to be imported
287 | - You can use format strings to have dates/times displayed in any representation you like
288 | - Internally, dates and times are stored in special data structures which allow you to access the component parts of dates and times
289 | 
290 | ::::::::::::::::::::::::::::::::::::::::::::::::::
291 | 
292 | 
293 | -------------------------------------------------------------------------------- /episodes/08-Pandas.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Reading data from a file using Pandas
3 | teaching: 15
4 | exercises: 5
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Explain what a module is and how modules are used in Python
10 | - Describe what the Python Data Analysis Library (pandas) is
11 | - Load the Python Data Analysis Library (pandas)
12 | - Use read\_csv to read tabular data into Python
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - What is Pandas?
19 | - How do I read files using Pandas?
20 | - What is the difference between reading files using Pandas and other methods of reading files?
21 | 
22 | ::::::::::::::::::::::::::::::::::::::::::::::::::
23 | 
24 | ## What is Pandas?
25 | 
26 | pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way.
27 | 
28 | Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes the data analysis tasks more consistent in terms of syntax and therefore aids readability.
29 | 
30 | Particular features of pandas that we will be looking at over this and the next couple of episodes include:
31 | 
32 | - Reading data stored in CSV files (other file formats can be read as well)
33 | - Slicing and subsetting data in Dataframes (tables!)
34 | - Dealing with missing data
35 | - Reshaping data (long -> wide, wide -> long)
36 | - Inserting and deleting columns from data structures
37 | - Aggregating data using data grouping facilities using the split-apply-combine paradigm
38 | - Joining of datasets (after they have been loaded into Dataframes)
39 | 
40 | If you are wondering why I write pandas with a lower case 'p', it is because it is the name of the package and Python is case sensitive.
41 | 
42 | ## Importing the pandas library
43 | 
44 | Importing the pandas library is done in exactly the same way as for any other library.
In almost all examples of Python code using the pandas library, it will have been imported and given an alias of `pd`. We will follow the same convention.
45 | 
46 | ```python
47 | import pandas as pd
48 | ```
49 | 
50 | ## Pandas data structures
51 | 
52 | There are two main data structures used by pandas: the Series and the Dataframe. The Series equates in general to a vector or a list. The Dataframe is equivalent to a table. Each column in a pandas Dataframe is a pandas Series data structure.
53 | 
54 | We will mainly be looking at the Dataframe.
55 | 
56 | We can easily create a pandas Dataframe by reading a .csv file.
57 | 
58 | ## Reading a csv file
59 | 
60 | When we read a csv dataset in base Python we did so by opening the dataset, reading and processing a record at a time and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual data items of information from the records on the programmer.
61 | 
62 | The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.
63 | 
64 | In Pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a Dataframe. If your dataset has column headers in the first record then these can be used as the Dataframe column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.
65 | 
66 | For our examples in this episode we are going to use the SN7577.tab file. This is available for download [here](data/SN7577.tab) and the description of the file is available [here](https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=7577&type=data%20catalogue#!/details)
67 | 
68 | We are going to read in our SN7577.tab file. Although this is a tab-delimited file we will still use the pandas `read_csv` method, but we will explicitly tell the method that the separator is the tab character and not a comma which is the default.
69 | 
70 | ```python
71 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
72 | ```
73 | 
74 | ::::::::::::::::::::::::::::::::::::::: challenge
75 | 
76 | ## Exercise
77 | 
78 | What happens if you forget to specify `sep='\t'` when reading a tab-delimited dataset?
79 | 
80 | ::::::::::::::: solution
81 | 
82 | ## Solution
83 | 
84 | ```python
85 | df_SN7577_oops = pd.read_csv("SN7577.tab")
86 | print(df_SN7577_oops.shape)
87 | print(df_SN7577_oops)
88 | ```
89 | 
90 | If you allow pandas to assume that your columns are separated by commas (the default) and there aren't any, then each record will be treated as a single column. So the shape is given as 1286 rows (correct) but only one column.
91 | When the contents are displayed, the only column name is the complete first record. Notice the `\t` used to represent the tab characters in the output. This is the same format we used to specify the tab separator when we correctly read in the file.
92 | 
93 | :::::::::::::::::::::::::
94 | 
95 | ::::::::::::::::::::::::::::::::::::::::::::::::::
96 | 
97 | ## Getting information about a Dataframe
98 | 
99 | You can find out the type of the variable `df_SN7577` by using the `type` function.
100 | 
101 | ```python
102 | print(type(df_SN7577))
103 | ```
104 | 
105 | ```output
106 | <class 'pandas.core.frame.DataFrame'>
107 | ```
108 | 
109 | You can see the contents by simply entering the variable name. You can see from the output that it is in a tabular format. The column names have been taken from the first record of the file. On the left-hand side is a column with no name. The entries here have been provided by pandas and act as an index to reference the individual rows of the Dataframe.
110 | 
111 | The `read_csv()` function has an `index_col` parameter which you can use to indicate which of the columns in the file you wish to use as the index instead. As the SN7577 dataset doesn't have a column which would uniquely identify each row we cannot do that.
112 | 
113 | Another thing to notice about the display is that it is truncated. By default you will see the first and last 30 rows. For the columns you will always get the first few columns and typically the last few depending on display space.
114 | 
115 | ```python
116 | df_SN7577
117 | ```
118 | 
119 | Similar information can be obtained with `df_SN7577.head()`. But here you are only returned the first 5 rows by default.
120 | 
121 | ```python
122 | df_SN7577.head()
123 | ```
124 | 
125 | ::::::::::::::::::::::::::::::::::::::: challenge
126 | 
127 | ## Exercise
128 | 
129 | 1. As well as the `head()` method there is a `tail()` method. What do you think it does? Try it.
130 | 2. Both methods accept a single numeric parameter. What do you think it does? Try it.
131 | 
132 | ::::::::::::::::::::::::::::::::::::::::::::::::::
133 | 
134 | You can obtain other basic information about your Dataframe with:
135 | 
136 | ```python
137 | # How many rows?
138 | print(len(df_SN7577))
139 | # How many rows and columns - returned as a tuple
140 | print(df_SN7577.shape)
141 | # How many 'cells' in the table
142 | print(df_SN7577.size)
143 | # What are the column names
144 | print(df_SN7577.columns)
145 | # What are the data types of the columns?
146 | print(df_SN7577.dtypes)
147 | ```
148 | 
149 | ```output
150 | 1286
151 | (1286, 202)
152 | 259772
153 | Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av',
154 |        'Q5avi',
155 |        ...
156 |        'numhhd', 'numkid', 'numkid2', 'numkid31', 'numkid32', 'numkid33',
157 |        'numkid34', 'numkid35', 'numkid36', 'wts'],
158 |       dtype='object', length=202)
159 | Q1     int64
160 | Q2     int64
161 | Q3     int64
162 | ...
163 | Length: 202, dtype: object
164 | ```
165 | 
166 | ::::::::::::::::::::::::::::::::::::::: challenge
167 | 
168 | ## Exercise
169 | 
170 | When we asked for the column names and their data types, the output was abridged, i.e. we didn't get the values for all of the columns.
Can you write a small piece of code which will return all of the values?
171 | 
172 | ::::::::::::::: solution
173 | 
174 | ## Solution
175 | 
176 | ```python
177 | for name in df_SN7577.columns:
178 |     print(name)
179 | ```
180 | 
181 | :::::::::::::::::::::::::
182 | 
183 | ::::::::::::::::::::::::::::::::::::::::::::::::::
184 | 
185 | :::::::::::::::::::::::::::::::::::::::: keypoints
186 | 
187 | - pandas is a Python library containing functions and data structures to assist in data analysis
188 | - pandas data structures are the Series (like a vector) and the Dataframe (like a table)
189 | - the pandas `read_csv` function allows you to read an entire `csv` file into a Dataframe
190 | 
191 | ::::::::::::::::::::::::::::::::::::::::::::::::::
192 | 
193 | 
194 | -------------------------------------------------------------------------------- /episodes/09-extracting-data.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Extracting rows and columns
3 | teaching: 15
4 | exercises: 15
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Define indexing as it relates to data structures
10 | - Select specific columns from a data frame
11 | - Select specific rows from a data frame based on conditional expressions
12 | - Use indexes to access rows and columns
13 | - Copy a data frame
14 | - Add columns to a data frame
15 | - Analyse datasets having missing/null values
16 | 
17 | ::::::::::::::::::::::::::::::::::::::::::::::::::
18 | 
19 | :::::::::::::::::::::::::::::::::::::::: questions
20 | 
21 | - How can I extract specific rows and columns from a Dataframe?
22 | - How can I add or delete columns from a Dataframe?
23 | - How can I find and change missing values in a Dataframe?
24 | 
25 | ::::::::::::::::::::::::::::::::::::::::::::::::::
26 | 
27 | We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook, make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.
28 | 
29 | ```python
30 | import pandas as pd
31 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
32 | ```
33 | 
34 | ### Selecting rows and columns from a pandas Dataframe
35 | 
36 | If we know which columns we want before we read the data from the file, we can tell `read_csv()` to only import those columns by specifying the columns we want, either by their index numbers (starting at 0) or by their names, as a list to the `usecols` parameter.
37 | 
38 | ```python
39 | df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= [0,1,2,173,174,175])
40 | print(df_SN7577_some_cols.shape)
41 | print(df_SN7577_some_cols.columns)
42 | df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= ['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'])
43 | print(df_SN7577_some_cols.columns)
44 | ```
45 | 
46 | ```output
47 | (1286, 6)
48 | Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
49 | Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
50 | ```
51 | 
52 | Let us assume for now that we read in the complete file, which is now in the Dataframe `df_SN7577`. How can we now refer to specific columns?
53 | 
54 | There are two ways of doing this using the column names (or labels):
55 | 
56 | ```python
57 | # Both of these statements are the same
58 | print(df_SN7577['Q1'])
59 | # and
60 | print(df_SN7577.Q1)
61 | ```
62 | 
63 | ```output
64 | 0     1
65 | 1     3
66 | 2    10
67 | 3     9
68 | ...
69 | ```
70 | 
71 | If we are interested in more than one column, the 2nd method above cannot be used. However, in the first, although we used a string with the value of `'Q1'`, we could also have provided a list of strings. Remember that lists are enclosed in `[]`.
72 | 
73 | ```python
74 | print(df_SN7577[['Q1', 'Q2', 'Q3']])
75 | ```
76 | 
77 | ```output
78 |     Q1  Q2  Q3
79 | 0    1  -1   1
80 | 1    3  -1   1
81 | 2   10   3   2
82 | 3    9  -1  10
83 | ...
84 | ```
85 | 
86 | ::::::::::::::::::::::::::::::::::::::: challenge
87 | 
88 | ## Exercise
89 | 
90 | What happens if you:
91 | 
92 | 1. List the columns you want out of order from the way they appear in the file?
93 | 2. Put the same column name in twice?
94 | 3. Put in a non-existing column name? (a.k.a. a typo)
95 | 
96 | ::::::::::::::: solution
97 | 
98 | ## Solution
99 | 
100 | ```python
101 | print(df_SN7577[['Q3', 'Q2']])
102 | print(df_SN7577[['Q3', 'Q2', 'Q3']])
103 | print(df_SN7577[['Q33', 'Q2']])
104 | ```
105 | 
106 | :::::::::::::::::::::::::
107 | 
108 | ::::::::::::::::::::::::::::::::::::::::::::::::::
109 | 
110 | ## Filtering by Rows
111 | 
112 | You can filter the Dataframe by rows by specifying a range in the form of `a:b`. `a` is the first row and `b` is one beyond the last row required.
113 | 
114 | ```python
115 | # select rows with index 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe)
116 | df_SN7577_some_rows = df_SN7577[1:4]
117 | df_SN7577_some_rows
118 | ```
119 | 
120 | ::::::::::::::::::::::::::::::::::::::: challenge
121 | 
122 | ## Exercise
123 | 
124 | What happens if we ask for a single row instead of a range?
125 | 
126 | ::::::::::::::: solution
127 | 
128 | ## Solution
129 | 
130 | ```python
131 | df_SN7577[1]
132 | ```
133 | 
134 | You get an error if you only specify `1`. You need to use `:1` or `0:1` to get the first row returned. The `:` is always required. You can use `:` by itself to return all of the rows.
135 | 
136 | :::::::::::::::::::::::::
137 | 
138 | ::::::::::::::::::::::::::::::::::::::::::::::::::
139 | 
140 | ## Using criteria to filter rows
141 | 
142 | It is more likely that you will want to select rows from the Dataframe based on some criteria, such as "all rows where the value for Q2 is -1".
143 | 
144 | ```python
145 | df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1)]
146 | df_SN7577_some_rows
147 | ```
148 | 
149 | The criteria can be more complex and aren't limited to a single column's values:
150 | 
151 | ```python
152 | df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)]
153 | df_SN7577_some_rows
154 | ```
155 | 
156 | We can combine the row selection with column selection:
157 | 
158 | ```python
159 | df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2','numage']]
160 | df_SN7577_some_rows
161 | ```
162 | 
163 | Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows.
164 | 
165 | There is, however, another way of selecting rows using the row index:
166 | 
167 | ```python
168 | df_SN7577_some_rows = df_SN7577.iloc[1:4]
169 | df_SN7577_some_rows
170 | ```
171 | 
172 | Using the `iloc` method gives the same results as our previous example.
173 | 
174 | However, now we can specify a single value and, more importantly, we can use the `range()` function to indicate the records that we want. This can be useful for making regular, evenly spaced selections of rows from across the Dataframe, such as every 100th row.
175 | 
176 | ```python
177 | # Select the first row from the Dataframe
178 | df_SN7577_some_rows = df_SN7577.iloc[0]
179 | df_SN7577_some_rows
180 | # select every 100th record from the Dataframe.
181 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)]
182 | df_SN7577_some_rows
183 | ```
184 | 
185 | You can also specify column ranges using the `iloc` method, again using the column index numbers:
186 | 
187 | ```python
188 | # columns 0,1,2 and 3
189 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),0:4]
190 | df_SN7577_some_rows
191 | # columns 0,1,2,78 and 95
192 | df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),[0,1,2,78,95]]
193 | df_SN7577_some_rows
194 | ```
195 | 
196 | There is also a `loc` method which allows you to use the column names.
197 | 
198 | ```python
199 | # columns 0,1,2,78 and 95 using the column names and changing 'iloc' to 'loc'
200 | df_SN7577_some_rows = df_SN7577.loc[range(0, len(df_SN7577), 100),['Q1', 'Q2', 'Q3', 'Q18bii', 'access6' ]]
201 | df_SN7577_some_rows
202 | ```
203 | 
204 | ## Sampling
205 | 
206 | Pandas does have a `sample` method which allows you to extract a sample of the records from the Dataframe.
207 | 
208 | ```python
209 | df_SN7577.sample(10, replace=False) # ten records, do not select same record twice (this is the default)
210 | df_SN7577.sample(frac=0.05, random_state=1) # 5% of records, same records if run again
211 | ```
212 | 
213 | :::::::::::::::::::::::::::::::::::::::: keypoints
214 | 
215 | - Import specific columns when reading in a .csv with the `usecols` parameter
216 | - We can easily chain boolean conditions when filtering rows of a pandas dataframe
217 | - The `loc` and `iloc` methods allow us to get rows with particular labels and at particular integer locations respectively
218 | - pandas has a handy `sample` method which allows us to extract a sample of rows from a dataframe
219 | 
220 | ::::::::::::::::::::::::::::::::::::::::::::::::::
221 | 
222 | 
223 | -------------------------------------------------------------------------------- /episodes/10-aggregations.md: -------------------------------------------------------------------------------- 1 | ---
2 | title: Data Aggregation using Pandas
3 | teaching: 20
4 | exercises: 10
5 | ---
6 | 
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 | 
9 | - Access and summarize data stored in a Data Frame
10 | - Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame
11 | - Understand missing data
12 | - Change to and from 'NaN' values
13 | 
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 | 
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 | 
18 | - How can I summarise the data in a data frame?
19 | 
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 | 
22 | ## Using Pandas functions to summarise data in a Data Frame
23 | 
24 | For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique value are present in the dataset.
25 | 
26 | We shall use the SAFI\_results.csv dataset to demonstrate how we can obtain these pieces of information.
27 | 
28 | ```python
29 | import pandas as pd
30 | df_SAFI = pd.read_csv("SAFI_results.csv")
31 | df_SAFI
32 | ```
33 | 
34 | For numeric variables we can obtain a variety of basic statistical information by using the `describe()` method.
35 |
36 | ```python
37 | df_SAFI.describe()
38 | ```
39 |
40 | This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as `NaN`, you will get a warning message.
41 |
42 | You can also use `.describe()` on a single variable.
43 |
44 | ```python
45 | df_SAFI['B_no_membrs'].describe()
46 | ```
47 |
48 | There is also a set of methods which allow us to obtain individual values.
49 |
50 | ```python
51 | print(df_SAFI['B_no_membrs'].min())
52 | print(df_SAFI['B_no_membrs'].max())
53 | print(df_SAFI['B_no_membrs'].mean())
54 | print(df_SAFI['B_no_membrs'].std())
55 | print(df_SAFI['B_no_membrs'].count())
56 | print(df_SAFI['B_no_membrs'].sum())
57 | ```
58 |
59 | ```output
60 | 2
61 | 19
62 | 7.190839694656488
63 | 3.1722704895263734
64 | 131
65 | 942
66 | ```
67 |
68 | Unlike the `describe()` method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.
69 |
70 | We can do the same thing for the `E19_period_use` variable:
71 |
72 | ```python
73 | print(df_SAFI['E19_period_use'].min())
74 | print(df_SAFI['E19_period_use'].max())
75 | print(df_SAFI['E19_period_use'].mean())
76 | print(df_SAFI['E19_period_use'].std())
77 | print(df_SAFI['E19_period_use'].count())
78 | print(df_SAFI['E19_period_use'].sum())
79 | ```
80 |
81 | ```
82 | 1.0
83 | 45.0
84 | 12.043478260869565
85 | 8.583030848015385
86 | 92
87 | 1108.0
88 | ```
89 |
90 | {: output}
91 |
92 | ::::::::::::::::::::::::::::::::::::::: challenge
93 |
94 | ## Exercise
95 |
96 | Compare the count values returned for the `B_no_membrs` and the `E19_period_use` variables.
97 |
98 | 1. Why do you think they are different?
99 | 2. How does this affect the calculation of the mean values?
100 |
101 | ::::::::::::::: solution
102 |
103 | ## Solution
104 |
105 | 1. We know from when we originally displayed the contents of the `df_SAFI` Dataframe that there are 131 rows in it. This matches the value for the `B_no_membrs` count. The count for `E19_period_use`, however, is only 92. If you look at the values in the `E19_period_use` column using
106 |
107 | ```python
108 | df_SAFI['E19_period_use']
109 | ```
110 |
111 | you will see that there are several `NaN` values. They also occurred when we used `describe()` on the full Dataframe. `NaN` stands for Not a Number, i.e. the value is missing. There are only 92 non-missing values and this is what is reported by the `count()` method. This value is also used in the calculation of the mean and std values.
112 |
113 | :::::::::::::::::::::::::
114 |
115 | ::::::::::::::::::::::::::::::::::::::::::::::::::
116 |
117 | ## Dealing with missing values
118 |
119 | We can find out how many variables in our Dataframe contain any `NaN` values with the code
120 |
121 | ```python
122 | df_SAFI.isnull().sum()
123 | ```
124 |
125 | ```
126 | Column1               0
127 | A01_interview_date    0
128 | A03_quest_no          0
129 | A04_start             0
130 | ...
131 | ```
132 |
133 | {: output}
134 |
135 | or for a specific variable
136 |
137 | ```python
138 | df_SAFI['E19_period_use'].isnull().sum()
139 | ```
140 |
141 | ```
142 | 39
143 | ```
144 |
145 | {: output}
146 |
147 | Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.
148 |
149 | We have been using data from two very different sources.
150 |
151 | The SN7577 dataset is provided by the [UK Data Service](https://www.ukdataservice.ac.uk). Datasets from the UK Data Service have already been 'cleaned' and it is unlikely that there will be any genuinely missing data. However you may find that data which was missing has been replaced with a value such as '-1' or 'Not Specified'. In cases like these it may be appropriate to replace these values with 'NaN' before you try to process the data further.
152 |
153 | The SAFI dataset we have been using comes from a project called 'Studying African Farmer-led Irrigation'. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of 'NaN' values in a record (one complete questionnaire) where you would still consider the record to be complete.
154 |
155 | We have already seen how we can check for missing values. There are three other actions we need to be able to do:
156 |
157 | 1. Remove complete rows which contain `NaN`
158 | 2. Replace `NaN` with a value of our choice
159 | 3. Replace specific values with `NaN`
160 |
161 | With these options we can ensure that the data is suitable for the further processing we have planned.
162 |
163 | ### Completely remove rows with NaNs
164 |
165 | The `dropna()` method will delete a row if *any* of its values is `NaN`. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.
166 |
167 | ```python
168 | df_SAFI = pd.read_csv("SAFI_results.csv")
169 | print(df_SAFI.shape)
170 | df_SAFI.dropna(inplace=True)
171 | print(df_SAFI.shape)
172 | ```
173 |
174 | ```
175 | (131, 55)
176 | (0, 55)
177 | ```
178 |
179 | {: output}
180 |
181 | Because there are variables in the SAFI dataset which are all `NaN`, using the `dropna()` method effectively deletes all of the rows from the Dataframe, probably not what you wanted. Instead we can use the `notnull()` method as a row selection criterion and delete the rows where a specific variable has `NaN` values.
182 |
183 | ```python
184 | df_SAFI = pd.read_csv("SAFI_results.csv")
185 | print(df_SAFI.shape)
186 | df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
187 | print(df_SAFI.shape)
188 | ```
189 |
190 | ```
191 | (131, 55)
192 | (39, 55)
193 | ```
194 |
195 | {: output}
196 |
197 | ### Replace NaN with a value of our choice
198 |
199 | The `E19_period_use` variable answers the question: "For how many years have you been irrigating the land?". In some cases the land is not irrigated and these are represented by NaN in the dataset. So when we run
200 |
201 | ```python
202 | df_SAFI['E19_period_use'].describe()
203 | ```
204 |
205 | we get a count value of 92 and all of the other statistics are based on this count value.
206 |
207 | Now suppose that instead of NaN the interviewer had entered a value of 0, to indicate that land which is *not* irrigated has been irrigated for 0 years; technically correct.
208 |
209 | To see what happens we can convert all of the NaN values in the `E19_period_use` column to 0 with the following code:
210 |
211 | ```python
212 | df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].fillna(0)
213 | ```
214 |
215 | If we now run `describe()` again you can see that all of the statistics have changed, because the calculations are now based on a count of 131. Probably not what we would have wanted.
216 |
217 | Conveniently this allows us to demonstrate our 3rd action.
218 |
219 | ### Replace specific values with NaN
220 |
221 | Although we can recognise `NaN` with methods like `isnull()` or `dropna()`, actually creating a `NaN` value and putting it into a Dataframe requires the `numpy` module. The following code will replace our 0 values with `NaN`. We can demonstrate that this has occurred by running `describe()` again and seeing that we now have our original values back.
222 |
223 | ```python
224 | import numpy as np
225 | df_SAFI['E19_period_use'] = df_SAFI['E19_period_use'].replace(0, np.nan)
226 | df_SAFI['E19_period_use'].describe()
227 | ```
228 |
229 | ## Categorical variables
230 |
231 | For categorical variables, numerical statistics don't make any sense.
232 | For a categorical variable we can obtain a list of unique values used by the variable by using the `unique()` method.
233 |
234 | ```python
235 | df_SAFI = pd.read_csv("SAFI_results.csv")
236 | pd.unique(df_SAFI['C01_respondent_roof_type'])
237 | ```
238 |
239 | ```
240 | array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
241 | ```
242 |
243 | {: output}
244 |
245 | Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the `groupby` method.
246 |
247 | Having performed the `groupby()` we can then `describe()` the results. The format is similar to that which we have seen before, except that the 'grouped by' variable appears to the left and there is a set of statistics for each unique value of the variable.
248 |
249 | ```python
250 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
251 | grouped_data.describe()
252 | ```
253 |
254 | You can group by more than one variable at a time by providing them as a list.
255 |
256 | ```python
257 | grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
258 | grouped_data.describe()
259 | ```
260 |
261 | You can also obtain individual statistics if you want.
262 |
263 | ```python
264 | A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
265 | A11_years_farm
266 | ```
267 |
268 | ```
269 | C01_respondent_roof_type  C02_respondent_wall_type
270 | grass                     burntbricks                 22
271 |                           muddaub                     42
272 |                           sunbricks                    9
273 | mabatipitched             burntbricks                  6
274 |                           muddaub                      3
275 | ...
276 | ```
277 |
278 | {: output}
279 |
280 | ::::::::::::::::::::::::::::::::::::::: challenge
281 |
282 | ## Exercise
283 |
284 | 1. Read in the SAFI\_results.csv dataset.
285 | 2. Get a list of the different `C01_respondent_roof_type` values.
286 | 3. Groupby `C01_respondent_roof_type` and describe the results.
287 | 4. Remove rows with NULL values for `E_no_group_count`.
288 | 5. Repeat steps 2 \& 3 and compare the results.
289 |
290 | ::::::::::::::: solution
291 |
292 | ## Solution
293 |
294 | ```python
295 | # Steps 1 and 2
296 | import pandas as pd
297 | df_SAFI = pd.read_csv("SAFI_results.csv")
298 | print(df_SAFI.shape)
299 | print(pd.unique(df_SAFI['C01_respondent_roof_type']))
300 | ```
301 |
302 | ```python
303 | # Step 3
304 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
305 | grouped_data.describe()
306 | ```
307 |
308 | ```python
309 | # Steps 4 and 5
310 | df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
311 | grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
312 | print(df_SAFI.shape)
313 | print(pd.unique(df_SAFI['C01_respondent_roof_type']))
314 | grouped_data.describe()
315 | ```
316 |
317 | `E_no_group_count` is related to whether or not farm plots are irrigated. It has no obvious connection to farm buildings.
318 | By restricting the data to non-irrigated plots we have, perhaps accidentally, removed one of the roof\_types completely.
319 |
320 | :::::::::::::::::::::::::
321 |
322 | ::::::::::::::::::::::::::::::::::::::::::::::::::
323 |
324 | :::::::::::::::::::::::::::::::::::::::: keypoints
325 |
326 | - Summarising numerical and categorical variables is a very common requirement
327 | - Missing data can interfere with how statistical summaries are calculated
328 | - Missing data can be replaced or created depending on requirement
329 | - Summarising or aggregation can be done over single or multiple variables at the same time
330 |
331 | ::::::::::::::::::::::::::::::::::::::::::::::::::
332 |
333 |
334 |
--------------------------------------------------------------------------------
/episodes/11-joins.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Joining Pandas Dataframes
3 | teaching: 25
4 | exercises: 10
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Understand why we would want to join Dataframes
10 | - Know what is needed for a join to be possible
11 | - Understand the different types of joins
12 | - Understand what the joined results tell us about our data
13 |
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 |
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 |
18 | - How can I join two Dataframes with a common key?
19 |
20 | ::::::::::::::::::::::::::::::::::::::::::::::::::
21 |
22 | ## Joining Dataframes
23 |
24 | ### Why do we want to do this?
25 |
26 | There are many occasions when we have related data spread across multiple files.
27 |
28 | The data can be related to each other in different ways. How they are related and how completely we can join the data
29 | from the datasets will vary.
30 |
31 | In this episode we will consider different scenarios and show how we might join the data. We will use csv files and in all
32 | cases the first step will be to read the datasets into a pandas Dataframe from where we will do the joining. The csv
33 | files we are using are cut down versions of the SN7577 dataset to make the displays more manageable.
34 |
35 | First, let's download the datafiles. They are listed in the [setup page][setup-page] for the lesson. Alternatively,
36 | you can download the [GitHub repository for this lesson][gh-repo]. The data files are in the
37 | *data* directory. If you're using Jupyter, make sure to place these files in the same directory where your notebook
38 | file is.
39 |
40 | ### Scenario 1 - Two data sets containing the same columns but different rows of data
41 |
42 | Here we want to add the rows from one Dataframe to the rows of the other Dataframe.
43 | In order to do this we can use the `pd.concat()` function.
44 |
45 | ```python
46 | import pandas as pd
47 |
48 | df_SN7577i_a = pd.read_csv("SN7577i_a.csv")
49 | df_SN7577i_b = pd.read_csv("SN7577i_b.csv")
50 | ```
51 |
52 | Have a quick look at what these Dataframes look like with
53 |
54 | ```python
55 | print(df_SN7577i_a)
56 | print(df_SN7577i_b)
57 | ```
58 |
59 | ```
60 |    Id  Q1  Q2  Q3  Q4
61 | 0   1   1  -1   1   8
62 | 1   2   3  -1   1   4
63 | 2   3  10   3   2   6
64 | 3   4   9  -1  10  10
65 | ...
66 |
67 |      Id  Q1  Q2  Q3  Q4
68 | 0  1277  10  10   4   6
69 | 1  1278   2  -1   5   4
70 | 2  1279   2  -1   4   5
71 | 3  1280   1  -1   2   3
72 | ...
73 | ```
74 |
75 | {: output}
76 |
77 | The `concat()` function appends the rows from the two Dataframes to create the df\_all\_rows Dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the `index`.
78 |
79 | ```python
80 | df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b])
81 | df_all_rows
82 | ```
83 |
84 | We didn't explicitly set an index for any of the Dataframes we have used. For `df_SN7577i_a` and `df_SN7577i_b` default
85 | indexes would have been created by pandas. When we concatenated the Dataframes, the indexes were also concatenated, resulting in duplicate entries.
86 |
87 | This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.
88 |
89 | ```python
90 | df_all_rows = df_all_rows.reset_index(drop=True)
91 |
92 | # or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
93 | df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b], ignore_index=True)
94 |
95 | df_all_rows
96 | ```
97 |
98 | What if the columns in the Dataframes are not the same?
99 |
100 | ```python
101 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
102 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
103 | df_all_rows = pd.concat([df_SN7577i_aa, df_SN7577i_bb])
104 | df_all_rows
105 | ```
106 |
107 | In this case `df_SN7577i_aa` has no `Q4` column and `df_SN7577i_bb` has no `Q3` column. When they are concatenated, the
108 | resulting Dataframe has a column for `Q3` and `Q4`. For the rows corresponding to `df_SN7577i_aa` the values in the `Q4`
109 | column are missing and denoted by `NaN`. The same applies to `Q3` for the `df_SN7577i_bb` rows.
110 |
111 | ### Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe
112 |
113 | ```python
114 | df_SN7577i_c = pd.read_csv("SN7577i_c.csv")
115 | df_SN7577i_d = pd.read_csv("SN7577i_d.csv")
116 | df_all_cols = pd.concat([df_SN7577i_c, df_SN7577i_d], axis=1)
117 | df_all_cols
118 | ```
119 |
120 | We use the `axis=1` parameter to indicate that it is the columns that need to be joined together. Notice that the `Id`
121 | column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not
122 | necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
123 |
124 | ### Scenario 3 - Using merge to join columns
125 |
126 | We can join columns from two Dataframes using the `merge()` function. This is similar to the SQL 'join' functionality.
127 |
128 | A detailed discussion of different join types is given in the [SQL lesson](./episodes/sql...).
129 |
130 | You specify the type of join you want using the `how` parameter. The default is the `inner` join, which returns the columns from both tables where the `key` or common column values match in both Dataframes.
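Before looking at the full set of options, here is a minimal sketch of how the choice of `how` changes the result, using two small made-up Dataframes (the frames and their values are invented purely for this illustration):

```python
import pandas as pd

# two tiny, made-up Dataframes sharing an 'Id' column
left = pd.DataFrame({'Id': [1, 2, 3], 'Q1': [10, 20, 30]})
right = pd.DataFrame({'Id': [2, 3, 4], 'Q2': [200, 300, 400]})

# inner: only the Ids present in both Dataframes (2 and 3)
print(pd.merge(left, right, how='inner', on='Id'))

# outer: all Ids from both Dataframes, with NaN where one side has no match
print(pd.merge(left, right, how='outer', on='Id'))
```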
132 |
133 | The possible values of the `how` parameter are shown in the picture below (taken from the Pandas documentation).
134 |
135 | ![](fig/pandas_join_types.png){alt='pandas\_join\_types'}
136 |
137 | The different join types behave in the same way as they do in SQL. In Python/pandas, any missing values are shown as `NaN`.
138 |
139 | In order to `merge` the Dataframes we need to identify a column common to both of them.
140 |
141 | ```python
142 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner')
143 | df_cd
144 | ```
145 |
146 | In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to
147 | join on. In this example, that is the `Id` column.
148 |
149 | Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using
150 | the `on` parameter.
151 |
152 | ```python
153 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on='Id')
154 | ```
155 |
156 | In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you
157 | can use the `left_on` and `right_on` parameters to specify them separately.
158 |
159 | ```python
160 | df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', left_on='Id', right_on='Id')
161 | ```
162 |
163 | ::::::::::::::::::::::::::::::::::::::: challenge
164 |
165 | ## Practice with data
166 |
167 | 1. Examine the contents of the `SN7577i_aa` and `SN7577i_bb` csv files using Excel or equivalent.
168 | 2. Using the `SN7577i_aa` and `SN7577i_bb` csv files, create a Dataframe which is the result of an outer join using the `Id` column to join on.
169 | 3. What do you notice about the column names in the new Dataframe?
170 | 4. Using `shift`\+`tab` in Jupyter, examine the possible parameters for the `merge()` function.
171 | 5. Re-write the code so that the column names which are common to both files have suffixes indicating the file from which they come.
172 | 6. If you add the parameter `indicator=True`, what additional information is provided in the resulting Dataframe?
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | ```python
179 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
180 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
181 | df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on='Id')
182 | df_aabb
183 | ```
184 |
185 | ```python
186 | df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
187 | df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
188 | df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on='Id', suffixes=('_aa', '_bb'), indicator=True)
189 | df_aabb
190 | ```
191 |
192 | :::::::::::::::::::::::::
193 |
194 | ::::::::::::::::::::::::::::::::::::::::::::::::::
195 |
196 | [setup-page]: https://datacarpentry.org/python-socialsci/setup.html
197 | [gh-repo]: https://github.com/datacarpentry/python-socialsci/archive/gh-pages.zip
198 |
199 |
200 | :::::::::::::::::::::::::::::::::::::::: keypoints
201 |
202 | - You can join pandas Dataframes in much the same way as you join tables in SQL
203 | - The `concat()` function can be used to concatenate two Dataframes by adding the rows of one to the other
204 | - `concat()` can also combine Dataframes by columns but the `merge()` function is the preferred way
205 | - The `merge()` function is equivalent to the SQL JOIN clause.
  'left', 'right' and 'inner' joins are all possible.
206 |
207 | ::::::::::::::::::::::::::::::::::::::::::::::::::
208 |
209 |
210 |
--------------------------------------------------------------------------------
/episodes/12-long-and-wide.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Wide and long data formats
3 | teaching: 20
4 | exercises: 15
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Explain the difference between long and wide formats and why each might be used
10 | - Illustrate how to change between formats using the `melt()` and `pivot()` methods
11 |
12 | ::::::::::::::::::::::::::::::::::::::::::::::::::
13 |
14 | :::::::::::::::::::::::::::::::::::::::: questions
15 |
16 | - What are long and wide formats?
17 | - Why would I want to change between them?
18 |
19 | ::::::::::::::::::::::::::::::::::::::::::::::::::
20 |
21 | ## Wide and long data formats
22 |
23 | In the SN7577 dataset that we have been using there is a group of columns which record which daily newspapers each respondent reads. Despite un-informative names like 'daily1', each column refers to a current UK daily national or local newspaper.
24 |
25 | Whether the paper is read or not is recorded using the values of 0 or 1 as a boolean indicator. The advantage of using a column for each paper is that, should a respondent read multiple newspapers, all of the required information can still be recorded in a single record.
26 |
27 | Recording information in this *wide* format is not always beneficial when trying to analyse the data.
28 |
29 | Pandas provides methods for converting data from *wide* to *long* format and from *long* to *wide* format.
30 |
31 | The SN7577 dataset does not contain a variable that can be used to uniquely identify a row. This is often referred to as a 'primary key' field (or column).
32 |
33 | A dataset doesn't need to have such a key. None of the work we have done so far has required it.
34 |
35 | When we create a pandas Dataframe by importing a csv file, we have seen that pandas will create an index for the rows. This index can be used a bit like a key field, but as we have seen there can be problems with the index when we concatenate two Dataframes together.
36 |
37 | In the version of SN7577 that we are going to use to demonstrate long and wide formats, we will add a new variable with the name 'Id' and we will restrict the other columns to those starting with the word 'daily'.
38 |
39 | ```python
40 | import pandas as pd
41 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
42 | ```
43 |
44 | We will create a new Dataframe with a single column of 'Id'.
45 |
46 | ```python
47 | # create an 'Id' column
48 | df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])
49 | ```
50 |
51 | Using the range function, we can create values of `Id` starting with 1 and going up to 1286 (remember that the second parameter to `range` is one past the last value used). We have explicitly coded this value because we knew how many rows were in the dataset. If we didn't, we could have used
52 |
53 | ```python
54 | len(df_SN7577.index) + 1
55 | ```
56 |
57 | ```
58 | 1287
59 | ```
60 |
61 | {: output}
62 |
63 | We will create a 2nd Dataframe, based on SN7577 but containing only the columns starting with the word 'daily'.
64 |
65 | There are several ways of doing this; we'll cover the one for which we have already covered all of the prerequisites. We will use the `filter` method of `pandas` with its `like` parameter.
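As a quick illustration of what `filter` with `like` does, here is a minimal sketch on a small made-up Dataframe (the column names are invented for the example):

```python
import pandas as pd

# filter(like='daily') keeps every column whose name contains 'daily'
demo = pd.DataFrame(columns=['Id', 'daily1', 'daily2', 'sunday1'])
print(demo.filter(like='daily').columns)   # only daily1 and daily2 survive
```

Applying the same idea to SN7577: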
66 |
67 | ```python
68 | df_papers2 = df_SN7577.filter(like='daily')
69 | ```
70 |
71 | The value supplied to `like` can occur anywhere in the column name to be matched (and therefore selected).
72 |
73 | ::::::::::::::::::::::::::::::::::::::::: callout
74 |
75 | ## Another way
76 |
77 | If we knew the column numbers and they were all contiguous, we could use the `iloc` method and provide the index values of the range of columns we want.
78 |
79 | ```python
80 | df_papers2 = df_SN7577.iloc[:, 118:143]
81 | ```
82 |
83 | ::::::::::::::::::::::::::::::::::::::::::::::::::
84 |
85 | To create the Dataframe that we will use, we will concatenate the two Dataframes we have created.
86 |
87 | ```python
88 | df_papers = pd.concat([df_papers1, df_papers2], axis=1)
89 | print(df_papers.index)
90 | print(df_papers.columns)
91 | ```
92 |
93 | ```
94 | RangeIndex(start=0, stop=1286, step=1)
95 | Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
96 |        'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
97 |        'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
98 |        'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
99 |        'daily25'],
100 |       dtype='object')
101 | ```
102 |
103 | {: output}
104 |
105 | We use `axis=1` because we are joining by columns; the default is joining by rows (`axis=0`).
106 |
107 | ## From 'wide' to 'long'
108 |
109 | To make the displays more manageable we will use only the first seven 'daily' columns.
110 |
111 | ```python
112 | ## using df_papers
113 | daily_list = df_papers.columns[1:8]
114 |
115 | df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)
116 |
117 | # by default, the new columns created will be called 'variable', which holds the name of the
118 | # 'daily' column, and 'value', which holds that 'daily' value for that 'Id'. So, we will rename the columns
119 |
120 | df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
121 | df_daily_papers_long
122 | ```
123 |
124 | We now have a Dataframe that we can `groupby`.
125 |
126 | We want to `groupby` the `Daily_paper` and then sum the `Value`.
127 |
128 | ```python
129 | a = df_daily_papers_long.groupby('Daily_paper')['Value'].sum()
130 | a
131 | ```
132 |
133 | ```
134 | Daily_paper
135 | daily1      0
136 | daily2     26
137 | daily3     52
138 | ```
139 |
140 | {: output}
141 |
142 | ## From Long to Wide
143 |
144 | The process can be reversed by using the `pivot()` method.
145 | Here we need to indicate which column (or columns) remains fixed (this will become an index in the new Dataframe), which column contains the values which are to become the new column names, and which column contains the values to fill those new columns with.
146 |
147 | In our case we want to use the `Id` column as the fixed column, the `Daily_paper` column contains the column names and the `Value` column contains the values.
148 |
149 | ```python
150 | df_daily_papers_wide = df_daily_papers_long.pivot(index='Id', columns='Daily_paper', values='Value')
151 | ```
152 |
153 | We can change our `Id` index back to an ordinary column with
154 |
155 | ```python
156 | df_daily_papers_wide.reset_index(level=0, inplace=True)
157 | ```
158 |
159 | ::::::::::::::::::::::::::::::::::::::: challenge
160 |
161 | ## Exercise
162 |
163 | 1. Find out how many people take each of the daily newspapers by Title.
164 | 2. Which titles don't appear to be read by anyone?
165 |
166 | There is a file called Newspapers.csv which lists all of the newspaper Titles along with the corresponding 'daily' value.
167 |
168 | Hint: Newspapers.csv contains both daily and Sunday newspapers. You can filter out the Sunday papers with the following code:
169 |
170 | ```python
171 | df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
172 | ```
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | 1. Read in the Newspapers.csv file and keep only the dailies.
179 |
180 | ```python
181 | df_newspapers = pd.read_csv("Newspapers.csv")
182 | df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
183 | df_newspapers
184 | ```
185 |
186 | 2. Create the df\_papers Dataframe as we did before.
187 |
188 | ```python
189 | import pandas as pd
190 | df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
191 | # create an 'Id' column
192 | df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])
193 | df_papers2 = df_SN7577.filter(like='daily')
194 | df_papers = pd.concat([df_papers1, df_papers2], axis=1)
195 | df_papers
196 | ```
197 |
198 | 3. Create a list of all of the dailies; one way would be:
199 |
200 | ```python
201 | daily_list = []
202 | for i in range(1, 26):
203 |     daily_list.append('daily' + str(i))
204 | ```
205 |
206 | 4. Pass the list as the `value_vars` parameter to the `melt()` method.
207 |
208 | ```python
209 | # use melt to create df_daily_papers_long
210 | df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)
211 | # Change the column names
212 | df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
213 | ```
214 |
215 | 5. `merge` the two Dataframes with a left join, because we want all of the Newspaper Titles to be included.
216 |
217 | ```python
218 | df_papers_taken = pd.merge(df_newspapers, df_daily_papers_long, how='left', left_on='Column_name', right_on='Daily_paper')
219 | ```
220 |
221 | 6. Then `groupby` the 'Title' and sum the 'Value'.
222 |
223 | ```python
224 | df_papers_taken.groupby('Title')['Value'].sum()
225 | ```
226 |
227 | :::::::::::::::::::::::::
228 |
229 | ::::::::::::::::::::::::::::::::::::::::::::::::::
230 |
231 | :::::::::::::::::::::::::::::::::::::::: keypoints
232 |
233 | - The `melt()` method can be used to change from wide to long format
234 | - The `pivot()` method can be used to change from long to wide format
235 | - Aggregations are best done from data in the long format.
236 |
237 | ::::::::::::::::::::::::::::::::::::::::::::::::::
238 |
239 |
240 |
--------------------------------------------------------------------------------
/episodes/14-sqlite.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Accessing SQLite Databases'
3 | teaching: 35
4 | exercises: 25
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Use the sqlite3 module to interact with a SQL database
10 | - Access data stored in SQLite using Python
11 | - Describe the difference in interacting with data stored as a CSV file versus in SQLite
12 | - Describe the benefits of accessing data using a database compared to a CSV file
13 |
14 | ::::::::::::::::::::::::::::::::::::::::::::::::::
15 |
16 | :::::::::::::::::::::::::::::::::::::::: questions
17 |
18 | - How can I access database tables using Pandas and Python?
19 | - What are the advantages of storing data in a database?
20 |
21 | ::::::::::::::::::::::::::::::::::::::::::::::::::
22 |
23 | ## Introducing the sqlite3 module
24 |
25 | SQLite is a relational database system. Despite the 'Lite' in the name it can handle databases in excess of a Terabyte. The 'Lite' part really relates to the fact that it is a 'bare bones' system. It provides the mechanisms to create and query databases via a simple command line interface but not much else. In the SQL lesson we used a Firefox plugin to provide a GUI (Graphical User Interface) to the SQLite database engine.
26 |
27 | In this lesson we will use Python code and the sqlite3 module to access the engine. With them we can create, delete and query database tables.
28 |
29 | In practice we spend a lot of the time querying database tables.
30 |
31 | ## Pandas Dataframe vs SQL table
32 |
33 | It is very easy and often very convenient to think of SQL tables and pandas Dataframes as being similar types of objects. All of the data manipulations, slicing, dicing, aggregations and joins associated with SQL and SQL tables can be accomplished with pandas methods operating on a pandas Dataframe.
34 |
35 | The difference is that the pandas Dataframe is held in memory within the Python environment. The SQL table can largely be on disk and when you access it, it is the SQLite database engine which is doing the work. This allows you to work with very large tables which your Python environment may not have the memory to hold completely.
36 |
37 | A typical use case for SQLite databases is to hold large datasets: you use SQL commands from Python to slice and dice, and possibly aggregate, the data within the database system to reduce the size to something that Python can comfortably process, and then return the results to a Dataframe.
38 |
39 | ## Accessing data stored in SQLite using Python
40 |
41 | We will illustrate the use of the `sqlite3` module by connecting to an SQLite database using both core Python and also using pandas.
42 |
43 | The database that we will use is SN7577.sqlite. This contains the data from the SN7577 dataset that we have used in other lessons.
44 |
45 | ## Connecting to an SQLite database
46 |
47 | The first thing we need to do is import the `sqlite3` library. We will import pandas at the same time for convenience.
48 |
49 | ```python
50 | import sqlite3
51 | import pandas as pd
52 | ```
53 |
54 | We will start looking at the sqlite3 library by connecting to an existing database and returning the results of running a query.
55 |
56 | Initially we will do this without using Pandas and then we will repeat the exercise so that you can see the difference.
57 |
58 | The first thing we need to do is to make a connection to the database. An SQLite database is just a file. To make a connection to it we only need to use the sqlite3 `connect()` function and specify the database file as the first parameter.
59 |
60 | The connection is assigned to a variable. You could use any variable name, but 'con' is quite commonly used for this purpose.
61 |
62 | ```python
63 | con = sqlite3.connect('SN7577.sqlite')
64 | ```
65 |
66 | The next thing we need to do is to create a `cursor` for the connection and assign it to a variable. We do this using the `cursor` method of the connection object.
67 |
68 | The cursor allows us to pass SQL statements to the database, have them executed and then get the results back.
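Putting the two steps together, the connection we have just made and the cursor created from it look like this (a minimal sketch, assuming the database file is in the current working directory):

```python
import sqlite3

con = sqlite3.connect('SN7577.sqlite')   # connect to the database file
cur = con.cursor()                       # create a cursor on that connection
```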
69 |
70 | To execute an SQL statement we use the `execute()` method of the cursor object.
71 |
72 | The only parameter we need to pass to `execute()` is a string which contains the SQL query we wish to execute.
73 |
74 | In our example we are passing a literal string. It could have been contained in a string variable. The string can contain any valid SQL query. It could also be a valid DDL statement such as a "CREATE TABLE ...". In this lesson, however, we will confine ourselves to querying existing database tables.
75 |
76 | ```python
77 | cur = con.cursor()
78 | cur.execute("SELECT * FROM SN7577")
79 | ```
80 |
81 | ```
82 | <sqlite3.Cursor object at 0x...>
83 | ```
84 |
85 | {: output}
86 |
87 | The `execute()` method doesn't actually return any data; it just indicates that we want the data provided by running the SELECT statement.
88 |
89 | ::::::::::::::::::::::::::::::::::::::: challenge
90 |
91 | ## Exercise
92 |
93 | 1. What happens if you ask for a non-existent table, a non-existent field within a table, or just make any kind of syntax error?
94 |
95 | ::::::::::::::: solution
96 |
97 | ## Solution
98 |
99 | ```python
100 | cur = con.cursor()
101 | # notice the mistyping of 'SELECT'
102 | cur.execute("SELET * FROM SN7577")
103 | ```
104 |
105 | In all cases an error message is returned. The error message is not from Python but from SQLite. It is the same error message that you would have got had you made the same errors in the SQLite plugin.
106 |
107 | :::::::::::::::::::::::::
108 |
109 | ::::::::::::::::::::::::::::::::::::::::::::::::::
110 |
111 | Before we can make use of the results of the query we need to use the `fetchall()` method of the cursor.
112 |
113 | The `fetchall()` method returns a list. Each item in the list is a tuple containing the values from one row of the table. You can iterate through the items in a tuple in the same way as you would for a list.
114 |
115 | ```python
116 | cur = con.cursor()
117 | cur.execute("SELECT * FROM SN7577")
118 | rows = cur.fetchall()
119 | for row in rows:
120 |     print(row)
121 | ```
122 |
123 | ```
124 | (1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3
125 | ...
126 | ```
127 |
128 | {: output}
129 |
130 | The output is the data only; you do not get the column names.
131 |
132 | The column names are available from the 'description' of the cursor.
133 |
134 | ```python
135 | colnames = []
136 | for description in cur.description:
137 |     colnames.append(description[0])
138 |
139 | print(colnames)
140 | ```
141 |
142 | ```
143 | ['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a',
144 | ...
145 | ```
146 |
147 | {: output}
148 |
149 | One reason for using a database is the size of the data involved. Consequently it may not be practical to use `fetchall()`, as this will return the complete result of your query.
150 |
151 | An alternative is to use the `fetchone()` method, which as the name suggests returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to `fetchone()` will return the next record. When there are no more records it will return `None`.
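This lends itself to a common pattern: keep calling `fetchone()` in a loop until it returns `None`, so that only one row is held in memory at a time. A minimal sketch:

```python
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")

# fetch and process one row at a time until fetchone() returns None
while True:
    row = cur.fetchone()
    if row is None:
        break
    print(row)   # for the sketch we just print each row
```

Calling `fetchone()` just twice shows consecutive rows being returned: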
152 |
153 | ```python
154 | cur = con.cursor()
155 | cur.execute("SELECT * FROM SN7577")
156 | row = cur.fetchone()
157 | print(row)
158 | row = cur.fetchone()
159 | print(row)
160 | ```
161 |
162 | ```
163 | (1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
164 | ```
165 |
166 | {: output}
167 |
168 | ::::::::::::::::::::::::::::::::::::::: challenge
169 |
170 | ## Exercise
171 |
172 | Can you write code to return the first 5 records from the SN7577 table in two different ways?
173 |
174 | ::::::::::::::: solution
175 |
176 | ## Solution
177 |
178 | ```python
179 | import sqlite3
180 | con = sqlite3.connect('SN7577.sqlite')
181 | cur = con.cursor()
182 |
183 | # we can use the SQLite 'limit' clause to restrict the number of rows returned and then use 'fetchall'
184 | cur.execute("SELECT * FROM SN7577 Limit 5")
185 | rows = cur.fetchall()
186 |
187 | for row in rows:
188 |     print(row)
189 |
190 | # we can use 'fetchone' in a for loop
191 | cur.execute("SELECT * FROM SN7577")
192 | for i in range(1, 6):
193 |     print(cur.fetchone())
194 |
195 | # a third way would be to use the 'fetchmany()' method
196 |
197 | cur.execute("SELECT * FROM SN7577")
198 | rows = cur.fetchmany(5)
199 |
200 | for row in rows:
201 |     print(row)
202 | ```
203 |
204 | :::::::::::::::::::::::::
205 |
206 | ::::::::::::::::::::::::::::::::::::::::::::::::::
207 |
208 | ## Using Pandas to read a database table
209 |
210 | When you use Pandas to read a database table, you connect to the database in the same way as before, using the sqlite3 `connect()` function and providing the filename of the database file.
211 |
212 | Pandas has a method `read_sql_query` to which you provide both the string containing the SQL query you wish to run and also the connection variable.
213 |
214 | The results from running the query are placed in a pandas Dataframe with the table column names automatically added.
215 |
216 | ```python
217 | con = sqlite3.connect('SN7577.sqlite')
218 | df = pd.read_sql_query("SELECT * from SN7577", con)
219 |
220 | # verify that the result of the SQL query is stored in the Dataframe
221 | print(type(df))
222 | print(df.shape)
223 | print(df.head())
224 |
225 | con.close()
226 | ```
227 |
228 | ## Saving a Dataframe as an SQLite table
229 |
230 | There may be occasions when it is convenient to save the data in your pandas Dataframe as an SQLite table for future use or for access by other systems. This can be done using the `to_sql()` method.
231 |
232 | ```python
233 | con = sqlite3.connect('SN7577.sqlite')
234 | df = pd.read_sql_query("SELECT * from SN7577", con)
235 |
236 | # select only the rows where the response to Q1 is 10, meaning an undecided voter
237 | df_undecided = df[df.Q1 == 10]
238 | print(df_undecided.shape)
239 |
240 | # Write the new Dataframe to a new SQLite table
241 | df_undecided.to_sql("Q1_undecided", con)
242 |
243 | # If you want to overwrite an existing SQLite table you can use the 'if_exists' parameter
244 | #df_undecided.to_sql("Q1_undecided", con, if_exists="replace")
245 | con.close()
246 | ```
247 |
248 | ```
249 | (335, 202)
250 | ```
251 |
252 | {: output}
253 |
254 | ## Deleting an SQLite table
255 |
256 | If you have created tables in an SQLite database, you may also want to delete them.
257 | You can do this by using the sqlite3 cursor's `execute()` method.
258 |
259 | ```python
260 | con = sqlite3.connect('SN7577.sqlite')
261 | cur = con.cursor()
262 |
263 | cur.execute('drop table if exists Q1_undecided')
264 |
265 | con.close()
266 | ```
267 |
268 | ::::::::::::::::::::::::::::::::::::::: challenge
269 |
270 | ## Exercise
271 |
272 | The code below creates an SQLite table as we have done in previous examples. Run this code to create the table.
273 |
274 | ```python
275 | con = sqlite3.connect('SN7577.sqlite')
276 | df_undecided = df[df.Q1 == 10]
277 | df_undecided.to_sql("Q1_undecided_v2", con)
278 | # leave the connection open; we will use it again in the next step
279 | ```
280 |
281 | Try using the following pandas code to delete (drop) the table.
282 |
283 | ```python
284 | pd.read_sql_query("drop table Q1_undecided_v2", con)
285 | ```
286 |
287 | 1. What happens?
288 | 2. Run this line of code again. What is different?
289 | 3. Can you explain the difference, and does the table now exist or not?
290 |
291 | ::::::::::::::: solution
292 |
293 | ## Solution
294 |
295 | 1. When the line of code is run the first time you get an error message: 'NoneType' object is not iterable.
296 |
297 | 2. When you run it a second time you get a different error message:
298 | DatabaseError: Execution failed on sql 'drop table Q1\_undecided\_v2': no such table: Q1\_undecided\_v2
299 |
300 | 3. The `read_sql_query()` method is designed to send the SQL containing your query to the SQLite execution engine, which will execute the SQL and return the output to pandas, which will create a Dataframe from the results.
301 |
302 | The SQL statement we sent is valid SQL, but it doesn't return rows from a table; it simply reports success or failure (in dropping the table, in this case). The first time we run it the table is deleted and a response to that effect is returned. The response cannot be converted to a Dataframe, hence the first error message, which is a pandas error.
303 |
304 | When we run it for the second time, the table has already been dropped, so this time the error message is from SQLite saying the table didn't exist. Pandas recognises that this is an SQLite error message and simply passes it on to the user.
305 |
306 | The moral of the story: pandas may be better for getting data returned into a Dataframe, but there are some things best left to the sqlite functions directly.
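For completeness, a sketch of the more robust route using the sqlite3 cursor directly, as we did earlier in the episode ('if exists' makes the statement safe to run whether or not the table is still there):

```python
con = sqlite3.connect('SN7577.sqlite')
cur = con.cursor()
cur.execute('drop table if exists Q1_undecided_v2')
con.close()
```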
307 | 308 | ::::::::::::::::::::::::: 309 | 310 | :::::::::::::::::::::::::::::::::::::::::::::::::: 311 | 312 | :::::::::::::::::::::::::::::::::::::::: keypoints 313 | 314 | - The SQLite database system is directly available from within Python 315 | - A database table and a pandas Dataframe can be considered similar structures 316 | - Using pandas to return all of the results from a query is simpler than using sqlite3 alone 317 | 318 | :::::::::::::::::::::::::::::::::::::::::::::::::: 319 | 320 | 321 | -------------------------------------------------------------------------------- /episodes/data/Newspapers.csv: -------------------------------------------------------------------------------- 1 | Column_name,Title 2 | daily1,THE GLASGOW HERALD 3 | daily2,THE INDEPENDENT 4 | daily3,THE DAILY TELEGRAPH 5 | daily4,THE GUARDIAN 6 | daily5,THE FINANCIAL TIMES 7 | daily6,THE TIMES 8 | daily7,THE SCOTSMAN 9 | daily8,DAILY EXPRESS 10 | daily9,DAILY MAIL 11 | daily10,DAILY RECORD 12 | daily11,THE SUN 13 | daily12,DAILY MIRROR 14 | daily13,DAILY STAR 15 | daily14,WESTERN MAIL (WALES) 16 | daily15,BELFAST TELEGRAPH 17 | daily16,IRISH NEWS 18 | daily17,NEWS LETTER (ULSTER) 19 | daily18,THE METRO 20 | daily19,EVENING STANDARD 21 | daily20,I NEWSPAPER 22 | daily21,BROADSHEET 23 | daily22,MID MARKET 24 | daily23,TABLOID 25 | daily24,NONE OF THESE 26 | daily25,DON'T KNOW 27 | sunday1,SUNDAY MAIL (SCOTLAND) 28 | sunday2,THE MAIL ON SUNDAY 29 | sunday3,SUNDAY POST 30 | sunday4,THE INDEPENDENT ON SUNDAY 31 | sunday5,SUNDAY TIMES 32 | sunday6,SUNDAY TELEGRAPH 33 | sunday7,SUNDAY EXPRESS 34 | sunday8,OBSERVER 35 | sunday9,NEWS OF THE WORLD 36 | sunday10,THE PEOPLE 37 | sunday11,SUNDAY MIRROR 38 | sunday12,SUNDAY SPORT 39 | sunday13,SCOTLAND ON SUNDAY 40 | sunday14,DAILY STAR SUNDAY 41 | sunday15,THE SUN (SUNDAY) 42 | sunday16,BROADSHEET 43 | sunday17,MID-MARKET 44 | sunday18,TABLOIDS 45 | sunday19,NONE OF THESE 46 | sunday20,DON'T KNOW 47 | -------------------------------------------------------------------------------- /episodes/data/Newspapers.txt: -------------------------------------------------------------------------------- 1 | daily1,"THE GLASGOW HERALD" 2 | daily2,"THE INDEPENDENT" 3 | daily3,"THE DAILY TELEGRAPH" 4 | daily4,"THE GUARDIAN" 5 | daily5,"THE FINANCIAL TIMES" 6 | daily6,"THE TIMES" 7 | daily7,"THE SCOTSMAN" 8 | daily8,"DAILY EXPRESS" 9 | daily9,"DAILY MAIL" 10 | daily10,"DAILY RECORD" 11 | daily11,"THE SUN" 12 | daily12,"DAILY MIRROR" 13 | daily13,"DAILY STAR" 14 | daily14,"WESTERN MAIL (WALES)" 15 | daily15,"BELFAST TELEGRAPH" 16 | daily16,"IRISH NEWS" 17 | daily17,"NEWS LETTER (ULSTER)" 18 | daily18,"THE METRO" 19 | daily19,"EVENING STANDARD" 20 | daily20,"I NEWSPAPER" 21 | daily21,"BROADSHEET" 22 | daily22,"MID MARKET" 23 | daily23,"TABLOID" 24 | daily24,"NONE OF THESE" 25 | daily25,"DON'T KNOW" 26 | sunday1,"SUNDAY MAIL (SCOTLAND)" 27 | sunday2,"THE MAIL ON SUNDAY" 28 | sunday3,"SUNDAY POST" 29 | sunday4,"THE INDEPENDENT ON SUNDAY" 30 | sunday5,"SUNDAY TIMES" 31 | sunday6,"SUNDAY TELEGRAPH" 32 | sunday7,"SUNDAY EXPRESS" 33 | sunday8,"OBSERVER" 34 | sunday9,"NEWS OF THE WORLD" 35 | sunday10,"THE PEOPLE" 36 | sunday11,"SUNDAY MIRROR" 37 | sunday12,"SUNDAY SPORT" 38 | sunday13,"SCOTLAND ON SUNDAY" 39 | sunday14,"DAILY STAR SUNDAY" 40 | sunday15,"THE SUN (SUNDAY)" 41 | sunday16,"BROADSHEET" 42 | sunday17,"MID-MARKET" 43 | sunday18,"TABLOIDS" 44 | sunday19,"NONE OF THESE" 45 | sunday20,"DON'T KNOW" 46 | 47 | -------------------------------------------------------------------------------- 
/episodes/data/Q1.txt: -------------------------------------------------------------------------------- 1 | 1,Conservative 2 | 2,Labour 3 | 3,Liberal Democrats (Lib Dem) 4 | 4,Scottish/Welsh Nationalist 5 | 5,Green Party 6 | 6,UK Independence Party 7 | 7,British National Party (BNP) 8 | 8,Other 9 | 9,Would not vote 10 | 10,Undecided 11 | 11,Refused 12 | -------------------------------------------------------------------------------- /episodes/data/Q6.txt: -------------------------------------------------------------------------------- 1 | 1,Very interested 2 | 2,Fairly interested 3 | 3,Not very interested 4 | 4,Not at all interested 5 | 5,Don't know 6 | 7 | -------------------------------------------------------------------------------- /episodes/data/Q7.txt: -------------------------------------------------------------------------------- 1 | 1,A great deal 2 | 2,A fair amount 3 | 3,Not very much 4 | 4,Nothing at all 5 | 5,Don't know 6 | -------------------------------------------------------------------------------- /episodes/data/SAFI_crops.csv: -------------------------------------------------------------------------------- 1 | Farm,plot_no,plot_area,crop_no,crop_name 2 | 01,1,0.5,1,maize 3 | 01,2,0.5,1,maize 4 | 01,1,1.0,1,maize 5 | 01,2,1.5,1,tomatoes 6 | 01,3,1.0,1,vegetable 7 | 03,1,1.0,1,maize 8 | 04,1,1.0,1,maize 9 | 04,2,1.0,1,maize 10 | 04,3,1.0,1,sorghum 11 | 05,1,1.5,1,maize 12 | 05,2,0.5,1,maize 13 | 6,1,1.5,1,maize 14 | 7,1,2.0,1,maize 15 | 7,2,1.0,1,tomatoes 16 | 7,3,1.0,1,beans 17 | 7,4,1.0,1,vegetable 18 | 08,1,1.0,1,maize 19 | 08,2,2.0,1,maize 20 | 9,1,1.5,1,maize 21 | 9,2,1.0,1,tomatoes 22 | 9,3,1.0,1,vegetable 23 | 10,1,3.0,1,maize 24 | 10,2,1.5,1,tomatoes 25 | 11,1,1.0,1,maize 26 | 11,2,1.0,1,maize 27 | 12,1,3.5,1,maize 28 | 12,2,1.5,1,tomatoes 29 | 13,1,12.0,1,maize 30 | 13,2,1.0,1,beans 31 | 13,3,1.0,1,vegetable 32 | 13,4,1.0,1,onion 33 | 14,1,1.0,1,maize 34 | 14,2,1.0,1,maize 35 | 14,3,1.0,1,maize 36 | 15,1,15.0,1,maize 37 | 15,2,1.0,1,beans 38 | 15,3,0.5,1,ngogwe 39 | 16,1,3.5,1,maize 40 | 17,1,1.0,1,maize 41 | 17,2,0.3,1,vegetable 42 | 17,3,0.3,1,other 43 | 18,1,1.0,1,maize 44 | 18,2,0.5,1,pigeonpeas 45 | 19,1,5.5,1,maize 46 | 19,2,1.0,1,beans 47 | 20,1,1.5,1,maize 48 | 20,2,0.5,1,beans 49 | 21,1,1.0,1,maize 50 | 21,2,1.0,1,vegetable 51 | 21,3,1.0,1,tomatoes 52 | 21,4,1.0,1,cabbage 53 | 22,1,3.0,1,maize 54 | 23,1,10.0,1,maize 55 | 23,2,1.0,1,amendoim 56 | 24,1,2.0,1,maize 57 | 24,2,2.0,1,tomatoes 58 | 24,3,0.5,1,peanut 59 | 25,1,6.0,1,maize 60 | 25,2,1.0,1,tomatoes 61 | 25,3,1.0,1,cabbage 62 | 26,1,5.0,1,maize 63 | 26,2,1.0,1,beans 64 | 27,1,4.0,1,maize 65 | 27,2,1.0,1,beans 66 | 28,1,1.0,1,maize 67 | 28,2,0.5,1,tomatoes 68 | 28,3,0.5,1,vegetable 69 | 29,1,1.5,1,maize 70 | 29,2,0.2,1,tomatoes 71 | 29,3,0.5,1,beans 72 | 30,1,5.0,1,maize 73 | 31,1,1.0,1,maize 74 | 32,1,10.0,1,maize 75 | 32,2,3.0,1,sorghum 76 | 32,3,0.5,1,sunflower 77 | 32,4,1.0,1,peanut 78 | 32,5,0.5,1,tomatoes 79 | 32,6,0.3,1,onion 80 | 32,7,0.2,1,cabbage 81 | 32,8,0.5,1,vegetable 82 | 33,1,4.0,1,maize 83 | 33,2,2.0,1,tomatoes 84 | 34,1,1.0,1,tomatoes 85 | 34,2,1.5,1,maize 86 | 35,1,1.0,1,tomatoes 87 | 35,2,4.0,1,maize 88 | 36,1,1.0,1,maize 89 | 36,2,2.0,1,tomatoes 90 | 36,3,1.0,1,vegetable 91 | 37,1,1.0,1,maize 92 | 38,1,0.5,1,maize 93 | 38,2,1.0,1,tomatoes 94 | 39,1,7.0,1,maize 95 | 40,1,4.0,1,maize 96 | 40,2,2.0,1,tomatoes 97 | 41,1,5.0,1,maize 98 | 42,1,0.5,1,maize 99 | 42,2,0.5,1,other 100 | 43,1,5.0,1,maize 101 | 43,2,1.0,1,amendoim 102 | 43,2,1.0,2, 103 | 43,3,1.0,1,vegetable 104 | 
43,4,1.0,1,tomatoes 105 | 43,4,1.0,2,peanut 106 | 44,1,1.5,1,maize 107 | 45,1,1.5,1,maize 108 | 45,2,1.0,1,tomatoes 109 | 45,3,1.0,1,beans 110 | 46,1,2.0,1,maize 111 | 46,2,2.0,1,tomatoes 112 | 46,3,1.0,1,vegetable 113 | 46,4,1.0,1,onion 114 | 46,5,0.5,1,beans 115 | 47,1,2.5,1,maize 116 | 47,2,0.5,1,tomatoes 117 | 48,1,3.5,1,maize 118 | 48,1,3.5,2,sorghum 119 | 49,1,3.0,1,maize 120 | 49,1,3.0,2,sorghum 121 | 50,1,0.5,1,maize 122 | 50,1,0.5,2,tomatoes 123 | 50,1,0.5,3,piri_piri 124 | 51,1,1.0,1,maize 125 | 52,1,2.0,1,maize 126 | 21,1,5.0,1,maize 127 | 21,2,1.0,1,beans 128 | 21,2,1.0,2,tomatoes 129 | 21,2,1.0,3,cabbage 130 | 21,2,1.0,4,vegetable 131 | 21,3,1.0,1,maize 132 | 54,1,1.0,1,maize 133 | 54,1,1.0,2,tomatoes 134 | 55,1,9.0,1,maize 135 | 56,1,5.5,1,maize 136 | 56,2,1.0,1,tomatoes 137 | 57,1,1.0,1,maize 138 | 57,2,0.5,1,other 139 | 58,1,3.0,1,beans 140 | 58,2,0.5,1,other 141 | 58,3,10.0,1,maize 142 | 59,1,0.5,1,maize 143 | 59,1,0.5,2,other 144 | 59,2,1.0,1,maize 145 | 60,1,2.0,1,maize 146 | 60,2,1.0,1,tomatoes 147 | 60,3,1.0,1,peanut 148 | 61,1,2.0,1,tomatoes 149 | 61,1,2.0,2,onion 150 | 61,1,2.0,3,cabbage 151 | 61,1,2.0,4,vegetable 152 | 61,1,2.0,5,piri_piri 153 | 61,2,6.0,1,maize 154 | 62,1,0.5,1,maize 155 | 62,2,0.5,1,sorghum 156 | 63,1,1.5,1,maize 157 | 63,1,1.5,2,sesame 158 | 64,1,2.0,1,maize 159 | 64,2,0.5,1,sesame 160 | 65,1,2.0,1,maize 161 | 65,2,1.0,1,tomatoes 162 | 66,1,8.0,1,beans 163 | 66,1,8.0,2,tomatoes 164 | 66,1,8.0,3,onion 165 | 66,1,8.0,4,vegetable 166 | 67,1,6.0,1,maize 167 | 67,1,6.0,2,tomatoes 168 | 67,1,6.0,3,vegetable 169 | 68,1,1.5,1,maize 170 | 68,2,1.5,1,piri_piri 171 | 68,3,2.0,1,maize 172 | 69,1,3.0,1,maize 173 | 69,1,3.0,2,tomatoes 174 | 70,1,8.0,1,maize 175 | 70,1,8.0,2, 176 | 70,2,14.0,1,maize 177 | 70,3,0.2,1,maize 178 | 71,1,4.0,1,maize 179 | 71,2,4.0,1,tomatoes 180 | 127,1,1.0,1,maize 181 | 133,1,8.0,1,maize 182 | 133,1,8.0,2,vegetable 183 | 133,1,8.0,3,peanut 184 | 152,1,12.0,1,maize 185 | 152,1,12.0,2,sorghum 186 | 152,2,2.0,1,beans 187 | 152,2,2.0,2,tomatoes 188 | 152,2,2.0,3,vegetable 189 | 153,1,1.0,1,maize 190 | 155,1,2.0,1,maize 191 | 155,2,1.0,1,other 192 | 178,1,4.0,1,maize 193 | 178,2,1.0,1,beans 194 | 178,2,1.0,2,vegetable 195 | 178,3,2.0,1,maize 196 | 177,1,2.5,1,tomatoes 197 | 177,1,2.5,2,onion 198 | 177,1,2.5,3,vegetable 199 | 177,2,4.8,1,maize 200 | 180,1,6.0,1,maize 201 | 180,1,6.0,2,sorghum 202 | 180,2,1.0,1,tomatoes 203 | 181,1,3.0,1,maize 204 | 181,1,3.0,2,tomatoes 205 | 181,1,3.0,3,vegetable 206 | 182,1,5.0,1,maize 207 | 182,1,5.0,2,beans 208 | 182,1,5.0,3,bananas 209 | 182,1,5.0,4,tomatoes 210 | 182,1,5.0,5,vegetable 211 | 186,1,2.0,1,maize 212 | 186,1,2.0,2,beans 213 | 186,1,2.0,3,tomatoes 214 | 186,1,2.0,4,vegetable 215 | 187,1,3.0,1,beans 216 | 187,1,3.0,2,tomatoes 217 | 187,1,3.0,3,vegetable 218 | 187,2,2.0,1,maize 219 | 195,1,2.0,1,maize 220 | 195,2,0.5,1,tomatoes 221 | 195,3,1.0,1,beans 222 | 196,1,6.0,1,maize 223 | 196,2,1.5,1,tomatoes 224 | 196,2,1.5,2,vegetable 225 | 197,1,6.5,1,maize 226 | 197,2,3.0,1,beans 227 | 197,2,3.0,2,tomatoes 228 | 197,2,3.0,3,vegetable 229 | 198,1,2.0,1,maize 230 | 198,2,1.0,1,other 231 | 201,1,6.0,1,maize 232 | 201,2,2.0,1,maize 233 | 202,1,3.0,1,maize 234 | 202,2,0.5,1,tomatoes 235 | 72,1,2.0,1,maize 236 | 72,2,3.0,1,maize 237 | 72,3,1.5,1,tomatoes 238 | 73,1,0.5,1,tomatoes 239 | 73,2,3.0,1,maize 240 | 73,3,0.5,1,beans 241 | 76,1,6.0,1,maize 242 | 76,2,6.0,1,sorghum 243 | 76,3,6.0,1,peanut 244 | 76,4,3.0,1,cabbage 245 | 76,5,3.0,1,tomatoes 246 | 76,6,3.0,1,vegetable 247 | 83,1,5.0,1,maize 248 | 
83,2,0.5,1,maize 249 | 83,2,0.5,2,other 250 | 85,1,1.0,1,maize 251 | 85,2,1.5,1,beans 252 | 85,2,1.5,2, 253 | 85,3,3.0,1,tomatoes 254 | 89,1,1.5,1,maize 255 | 89,2,3.0,1,maize 256 | 89,2,3.0,2,beans 257 | 89,2,3.0,3,vegetable 258 | 89,2,3.0,4,other 259 | 101,1,4.0,1,maize 260 | 101,2,3.0,1,beans 261 | 101,2,3.0,2,tomatoes 262 | 103,1,12.5,1,maize 263 | 103,1,12.5,2,sorghum 264 | 103,2,1.0,1,beans 265 | 103,2,1.0,2,bananas 266 | 103,2,1.0,3,vegetable 267 | 103,2,1.0,4,other 268 | 102,1,5.0,1,maize 269 | 102,2,2.0,1,tomatoes 270 | 102,2,2.0,2,onion 271 | 78,1,12.0,1,maize 272 | 78,2,1.0,1,maize 273 | 78,2,1.0,2,beans 274 | 78,2,1.0,3,tomatoes 275 | 78,2,1.0,4,onion 276 | 80,1,3.0,1,maize 277 | 80,1,3.0,2,sorghum 278 | 80,1,3.0,3,tomatoes 279 | 80,1,3.0,4,onion 280 | 80,1,3.0,5,vegetable 281 | 104,1,5.0,1,maize 282 | 104,1,5.0,2,sorghum 283 | 104,2,2.0,1,beans 284 | 104,2,2.0,2,tomatoes 285 | 104,3,0.5,1,beans 286 | 104,3,0.5,2,tomatoes 287 | 104,3,0.5,3,vegetable 288 | 104,4,1.0,1,tomatoes 289 | 104,4,1.0,2,baby_corn 290 | 105,1,3.0,1,maize 291 | 105,2,1.0,1,tomatoes 292 | 105,2,1.0,2,vegetable 293 | 106,1,22.0,1,baby_corn 294 | 106,2,16.0,1,piri_piri 295 | 106,3,4.0,1,maize 296 | 106,4,1.0,1,cabbage 297 | 106,5,2.0,1,tomatoes 298 | 109,1,5.0,1,maize 299 | 109,2,0.5,1,pigeonpeas 300 | 109,3,6.0,1,sorghum 301 | 110,1,6.0,1,maize 302 | 110,2,5.0,1,maize 303 | 110,3,3.0,1,maize 304 | 113,1,1.0,1,maize 305 | 113,2,1.0,1,beans 306 | 113,2,1.0,2,onion 307 | 113,3,0.5,1,other 308 | 118,1,3.0,1,maize 309 | 125,1,1.0,1,maize 310 | 125,2,0.5,1,beans 311 | 125,3,3.5,1,tomatoes 312 | 125,3,3.5,2,onion 313 | 125,3,3.5,3,vegetable 314 | 125,3,3.5,4,cucumber 315 | 119,1,2.0,1,maize 316 | 119,1,2.0,2,potatoes 317 | 119,2,0.5,1,maize 318 | 115,1,8.0,1,maize 319 | 115,2,3.0,1,maize 320 | 108,1,2.0,1,maize 321 | 108,1,2.0,2,sorghum 322 | 108,2,2.0,1,maize 323 | 108,3,1.5,1,vegetable 324 | 116,1,5.0,1,maize 325 | 117,1,12.0,1,maize 326 | 144,1,6.0,1,maize 327 | 144,2,1.0,1,tomatoes 328 | 143,1,4.0,1,maize 329 | 143,2,4.0,1,maize 330 | 143,3,5.5,1,beans 331 | 143,3,5.5,2,tomatoes 332 | 143,3,5.5,3,piri_piri 333 | 150,1,2.0,1,maize 334 | 150,2,0.5,1,other 335 | 159,1,2.5,1,maize 336 | 159,2,0.5,1,sorghum 337 | 160,1,2.0,1,maize 338 | 160,2,2.0,1,beans 339 | 160,3,1.5,1,tomatoes 340 | 165,1,1.0,1,maize 341 | 165,2,0.5,1,beans 342 | 166,1,0.5,1,tomatoes 343 | 166,2,1.0,1,maize 344 | 166,3,2.0,1,maize 345 | 167,1,1.5,1,maize 346 | 167,2,0.5,1,tomatoes 347 | 174,1,7.0,1,maize 348 | 174,2,3.0,1,tomatoes 349 | 174,3,1.0,1,tomatoes 350 | 175,1,2.0,1,maize 351 | 175,2,1.0,1,tomatoes 352 | 175,2,1.0,2,vegetable 353 | 189,1,7.0,1,maize 354 | 189,2,3.0,1,piri_piri 355 | 189,3,3.0,1,baby_corn 356 | 191,1,1.0,1,tomatoes 357 | 191,2,5.0,1,maize 358 | 192,1,0.5,1,tomatoes 359 | 192,2,0.5,1,maize 360 | 192,3,0.5,1,baby_corn 361 | 126,1,4.5,1,maize 362 | 126,1,4.5,2,beans 363 | 126,1,4.5,3,peanut 364 | 126,2,1.0,1,tomatoes 365 | 193,1,3.0,1,maize 366 | 193,2,2.5,1,tomatoes 367 | 193,3,0.5,1,cabbage 368 | 193,4,1.5,1,maize 369 | 194,1,5.0,1,maize 370 | 194,2,0.5,1,tomatoes 371 | 199,1,5.0,1,maize 372 | 199,2,1.0,1,vegetable 373 | 200,1,0.5,1,maize 374 | 200,2,4.5,1,maize 375 | -------------------------------------------------------------------------------- /episodes/data/SAFI_grass_roof_burntbricks.csv: -------------------------------------------------------------------------------- 1 | 
Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID 2 | 4,17/11/2016,5,2017-04-02T15:10:35.000Z,2017-04-02T17:27:35.000Z,Province1,District1,Ward2,Village2,18,no,no,40,yes,no,yes,no,7,grass,burntbricks,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,4,2,7,,10,689,-19.11221722,33.48342524,uuid:2c867811-9696-4966-9866-f35c3e97d02d 3 | 8,16/11/2016,9,2017-04-02T16:23:36.000Z,2017-04-02T16:42:08.000Z,Province1,District1,Ward2,Village3,16,no,no,6,yes,no,yes,no,8,grass,burntbricks,,earth,no,2,1,yes,3,yes,yes,6,yes,no,no,,no,never,,3,no,,more_half,yes,yes,,yes,no,3,3,8,,11,701,-19.11221518,33.48343695,uuid:846103d2-b1db-4055-b502-9cd510bb7b37 4 | 12,21/11/2016,13,2017-04-03T03:58:43.000Z,2017-04-03T04:19:36.000Z,Province1,District1,Ward2,Village2,7,yes,no,8,yes,no,yes,no,6,grass,burntbricks,,earth,no,1,1,no,4,yes,yes,7,yes,no,no,,no,never,,4,no,,more_half,yes,no,,yes,no,3,2,6,,15,706,-19.11236935,33.48355635,uuid:6c00c145-ee3b-409c-8c02-2c8d743b6918 5 | 13,21/11/2016,14,2017-04-03T04:19:57.000Z,2017-04-03T04:50:05.000Z,Province1,District1,Ward2,Village2,20,yes,no,20,yes,yes,no,yes,10,grass,burntbricks,,earth,no,3,3,no,3,no,,,,,,,,,3,,,,,yes,yes,,yes,no,3,3,10,,11,698,-19.11222089,33.4834388,uuid:9b21467f-1116-4340-a3b1-1ab64f13c87d 6 | 19,21/11/2016,20,2017-04-03T14:04:50.000Z,2017-04-03T14:20:04.000Z,Province1,District1,Ward2,Village2,24,yes,no,1,yes,yes,yes,yes,6,grass,burntbricks,,earth,yes,1,1,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,yes,1,2,6,,27,700,-19.11147317,33.47619213,uuid:d1005274-bf52-4e79-8380-3350dd7c2bac 7 | 26,21/11/2016,27,2017-04-05T04:59:42.000Z,2017-04-05T05:14:45.000Z,Province1,District1,Ward2,Village1,36,no,no,36,no,no,no,no,7,grass,burntbricks,,earth,no,3,2,yes,2,no,,,,,,,,,2,,,,,no,no,,yes,no,3,3,7,,14,679,-19.0430007,33.40508367,uuid:3197cded-1fdc-4c0c-9b10-cfcc0bf49c4d 8 | 39,17/11/2016,40,2017-04-06T08:44:51.000Z,2017-04-06T09:03:47.000Z,Province1,District1,Ward2,Village2,23,yes,yes,23,no,yes,yes,yes,9,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,23,yes,no,yes,no,no,never,,2,no,,more_half,no,no,,yes,no,1,3,9,,22.112,0,-19.0433618,33.4046671,uuid:c0b34854-eede-4e81-b183-ef58a45bfc34 9 | 56,16/11/2016,57,2017-04-08T06:26:22.000Z,2017-04-08T06:39:40.000Z,Province1,District1,Ward2,Village3,20,yes,no,27,yes,yes,yes,yes,4,grass,burntbricks,,earth,no,4,1,no,2,yes,no,20,yes,no,no,,no,never,,2,no,,less_half,no,no,,no,no,1,2,4,,10,695,-19.11227947,33.48338576,uuid:a7184e55-0615-492d-9835-8f44f3b03a71 10 | 
59,16/11/2016,60,2017-04-08T09:03:01.000Z,2017-04-08T09:20:18.000Z,Province1,District2,Ward2,Village3,12,yes,no,15,yes,yes,yes,yes,8,grass,burntbricks,,earth,no,3,2,no,3,yes,yes,12,yes,no,no,,no,never,,3,no,,more_half,yes,yes,,yes,no,4,2,8,,11,694,-19.11225763,33.48341208,uuid:85465caf-23e4-4283-bb72-a0ef30e30176 11 | 70,18/11/2016,71,2017-04-09T15:00:19.000Z,2017-04-09T15:19:22.000Z,Province1,District1,Ward2,Village1,12,yes,no,14,yes,yes,yes,yes,6,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,1,yes,no,yes,no,no,more_once,,2,no,,more_half,no,yes,,no,no,3,2,6,,4,696,-19.11220603,33.48332601,uuid:761f9c49-ec93-4932-ba4c-cc7b78dfcef1 12 | 71,16/11/2016,127,2017-04-09T05:16:06.000Z,2017-04-09T05:27:41.000Z,Province1,District1,Ward2,Village3,10,yes,no,18,no,no,no,no,4,grass,burntbricks,,earth,no,3,8,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,1,2,4,,8,676,-19.11221396,33.48341359,uuid:f6d04b41-b539-4e00-868a-0f62b427587d 13 | 73,24/11/2016,152,2017-04-09T05:47:31.000Z,2017-04-09T06:16:11.000Z,Province1,District1,Ward2,Village1,15,yes,no,16,yes,no,yes,no,10,grass,burntbricks,,cement,no,3,1,no,2,yes,yes,16,yes,no,yes,no,no,once,,2,no,,abt_half,no,no,,yes,yes,3,3,10,,11,702,-19.1121034,33.48344669,uuid:59738c17-1cda-49ee-a563-acd76f6bc487 14 | 74,24/11/2016,153,2017-04-09T06:16:49.000Z,2017-04-09T06:28:48.000Z,Province1,District1,Ward2,Village1,41,no,no,41,yes,yes,yes,yes,5,grass,burntbricks,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,2,5,,12,691,-19.11223572,33.48345393,uuid:7e7961ca-fa1c-4567-9bfa-a02f876e4e03 15 | 75,24/11/2016,155,2017-04-09T06:35:16.000Z,2017-04-09T06:48:01.000Z,Province1,District1,Ward2,Village2,5,no,no,4,no,no,no,no,4,grass,burntbricks,,earth,no,2,1,no,2,no,,,,,,,,,2,,,,,yes,no,,yes,no,1,2,4,,13,713,-19.11220708,33.48342364,uuid:77b3021b-a9d6-4276-aaeb-5bfcfd413852 16 | 83,28/11/2016,195,2017-04-09T16:13:19.000Z,2017-04-09T16:35:24.000Z,Province1,District1,Ward2,Village2,7,yes,no,48,yes,yes,no,no,5,grass,burntbricks,,earth,no,3,1,no,3,yes,yes,5,yes,no,no,,no,never,,3,no,,more_half,yes,no,,no,no,3,2,5,,9,706,-19.11220559,33.48342104,uuid:2c132929-9c8f-450a-81ff-367360ce2c19 17 | 86,28/11/2016,198,2017-04-09T19:15:21.000Z,2017-04-09T19:27:56.000Z,Province1,District1,Ward2,Village2,11,no,no,49,no,no,no,no,3,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,11,yes,no,no,,no,never,,2,no,,less_half,yes,no,,yes,no,1,3,3,,7,716,-19.11212338,33.48338774,uuid:28c64954-739c-444c-a6e0-355878e471c8 18 | 111,11/05/2017,116,2017-05-11T06:09:56.000Z,2017-05-11T06:22:19.000Z,Province1,District1,Ward2,Village1a,21,yes,no,25,no,no,no,no,5,grass,burntbricks,,cement,no,2,3,yes,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,3,3,5,,20,0,-19.1114691,33.4761047,uuid:cfee6297-2c0e-4f8a-94cc-9aaee0bd64cb 19 | 114,18/05/2017,143,2017-05-18T05:55:04.000Z,2017-05-18T06:37:10.000Z,Province1,District1,Ward2,Village1,24,yes,no,24,yes,no,no,no,10,grass,burntbricks,,earth,no,3,2,yes,3,yes,yes,12,yes,no,no,,no,frequently,,3,no,,more_half,yes,no,,yes,no,3,3,10,,1911,0,-19.1124845,33.4763322,uuid:9a096a12-b335-468c-b3cc-1191180d62de 20 | 118,03/06/2017,165,2017-06-03T05:32:33.000Z,2017-06-03T05:51:49.000Z,Province1,District1,Ward2,Village1,14,no,no,14,no,no,no,no,9,grass,burntbricks,,earth,no,1,1,yes,2,yes,yes,10,yes,no,no,,no,never,,2,no,,less_half,no,no,,yes,no,3,3,9,,9,708,-19.11217437,33.48346513,uuid:62f3f7af-f0f3-4f88-b9e0-acf8baa49ae4 21 | 
121,03/06/2017,174,2017-06-03T06:50:47.000Z,2017-06-03T07:20:21.000Z,Province1,District1,Ward2,Village1,13,no,no,25,yes,yes,yes,yes,12,grass,burntbricks,,cement,no,2,2,yes,3,yes,yes,13,yes,yes,no,,no,never,,3,no,,more_half,yes,no,,yes,no,3,3,12,,3,703,-19.11216751,33.48340539,uuid:43ec6132-478c-4f87-878d-fb3c0c4d0c74 22 | 125,03/06/2017,192,2017-06-03T16:17:55.000Z,2017-06-03T17:16:39.000Z,Province1,District1,Ward2,Village3,15,yes,no,20,yes,yes,yes,yes,9,grass,burntbricks,,cement,yes,4,1,no,3,yes,yes,5,yes,no,no,,no,once,,3,no,,more_half,yes,yes,,yes,no,1,3,9,,4,705,-19.11215404,33.48335905,uuid:f94409a6-e461-4e4c-a6fb-0072d3d58b00 23 | 126,18/05/2017,126,2017-05-18T04:13:37.000Z,2017-05-18T04:35:47.000Z,Province1,District1,Ward2,Village1,5,yes,no,7,yes,yes,yes,yes,3,grass,burntbricks,,earth,no,1,1,no,2,yes,yes,4,yes,no,no,,no,more_once,,2,no,,less_half,yes,yes,,yes,no,3,3,3,,7,700,-19.11219355,33.48337856,uuid:69caea81-a4e5-4e8d-83cd-9c18d8e8d965 24 | -------------------------------------------------------------------------------- /episodes/data/SAFI_grass_roof_muddaub.csv: -------------------------------------------------------------------------------- 1 | Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID 2 | 0,17/11/2016,1,2017-03-23T09:49:57.000Z,2017-04-02T17:29:08.000Z,Province1,District1,Ward2,Village2,11,no,no,4,no,yes,no,yes,3,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,2,3,,14,698,-19.11225943,33.48345609,uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef 3 | 1,17/11/2016,1,2017-04-02T09:48:16.000Z,2017-04-02T17:26:19.000Z,Province1,District1,Ward2,Village2,2,yes,no,9,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,no,3,yes,yes,2,yes,no,yes,no,no,once,,3,no,,more_half,yes,no,,yes,no,3,2,7,,19,690,-19.11247712,33.48341568,uuid:099de9c9-3e5e-427b-8452-26250e840d6e 4 | 5,17/11/2016,6,2017-04-02T15:27:25.000Z,2017-04-02T17:28:02.000Z,Province1,District1,Ward2,Village2,3,no,no,3,no,no,no,no,3,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,2,3,,12,692,-19.1121959,33.48339187,uuid:daa56c91-c8e3-44c3-a663-af6a49a2ca70 5 | 6,17/11/2016,7,2017-04-02T15:38:01.000Z,2017-04-02T17:28:19.000Z,Province1,District1,Ward2,Village2,20,no,no,38,yes,no,yes,no,6,grass,muddaub,,earth,no,1,1,yes,4,yes,yes,10,yes,no,no,,no,never,,4,no,,more_half,no,no,,no,no,1,3,6,,11,709,-19.11221904,33.48336498,uuid:ae20a58d-56f4-43d7-bafa-e7963d850844 6 | 15,24/11/2016,16,2017-04-03T05:29:24.000Z,2017-04-03T05:40:53.000Z,Province1,District1,Ward2,Village2,24,yes,no,47,yes,yes,yes,yes,6,grass,muddaub,,earth,yes,2,1,yes,1,no,,,,,,,,,1,,,,,yes,yes,,no,no,4,3,6,,9,709,-19.11210678,33.48344397,uuid:d17db52f-4b87-4768-b534-ea8f9704c565 7 | 
17,21/11/2016,18,2017-04-03T12:27:04.000Z,2017-04-03T12:39:48.000Z,Province1,District1,Ward2,Village2,6,no,no,20,yes,yes,no,yes,4,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,no,3,2,4,,17,685,-19.11133515,33.47630848,uuid:7ffe7bd1-a15c-420c-a137-e1f006c317a3 8 | 21,21/11/2016,22,2017-04-03T16:28:52.000Z,2017-04-03T16:40:47.000Z,Province1,District1,Ward2,Village2,14,no,no,20,yes,yes,yes,yes,4,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,2,4,,9,722,-19.11212691,33.48350764,uuid:a51c3006-8847-46ff-9d4e-d29919b8ecf9 9 | 27,21/11/2016,28,2017-04-05T05:14:49.000Z,2017-04-05T05:36:18.000Z,Province1,District1,Ward2,Village1,2,no,no,2,no,no,no,no,2,grass,muddaub,,earth,no,1,1,yes,3,yes,yes,2,yes,no,no,,no,more_once,,3,no,,more_half,no,yes,,yes,no,1,3,2,,7,721,-19.04290893,33.40506932,uuid:1de53318-a8cf-4736-99b1-8239f8822473 10 | 29,21/11/2016,30,2017-04-05T06:05:58.000Z,2017-04-05T06:20:39.000Z,Province1,District1,Ward2,Village1,22,yes,yes,22,no,no,no,no,7,grass,muddaub,,earth,no,1,2,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,2,7,,5,669,-19.04300478,33.40505449,uuid:59341ead-92be-45a9-8545-6edf9f94fdc6 11 | 30,21/11/2016,31,2017-04-05T06:21:20.000Z,2017-04-05T06:38:26.000Z,Province1,District1,Ward2,Village1,15,yes,no,2,yes,yes,yes,yes,3,grass,muddaub,,earth,no,7,1,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,1,3,3,,4,704,-19.04302176,33.40509382,uuid:cb06eb49-dd39-4150-8bbe-a599e074afe8 12 | 32,21/11/2016,33,2017-04-05T08:08:19.000Z,2017-04-05T08:25:48.000Z,Province1,District1,Ward2,Village1,20,yes,no,34,yes,yes,yes,yes,8,grass,muddaub,,cement,no,2,1,no,2,yes,yes,20,yes,no,no,,no,more_once,,2,no,,more_half,no,no,,yes,no,2,2,8,,5,695,-19.04414887,33.40383602,uuid:0fbd2df1-2640-4550-9fbd-7317feaa4758 13 | 34,17/11/2016,35,2017-04-05T16:22:13.000Z,2017-04-05T16:50:25.000Z,Province1,District1,Ward2,Village3,45,yes,no,45,no,no,no,no,5,grass,muddaub,,earth,no,3,1,no,2,yes,yes,20,yes,no,yes,no,no,more_once,,2,no,,more_half,no,yes,,yes,no,2,3,5,,11,733,-19.11211362,33.48342515,uuid:ff7496e7-984a-47d3-a8a1-13618b5683ce 14 | 37,17/11/2016,38,2017-04-05T17:28:12.000Z,2017-04-05T17:50:57.000Z,Province1,District1,Ward2,Village2,19,yes,yes,19,no,no,no,no,10,grass,muddaub,,earth,no,3,1,no,2,yes,yes,9,yes,no,yes,no,no,never,,2,no,,more_half,no,yes,,yes,yes,3,3,10,,9,696,-19.11222939,33.48337467,uuid:81309594-ff58-4dc1-83a7-72af5952ee08 15 | 40,17/11/2016,41,2017-04-06T09:03:50.000Z,2017-04-06T09:14:05.000Z,Province1,District1,Ward2,Village2,22,yes,no,22,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,no,,no,no,2,3,7,,13,679,-19.04339398,33.40485363,uuid:b3ba34d8-eea1-453d-bc73-c141bcbbc5e5 16 | 42,17/11/2016,43,2017-04-06T09:31:56.000Z,2017-04-06T09:53:53.000Z,Province1,District1,Ward2,Village3,3,no,no,29,no,no,no,no,7,grass,muddaub,,earth,no,2,1,no,4,yes,no,3,yes,no,no,,no,never,,4,no,,abt_half,yes,no,,no,no,2,2,7,,30,605,-19.04303063,33.40472726,uuid:b4dff49f-ef27-40e5-a9d1-acf287b47358 17 | 43,17/11/2016,44,2017-04-06T14:44:32.000Z,2017-04-06T14:53:01.000Z,Province1,District1,Ward2,Village3,3,no,no,6,no,no,no,no,2,grass,muddaub,,earth,no,2,1,no,1,no,,,,,,,,,1,,,,,yes,no,,yes,no,3,2,2,,11,716,-19.04315107,33.40458039,uuid:f9fadf44-d040-4fca-86c1-2835f79c4952 18 | 44,17/11/2016,45,2017-04-06T14:53:04.000Z,2017-04-06T15:11:57.000Z,Province1,District1,Ward2,Village3,25,yes,no,7,no,no,no,no,9,grass,muddaub,,earth,no,2,1,no,3,yes,yes,20,yes,no,no,,no,never,,3,yes,,more_half,yes,no,,no,no,4,3,9,,28,703,-19.04312371,33.40466493,uuid:e3554d22-35b1-4fb9-b386-dd5866ad5792 19 
| 46,17/11/2016,47,2017-04-07T14:05:25.000Z,2017-04-07T14:19:45.000Z,Province1,District1,Ward2,Village3,2,yes,no,2,yes,yes,yes,yes,2,grass,muddaub,,earth,no,1,1,yes,2,yes,yes,2,yes,no,yes,no,no,once,,2,no,,more_half,no,no,,no,no,1,3,2,,5,689,-19.11226093,33.48339791,uuid:2d0b1936-4f82-4ec3-a3b5-7c3c8cd6cc2b 20 | 47,16/11/2016,48,2017-04-07T14:19:49.000Z,2017-04-07T14:40:23.000Z,Province1,District1,Ward2,Village3,48,yes,no,58,yes,no,yes,no,7,grass,muddaub,,earth,no,1,1,no,1,no,,,,,,,,,1,,,,,no,no,,yes,no,3,3,7,,12,689,-19.11222978,33.48353345,uuid:e180899c-7614-49eb-a97c-40ed013a38a2 21 | 49,16/11/2016,50,2017-04-07T14:56:01.000Z,2017-04-07T15:26:23.000Z,Province1,District1,Ward2,Village3,6,yes,no,7,no,no,no,no,6,grass,muddaub,,earth,no,1,1,no,1,yes,yes,1,no,no,yes,no,no,never,,1,no,,abt_half,yes,no,,yes,no,1,2,6,,12,718,-19.11220496,33.48344521,uuid:4267c33c-53a7-46d9-8bd6-b96f58a4f92c 22 | 50,16/11/2016,51,2017-04-07T15:27:45.000Z,2017-04-07T15:39:10.000Z,Province1,District1,Ward2,Village3,11,yes,no,30,yes,no,yes,no,5,grass,muddaub,,cement,no,1,1,no,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,3,5,,12,709,-19.11221446,33.48338443,uuid:18ac8e77-bdaf-47ab-85a2-e4c947c9d3ce 23 | 53,16/11/2016,54,2017-04-08T05:36:55.000Z,2017-04-08T05:52:15.000Z,Province1,District1,Ward2,Village3,10,no,yes,15,yes,no,yes,no,7,grass,muddaub,,earth,no,3,1,no,1,yes,yes,10,yes,no,no,,no,never,,1,no,,more_half,no,no,,yes,no,1,2,7,,9,681,-19.11220247,33.48335527,uuid:273ab27f-9be3-4f3b-83c9-d3e1592de919 24 | 54,16/11/2016,55,2017-04-08T05:52:32.000Z,2017-04-08T06:05:41.000Z,Province1,District1,Ward2,Village3,23,yes,no,23,yes,no,yes,no,9,grass,muddaub,,earth,no,2,2,no,1,no,,,,,,,,,1,,,,,no,no,,yes,no,1,2,9,,11,702,-19.11228974,33.48333393,uuid:883c0433-9891-4121-bc63-744f082c1fa0 25 | 58,16/11/2016,59,2017-04-08T08:52:05.000Z,2017-04-08T09:02:34.000Z,Province1,District1,Ward2,Village3,60,no,no,60,yes,yes,yes,yes,2,grass,muddaub,,earth,no,1,3,yes,2,no,,,,,,,,,2,,,,,no,no,,no,no,3,2,2,,13,683,-19.1123395,33.48333251,uuid:1936db62-5732-45dc-98ff-9b3ac7a22518 26 | 60,16/11/2016,61,2017-04-08T10:47:11.000Z,2017-04-08T11:14:09.000Z,Province1,District1,Ward2,Village3,14,yes,no,14,no,yes,yes,yes,10,grass,muddaub,,earth,no,4,1,no,2,yes,yes,13,yes,no,yes,yes,no,more_once,,2,no,,more_half,yes,no,,yes,no,3,3,10,,13,712,-19.11218035,33.48341635,uuid:2401cf50-8859-44d9-bd14-1bf9128766f2 27 | 61,16/11/2016,62,2017-04-08T13:27:58.000Z,2017-04-08T13:41:21.000Z,Province1,District1,Ward2,Village3,5,no,no,5,yes,no,no,no,5,grass,muddaub,,earth,no,3,1,no,2,no,,,,,,,,,2,,,,,no,yes,,no,no,1,3,5,,18,719,-19.11216869,33.48339699,uuid:c6597ecc-cc2a-4c35-a6dc-e62c71b345d6 28 | 62,16/11/2016,63,2017-04-08T13:41:39.000Z,2017-04-08T13:52:07.000Z,Province1,District1,Ward2,Village3,1,yes,no,10,yes,yes,no,yes,4,grass,muddaub,,earth,no,1,1,yes,1,no,,,,,,,,,1,,,,,no,yes,,no,no,1,3,4,,25,702,-19.11220024,33.4833903,uuid:86ed4328-7688-462f-aac7-d6518414526a 29 | 63,16/11/2016,64,2017-04-08T13:52:30.000Z,2017-04-08T14:02:24.000Z,Province1,District1,Ward2,Village3,1,no,no,1,no,no,no,no,6,grass,muddaub,,earth,no,2,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,3,6,,9,704,-19.1121962,33.48339576,uuid:28cfd718-bf62-4d90-8100-55fafbe45d06 30 | 68,16/11/2016,69,2017-04-09T22:08:07.000Z,2017-04-09T22:21:08.000Z,Province1,District1,Ward2,Village3,12,yes,no,12,no,no,no,no,4,grass,muddaub,,earth,no,2,1,no,1,yes,yes,12,yes,no,no,,no,more_once,,1,no,,more_half,no,no,,yes,no,1,3,4,,8,708,-19.11216693,33.48340465,uuid:f86933a5-12b8-4427-b821-43c5b039401d 31 | 
78,25/11/2016,180,2017-04-09T08:23:05.000Z,2017-04-09T08:42:02.000Z,Province1,District1,Ward2,Village1,4,no,no,50,yes,yes,yes,yes,7,grass,muddaub,,earth,no,1,1,yes,2,yes,yes,4,yes,no,no,,no,never,,2,no,,abt_half,no,yes,,yes,no,3,3,7,,4,701,-19.11204921,33.48346659,uuid:ece89122-ea99-4378-b67e-a170127ec4e6 32 | 81,28/11/2016,186,2017-04-09T15:20:26.000Z,2017-04-09T15:46:14.000Z,Province1,District1,Ward1,Village2,24,no,no,24,yes,no,yes,no,7,grass,muddaub,,earth,no,1,1,no,1,yes,yes,21,yes,no,no,,no,more_once,,1,no,,more_half,no,no,,no,no,2,3,7,,10,690,-19.11232028,33.48346266,uuid:268bfd97-991c-473f-bd51-bc80676c65c6 33 | 82,28/11/2016,187,2017-04-09T15:48:14.000Z,2017-04-09T16:12:46.000Z,Province1,District1,Ward2,Village2,1,yes,no,43,yes,no,no,yes,5,grass,muddaub,,earth,no,3,2,yes,2,yes,yes,24,yes,no,yes,no,no,more_once,,2,no,,more_half,no,no,,yes,no,4,3,5,,13,691,-19.1122345,33.48349248,uuid:0a42c9ee-a840-4dda-8123-15c1bede5dfc 34 | 87,21/11/2016,201,2017-04-09T19:31:47.000Z,2017-04-09T19:45:38.000Z,Province1,District1,Ward2,Village2,6,yes,no,6,yes,yes,yes,yes,4,grass,muddaub,,earth,no,1,2,no,2,no,,,,,,,,,2,,,,,no,yes,,yes,no,2,2,4,,10,685,-19.11217226,33.48338504,uuid:9e79a31c-3ea5-44f0-80f9-a32db49422e3 35 | 89,26/04/2017,72,2017-04-26T15:46:24.000Z,2017-04-26T16:13:33.000Z,Province1,District1,Ward2,Village1,24,yes,no,24,yes,yes,yes,yes,6,grass,muddaub,,earth,no,2,1,no,3,yes,yes,4,yes,no,yes,yes,no,more_once,,3,no,,more_half,no,yes,,yes,no,3,2,6,,5,716,-19.11221489,33.48347358,uuid:c4a2c982-244e-45a5-aa4b-71fa53f99e18 36 | 95,27/04/2017,101,2017-04-27T16:42:02.000Z,2017-04-27T18:11:54.000Z,Province1,District1,Ward2,Village2,10,no,no,4,yes,yes,no,no,3,grass,muddaub,,earth,no,1,1,no,2,yes,yes,10,yes,no,no,,no,never,,2,no,,abt_half,no,yes,,no,yes,1,3,3,,11,702,-19.11223501,33.48341767,uuid:3c174acd-e431-4523-9ad6-eb14cddca805 37 | 106,04/05/2017,118,2017-05-04T10:26:35.000Z,2017-05-04T10:46:35.000Z,Province1,District1,Ward2,Village1,6,no,no,25,yes,yes,yes,no,5,grass,muddaub,,earth,no,2,1,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,3,5,,9,713,-19.11220573,33.48344178,uuid:77335b2e-8812-4a35-b1e5-ca9ab626dfea 38 | 108,04/05/2017,119,2017-05-04T11:16:57.000Z,2017-05-04T11:38:38.000Z,Province1,District1,Ward2,Village1a,14,no,yes,14,yes,no,no,no,3,grass,muddaub,,earth,no,4,1,no,2,yes,no,3,yes,no,no,,no,never,,2,no,,more_half,no,no,,no,no,4,3,3,,5,706,-19.11225658,33.48334839,uuid:fa201fce-4e94-44b8-b435-c558c2e1ed55 39 | 112,11/05/2017,117,2017-05-11T06:28:02.000Z,2017-05-11T06:55:35.000Z,Province1,District1,Ward2,Village1a,1,no,no,28,yes,yes,yes,no,10,grass,muddaub,,cement,no,1,4,no,1,no,,,,,,,,,1,,,,,no,yes,,yes,no,1,3,10,,20,0,-19.1114691,33.4761047,uuid:3fe626b3-c794-48e1-a80f-5bfe440c507b 40 | 115,18/05/2017,150,2017-05-18T10:37:37.000Z,2017-05-18T10:56:00.000Z,Province1,District1,Ward2,Village1,8,no,no,8,no,yes,no,yes,7,grass,muddaub,,earth,no,2,1,no,2,yes,yes,8,yes,no,no,,no,never,,2,no,,less_half,no,yes,,no,yes,1,3,7,,17,709,-19.11147739,33.47618369,uuid:92613d0d-e7b1-4d62-8ea4-451d7cd0a982 41 | 119,03/06/2017,166,2017-06-03T05:53:28.000Z,2017-06-03T06:25:06.000Z,Province1,District1,Ward2,Village1,16,no,no,16,yes,yes,yes,yes,11,grass,muddaub,,earth,no,2,1,no,3,yes,yes,2,yes,no,no,,no,never,,3,no,,less_half,no,yes,,yes,no,1,2,11,,1799.999,0,-19.1138589,33.4826653,uuid:40aac732-94df-496c-97ba-5b67f59bcc7a 42 | 
120,03/06/2017,167,2017-06-03T06:25:09.000Z,2017-06-03T06:45:06.000Z,Province1,District1,Ward2,Village1,16,no,no,24,yes,no,yes,no,8,grass,muddaub,,earth,no,5,1,no,2,yes,yes,16,yes,no,no,,no,never,,2,no,,less_half,no,yes,,yes,yes,3,2,8,,2000,0,-19.1149887,33.4882685,uuid:a9d1a013-043b-475d-a71b-77ed80abe970 43 | 128,04/06/2017,194,2017-06-04T10:13:36.000Z,2017-06-04T10:32:06.000Z,Province1,District1,Ward2,Village1,5,no,no,5,no,no,yes,no,4,grass,muddaub,,earth,no,2,1,no,2,yes,yes,2,yes,no,no,,no,more_once,,2,no,,more_half,no,yes,,no,no,1,3,4,,10,719,-19.11227133,33.48347111,uuid:95c11a30-d44f-40c4-8ea8-ec34fca6bbbf 44 | -------------------------------------------------------------------------------- /episodes/data/SN7577.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/data/SN7577.sqlite -------------------------------------------------------------------------------- /episodes/data/SN7577i_a.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3,Q4 2 | 1,1,-1,1,8 3 | 2,3,-1,1,4 4 | 3,10,3,2,6 5 | 4,9,-1,10,10 6 | 5,10,2,6,1 7 | 6,1,-1,1,1 8 | 7,1,-1,1,8 9 | 8,1,-1,1,1 10 | 9,9,-1,10,10 11 | 10,2,-1,1,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_aa.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3 2 | 1,1,-1,1 3 | 2,3,-1,1 4 | 3,10,3,2 5 | 4,9,-1,10 6 | 5,10,2,6 7 | 6,1,-1,1 8 | 7,1,-1,1 9 | 8,1,-1,1 10 | 9,9,-1,10 11 | 10,2,-1,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_b.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q3,Q4 2 | 1277,10,10,4,6 3 | 1278,2,-1,5,4 4 | 1279,2,-1,4,5 5 | 1280,1,-1,2,3 6 | 1281,10,2,3,4 7 | 1282,2,-1,3,6 8 | 1283,10,10,2,10 9 | 1284,9,-1,8,9 10 | 1285,11,11,1,2 11 | 1286,10,6,6,6 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_bb.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2,Q4 2 | 1277,10,10,6 3 | 1278,2,-1,4 4 | 1279,2,-1,5 5 | 1280,1,-1,3 6 | 1281,10,2,4 7 | 1282,2,-1,6 8 | 1283,10,10,10 9 | 1284,9,-1,9 10 | 1285,11,11,2 11 | 1286,10,6,6 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_c.csv: -------------------------------------------------------------------------------- 1 | Id,maritl,numhhd 2 | 1,6,3 3 | 2,4,3 4 | 3,6,2 5 | 4,4,1 6 | 5,4,1 7 | 6,2,2 8 | 7,2,2 9 | 8,2,2 10 | 9,6,2 11 | 10,6,1 12 | -------------------------------------------------------------------------------- /episodes/data/SN7577i_d.csv: -------------------------------------------------------------------------------- 1 | Id,Q1,Q2 2 | 1,1,-1 3 | 2,3,-1 4 | 3,10,3 5 | 4,9,-1 6 | 5,10,2 7 | 6,1,-1 8 | 7,1,-1 9 | 8,1,-1 10 | 9,9,-1 11 | 10,2,-1 12 | -------------------------------------------------------------------------------- /episodes/fig/Python_date_format_01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_date_format_01.png -------------------------------------------------------------------------------- 
/episodes/fig/Python_function_parameters_9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_function_parameters_9.png -------------------------------------------------------------------------------- /episodes/fig/Python_install_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_install_1.png -------------------------------------------------------------------------------- /episodes/fig/Python_install_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_install_2.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_6.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_7.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyter_8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyter_8.png -------------------------------------------------------------------------------- /episodes/fig/Python_jupyterl_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_jupyterl_6.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_3.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_4.png -------------------------------------------------------------------------------- /episodes/fig/Python_repl_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repl_5.png -------------------------------------------------------------------------------- /episodes/fig/Python_repll_3.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/Python_repll_3.png -------------------------------------------------------------------------------- /episodes/fig/barplot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/barplot1.png -------------------------------------------------------------------------------- /episodes/fig/boxplot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot1.png -------------------------------------------------------------------------------- /episodes/fig/boxplot2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot2.png -------------------------------------------------------------------------------- /episodes/fig/boxplot3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/boxplot3.png -------------------------------------------------------------------------------- /episodes/fig/functionAnatomy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/functionAnatomy.png -------------------------------------------------------------------------------- /episodes/fig/histogram1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/histogram1.png -------------------------------------------------------------------------------- /episodes/fig/histogram3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/histogram3.png -------------------------------------------------------------------------------- /episodes/fig/lm1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/lm1.png -------------------------------------------------------------------------------- /episodes/fig/pandas_join_types.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/pandas_join_types.png -------------------------------------------------------------------------------- /episodes/fig/scatter1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/scatter1.png -------------------------------------------------------------------------------- /episodes/fig/scatter2.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/datacarpentry/python-socialsci/f20f736d789fb014fbab0836e68359ba9a015c3f/episodes/fig/scatter2.png -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | maintainers: 3 | - Stephen Childs 4 | - Geoffrey Boushey 5 | - Annajiat Alim Rasel 6 | permalink: index.html 7 | site: sandpaper::sandpaper_site 8 | --- 9 | 10 | **Lesson Maintainers:** {{ page.maintainers | join: ', ' }} 11 | 12 | Python is a general-purpose programming language that is useful for writing scripts to work effectively and reproducibly with data. 13 | 14 | This is an introduction to Python designed for participants with no programming experience. These lessons can be taught in a day (~ 6 hours). They start with some basic information about Python syntax and the Jupyter notebook interface, then move through importing CSV files, using the pandas package to work with data frames, calculating summary information from a data frame, and a brief introduction to plotting. The last lesson demonstrates how to work with databases directly from Python. 15 | 16 | :::::::::::::::::::::::::::::::::::::::::: prereq 17 | 18 | ## Getting Started 19 | 20 | Data Carpentry's teaching is hands-on, so participants are encouraged to use 21 | their own computers to ensure the proper setup of tools for an efficient 22 | workflow. 23 | 24 | **These lessons assume no prior knowledge of the skills or tools.** 25 | 26 | To get started, follow the directions in the "[Setup](learners/setup.md)" tab to 27 | download data to your computer and follow any installation instructions. 28 | 29 | 30 | :::::::::::::::::::::::::::::::::::::::::::::::::: 31 | 32 | :::::::::::::::::::::::::::::::::::::::::: prereq 33 | 34 | ## Prerequisites 35 | 36 | This lesson requires a working copy of **Python**. 37 | 38 | To most effectively use these materials, please make sure to install 39 | everything *before* working through this lesson and download the data files mentioned in the [Setup](learners/setup.md) tab. 40 | 41 | 42 | :::::::::::::::::::::::::::::::::::::::::::::::::: 43 | 44 | :::::::::::::::::::::::::::::::::::::::::: prereq 45 | 46 | ## For Instructors 47 | 48 | If you are teaching this lesson in a workshop, please see the 49 | [Instructor notes](instructors/instructor-notes.md). 50 | 51 | 52 | :::::::::::::::::::::::::::::::::::::::::::::::::: 53 | 54 | 55 | -------------------------------------------------------------------------------- /instructors/instructor-notes.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Instructor Notes 3 | --- 4 | 5 | ## Setup 6 | 7 | The setup instructions for installing the Anaconda distribution of Python are in the [setup](../learners/setup.md) file. 8 | The Anaconda distribution contains all of the third-party libraries used. 9 | 10 | pip is referred to in the text, but it should not need to be used. 11 | 12 | It is assumed that Jupyter notebooks will be used for all of the coding. (The shell is used when explaining the REPL.) 13 | 14 | How to start Jupyter is included in the setup instructions. 15 | 16 | ## The datasets used 17 | 18 | All of the datasets used have been placed in the data folder. 19 | 20 | They should be downloaded to the local machine before use.
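A quick way to confirm that a notebook can see the data files before teaching starts (a minimal sketch; `SAFI_results.csv` stands in for any of the lesson's data files):

```python
import os

# The lesson's code examples assume the data files sit next to the
# notebook, i.e. in the current working directory.
print(os.getcwd())
print(os.path.exists("SAFI_results.csv"))
```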
21 | 22 | The code examples are written on the assumption that the data files are in the same folder as the notebook. 23 | 24 | ## The Lessons 25 | 26 | It is assumed that all of the work will be done in Jupyter notebooks. There are no specific instructions on when to start a notebook; starting a new one at each episode would be appropriate. 27 | 28 | If a new notebook is created or a new Jupyter session is started, any modules that need to be used will have to be imported again. 29 | 30 | Each section of code within an episode will not necessarily re-import modules that are needed for that section. 31 | 32 | [Introduction to Python](link) 33 | 34 | An introduction to the benefits of the Python language. 35 | 36 | Comparison between interpreted and compiled languages. 37 | 38 | An explanation and example of using the REPL. 39 | 40 | An introduction to using Jupyter notebooks. 41 | 42 | [Python basics](link) 43 | 44 | Using Jupyter - creating new cells, hiding output, changing cell type. 45 | 46 | Python variables and data types. 47 | 48 | Using `print` and built-in functions. 49 | 50 | Python help and Internet help for Python. 51 | 52 | Strings and string functions. 53 | 54 | Lists and the `range` function (dictionaries kept for later, when needed). 55 | 56 | [Python control structures](link) 57 | 58 | Explanation of typical program structure and the need for control structures. 59 | 60 | Explanation and examples of the `if`, `while` and `for` constructs. 61 | 62 | [Creating re-usable code](link) 63 | 64 | Explanation of functions and why we use them. 65 | 66 | Creating functions. 67 | 68 | Using parameters in functions. 69 | 70 | Python libraries and how to install and use them. 71 | 72 | [Processing data from a file](link) 73 | 74 | This episode starts to use some of the control structures and files to create complete small programs with 75 | a proper 'Input - Processing - Output' structure. 76 | 77 | Different approaches to reading files, i.e. one record at a time vs. reading the whole file. 78 | 79 | Opening and closing files and file handles. 80 | 81 | Description of a CSV file as a list of strings. 82 | 83 | Iterating over complete files. 84 | 85 | Use of the `split` function to parse a line from a CSV file and extract specific elements. 86 | 87 | Writing extracted information to a file. 88 | 89 | The Python dictionary: explanation and examples. 90 | 91 | Creating and populating a dictionary on the fly from data read from a file. 92 | 93 | [Date and Time](link) 94 | 95 | Need to import the datetime module. 96 | 97 | Explanation of format strings to provide a representation of the dates and times. 98 | 99 | Need to convert date/time strings to a datetime structure before using them. 100 | 101 | The ability to extract individual components from, and do arithmetic with, datetime structures. 102 | 103 | [Processing JSON data](link) 104 | 105 | This episode is programmatically the most complex. However, the examples are built up as gently as possible. 106 | The solution to the second exercise needs to be understood to make it worthwhile continuing with the coding. 107 | 108 | The episode starts with further details of the dictionary object and how it can be used to represent nested data structures. 109 | 110 | The JSON data format is described and compared to a Python dictionary. 111 | 112 | An example of using the json module to read a file in JSON format is explained. 113 | 114 | How to make the printed JSON more presentable using `json.dumps` parameters is demonstrated.
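As a minimal sketch of the pattern these notes describe - reading a JSON file and pretty-printing it with `json.dumps` - assuming `SAFI.json` is in the same folder as the notebook and holds a list of interview records:

```python
import json

# Read the whole file into a Python data structure.
with open("SAFI.json") as f:
    data = json.load(f)

# The indent and sort_keys parameters of json.dumps make the printed
# output far easier to read than printing the raw structure.
print(json.dumps(data[0], indent=4, sort_keys=True))
```

Showing learners the unformatted `print(data[0])` first makes the value of the formatting parameters obvious.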
115 | 116 | Systematically extracting single items from a dictionary is covered, followed by extracting all such entries 117 | from a file of JSON. 118 | 119 | The overall aim is to demonstrate extracting a set of specific fields from a JSON formatted file and writing them to a flat structured CSV file, making subsequent processing more straightforward. 120 | 121 | [Pandas](link) 122 | 123 | This episode introduces the pandas module and explains the advantages of using it for data analysis. 124 | 125 | The two key pandas data structures are introduced. 126 | 127 | Examples of reading a CSV file into a dataframe, optionally selecting only specific columns. 128 | 129 | Various methods for extracting information about the dataframe are covered. 130 | 131 | [Extracting rows and columns](link) 132 | 133 | Basic pandas methods for extracting rows and columns from a dataframe are covered. 134 | 135 | For row selection the emphasis is on specifying criteria, although random selection is also covered. 136 | 137 | [Aggregations\_and\_missing\_data](link) 138 | 139 | Explain why we want to summarise data. 140 | 141 | Introduce the basic aggregation methods for a dataframe and individual columns. 142 | 143 | The aggregation examples will turn up 'NaN' values. Missing data and its representations are discussed. 144 | 145 | The effects of missing data in summarisation are discussed. 146 | 147 | Ways of recognising and dealing with missing data are discussed and demonstrated. 148 | 149 | Summarising categorical variables using the `groupby` method is discussed and demonstrated. 150 | 151 | [Joins](link) 152 | 153 | Explain why we want to join dataframes and the necessary conditions for doing so. 154 | 155 | Simple concatenation of rows using `concat` and the effect of the columns not being the same. 156 | 157 | The downside of using `concat` to join by columns. 158 | 159 | The use of `merge` and its similarity with the SQL JOIN clause. 160 | 161 | The different types of joins available with `merge`. 162 | 163 | Discussion and examples of what different join types tell you about the data. 164 | 165 | [Long\_and\_wide\_data\_formats](link) 166 | 167 | The difference between wide and long formats. 168 | 169 | Use and examples of the `melt` and `pivot` methods to convert between them. 170 | 171 | Because the SN7577 dataset has no 'key' column, the setup for the examples is more complex, but it makes use of previously used techniques as well as introducing different ways of selecting columns from a dataframe. 172 | 173 | The final exercise is quite long, but brings together the concepts learned in this and the previous two episodes. 174 | 175 | [Data visualisation using Matplotlib](link) 176 | 177 | The basic aim of this episode is to illustrate how simple graphs can be produced using matplotlib. 178 | 179 | This episode uses entirely random data produced by numpy methods. 180 | 181 | Four commonly used graph types are demonstrated: bar charts, histograms, scatter plots and boxplots. 182 | 183 | The tight integration between pandas and matplotlib is discussed and illustrated. 184 | 185 | How graphs can be built up from individual components is covered. 186 | 187 | Saving the produced graph to a file is explained and demonstrated. 188 | 189 | [Accessing SQLite Databases](link) 190 | 191 | A comparison between a relational database table and a pandas dataframe is made. 192 | 193 | The possible advantages of using data from a database (SQLite in this case) rather than having it in a dataframe are explained.
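A minimal sketch of the access pattern covered in these notes (connection, cursor, `execute`, and the pandas equivalent), assuming `SN7577.sqlite` is in the same folder as the notebook; the table name `SN7577` is an assumption based on the dataset's name and should be checked against the actual file:

```python
import sqlite3
import pandas as pd

# The connection string for SQLite is just the path to the database file.
con = sqlite3.connect("SN7577.sqlite")

# The cursor runs SQL statements and holds their results.
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
rows = cur.fetchall()   # a list of tuples, one per row
print(len(rows))

# Pandas can run the same query in a single call, returning a dataframe.
df = pd.read_sql_query("SELECT * FROM SN7577", con)
print(df.shape)

con.close()
```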
194 | 195 | The construction of an SQLite database as a single file is covered. 196 | 197 | The sqlite3 module is introduced and connection strings are explained. 198 | 199 | The use of the connection and the cursor is explained. 200 | 201 | The use of the `execute` method to run an SQL query on the database system is demonstrated. 202 | 203 | Various ways of retrieving the results of the query are covered. 204 | 205 | Pandas is used to run a similar query, and its greater simplicity is explained. 206 | 207 | The problem with running DML statements with pandas is illustrated in the final exercise. 208 | 209 | 210 | -------------------------------------------------------------------------------- /learners/discuss.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Discussion 3 | --- 4 | 5 | FIXME 6 | 7 | 8 | -------------------------------------------------------------------------------- /learners/reference.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Glossary' 3 | --- 4 | 5 | ## Glossary 6 | 7 | 0-based indexing 8 | : is a way of assigning indices to elements in a sequential, ordered data structure 9 | starting from 0, i.e. where the first element of the sequence has index 0. 10 | 11 | attribute 12 | : a property of an object that can be viewed; accessed with a `.` but no `()`, e.g. `df.dtypes` 13 | 14 | boolean 15 | : a data type that can be `True` or `False` 16 | 17 | cast 18 | : the process of changing the type of a variable; in Python the data type names operate as functions for casting, for example `int(3.5)` 19 | 20 | CSV (file) 21 | : is an acronym which stands for Comma-Separated Values file. CSV files store 22 | tabular data, either numbers, strings, or a combination of the two, in plain 23 | text with columns separated by a comma and rows by the newline character. 24 | 25 | database 26 | : is an organized collection of data. 27 | 28 | DataFrame 29 | : is a two-dimensional labeled data structure with columns of (potentially) 30 | different types. 31 | 32 | data structure 33 | : is a particular way of organizing data in memory. 34 | 35 | data type 36 | : is a particular kind of item that can be assigned to a variable, defined by 37 | the values it can take, the programming language in use and the operations 38 | that can be performed on it. examples: `int` (integer), `str` (string), float, boolean, list 39 | 40 | docstring 41 | : is a recommended documentation string to describe what a Python function does. 42 | 43 | faceting 44 | : is the act of plotting relationships between set variables in multiple subsets 45 | of the data with the results appearing as different panels in the same figure. 46 | 47 | float 48 | : is a Python data type designed to store positive and negative decimal numbers 49 | by means of a floating point representation. 50 | 51 | function 52 | : is a group of related statements that perform a specific task. 53 | 54 | integer 55 | : is a Python data type designed to store positive and negative integer numbers. 56 | 57 | interactive mode 58 | : is an online mode of operation in which the user writes the commands directly 59 | on the command line one-by-one and executes them immediately by pressing a key 60 | on the keyboard, usually Enter. 61 | 62 | join key 63 | : is a variable or an array representing the column names over which pandas.DataFrame.join() 64 | merges together columns of different data sets.
65 | 66 | library 67 | : is a set of functions and methods grouped together to perform some specific 68 | sort of tasks. 69 | 70 | list 71 | : is a Python data structure designed to contain sequences of integers, floats, 72 | strings and any combination of the previous. The sequence is ordered and indexed 73 | by integers, starting from 0. Elements of a list can be accessed by their index 74 | and can be modified. 75 | 76 | loop 77 | : is a sequence of instructions that is continually repeated until a condition 78 | is satisfied. 79 | 80 | method 81 | : a function that is specific to a type of data, accessed via `.`; requires `()` to run, for example `df.sum()` 82 | 83 | NaN 84 | : is an acronym for Not-a-Number and represents that either a value is missing or 85 | the calculation cannot output any meaningful result. 86 | 87 | None 88 | : is an object that represents no value. 89 | 90 | scripting mode 91 | : is an offline mode of operation in which the user writes the commands to be 92 | executed in a text file (with .py extension for Python) which is then compiled 93 | or interpreted to run the program. Note that Python interprets the script at 94 | run-time, compiling a bytecode version of the program to speed up execution. 95 | 96 | sequential (data structure) 97 | : is an ordered group of objects stored in memory which can be accessed by specifying 98 | their index, i.e. their position, in the structure. 99 | 100 | string 101 | : is a Python data type designed to store sequences of characters. 102 | 103 | tuple 104 | : is a Python data structure designed to contain sequences of integers, floats, 105 | strings and any combination of the previous. The sequence is ordered and indexed 106 | by integers, starting from 0. Elements of a tuple can be accessed by their index 107 | but cannot be modified. 108 | 109 | variable 110 | : a named quantity that can store a value; a variable can store any type, but always one type for a given value. 111 | 112 | ## Jupyter Notebook Hints 113 | 114 | `Esc` will take you into command mode, where you can navigate around your notebook with the arrow keys. 115 | 116 | ### While in command mode: 117 | 118 | - A to insert a new cell above the current cell. 119 | - B to insert a new cell below. 120 | - M to change the current cell to Markdown. 121 | - Y to change it back to code. 122 | - D + D (press the key twice) to delete the current cell. 123 | - Enter will take you from command mode back into edit mode for the given cell. 124 | 125 | ### While in edit mode: 126 | 127 | - Ctrl + Shift + \- will split the current cell into two from where your cursor is. 128 | - Shift + Enter will run the current cell. 129 | 130 | ### Full Shortcut Listing 131 | 132 | Cmd + Shift + P (or Ctrl + Shift + P on Linux and Windows) opens the command palette, which lists all available shortcuts. 133 | 134 | 135 | -------------------------------------------------------------------------------- /learners/setup.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Setup 3 | --- 4 | 5 | :::::::::::::::::::::::::::::::::::::::::: prereq 6 | 7 | ## Data 8 | 9 | Data for this lesson is from [The SAFI Teaching Database](https://datacarpentry.org/socialsci-workshop/data) and the [Audit of Political Engagement 11, 2013](https://doi.org/10.5255/UKDA-SN-7577-1). 10 | 11 | We will use the files listed below as the data for this lesson. You can download the files by clicking on the links.
12 | 13 | **Note: make sure to place the data files in the same folder as your notebook.** 14 | 15 | - [SAFI\_clean.csv](/data/SAFI_clean.csv) 16 | - [SAFI\_results.csv](/data/SAFI_results.csv) 17 | - [SAFI\_full\_shortname.csv](/data/SAFI_full_shortname.csv) 18 | - [SAFI.json](/data/SAFI.json) 19 | - [SN7577.tab](/data/SN7577.tab) 20 | - [SN7577i\_a.csv](/data/SN7577i_a.csv) 21 | - [SN7577i\_aa.csv](/data/SN7577i_aa.csv) 22 | - [SN7577i\_b.csv](/data/SN7577i_b.csv) 23 | - [SN7577i\_bb.csv](/data/SN7577i_bb.csv) 24 | - [SN7577i\_c.csv](/data/SN7577i_c.csv) 25 | - [SN7577i\_d.csv](/data/SN7577i_d.csv) 26 | - [SN7577.sqlite](/data/SN7577.sqlite) 27 | - [Newspapers.csv](/data/Newspapers.csv) 28 | 29 | 30 | :::::::::::::::::::::::::::::::::::::::::::::::::: 31 | 32 | :::::::::::::::::::::::::::::::::::::::::: prereq 33 | 34 | ## Software 35 | 36 | [Python](https://python.org) is a popular language for 37 | scientific computing, and great for general-purpose programming as 38 | well. Installing all of its scientific packages individually can be 39 | a bit difficult, so we recommend an all-in-one installer. 40 | 41 | For this workshop we use Python version 3.x. 42 | 43 | ### Required Python Packages for this workshop 44 | 45 | - [Pandas](https://pandas.pydata.org/) 46 | - [Jupyter notebook](https://jupyter.org/) 47 | - [Numpy](https://www.numpy.org/) 48 | - [Matplotlib](https://matplotlib.org/) 49 | - [Seaborn](https://seaborn.pydata.org) 50 | 51 | 52 | :::::::::::::::::::::::::::::::::::::::::::::::::: 53 | 54 | ## Setup instructions for Python 55 | 56 | In order to complete the materials for the Python lesson, you will need Python to be installed on your machine. As many of the examples and exercises use Jupyter notebooks, you will need Jupyter to be installed as well. 57 | 58 | The [Anaconda](https://www.anaconda.com/download/) distribution of Python will allow you to install both Python and Jupyter notebooks as a single install. Anaconda will also install many other commonly used Python packages. 59 | 60 | ### How to install the Anaconda distribution of Python 61 | 62 | 1. Follow the Anaconda link above to the Anaconda website. There are versions of Anaconda available for Windows, macOS, and Linux. The website will detect your operating system and provide a link to the appropriate download. 63 | 2. There will be two options, one for Python 2.x and another for Python 3.x. We will take the Python 3.x option. Python 2.x will eventually be phased out but is still provided for backward compatibility with some older optional Python modules. The majority of popular modules have been converted to work with Python 3.x. The actual value of x will vary depending on when you download. At the time of writing I am being offered Python 3.6 or Python 2.7. 64 | 3. For Windows and Linux there is the option of either a 64 bit (default) download or a 32 bit download. Unless you know that you have an old 32 bit PC, you should choose the 64 bit installer. 65 | 4. Run the downloaded installer program. Accept the default settings until you are given the option to add Anaconda to your PATH environment variable. Despite the recommendation not to and the subsequent warning, you should select this option. This will make it easier later on to start Jupyter notebooks from any location. 66 | 5. The installation can take a few minutes. When finished you should be able to open a cmd prompt (type `cmd` from the Windows start menu) and type `python` into the cmd window. You should get a display similar to that below.
67 | ![](/fig/Python_install_1.png){alt='Python Install'} 68 | 6. The `>>>` prompt tells you that you are in the Python environment. You can exit Python with the `exit()` command. 69 | 70 | ### Running Jupyter Notebooks in Windows 71 | 72 | 1. From File Explorer, navigate to and select the folder which will contain your Jupyter notebooks (it can be empty initially). 73 | 2. Hold down the `shift` key and right-click the mouse. 74 | 3. The pop-up menu items will include an option to start a cmd window or, in the latest Windows releases, a ‘PowerShell' window. Select whichever appears. 75 | 4. When the window opens, type the command `jupyter notebook`. 76 | 5. Several messages will appear in the command window. In addition, your default web browser will open and display the Jupyter notebook home page. The main part of this is a file browser window starting at the folder you selected in step 1. 77 | 6. There may be existing notebooks which you can select and open in a new tab in your browser, or there is a menu option to create a new notebook. 78 | ![](/fig/Python_install_2.png){alt='Python Install'} 79 | 7. The Jupyter package creates a small web service and opens your browser pointing at it. If your browser does not open, you can open it manually and specify ‘localhost:8888' as the URL. 80 | 8. Port 8888 is the default port used by the Jupyter web service, but if it is already in use Jupyter will increment the port number automatically. Either way, the port number it uses is given in a message in the cmd/powershell window. 81 | 9. Once running, the cmd/powershell window will display additional messages, e.g. about saving notebooks, but there is no need to interact with it directly. The window can be minimized and ignored. 82 | 10. To shut Jupyter down, select the cmd/powershell window, press Ctrl\+c twice and then close the window. 83 | 84 | 85 | -------------------------------------------------------------------------------- /profiles/learner-profiles.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: FIXME 3 | --- 4 | 5 | This is a placeholder file. Please add content here. 6 | -------------------------------------------------------------------------------- /site/README.md: -------------------------------------------------------------------------------- 1 | This directory contains rendered lesson materials. Please do not edit files 2 | here. 3 | --------------------------------------------------------------------------------