├── .editorconfig
├── .github
│   └── workflows
│       ├── README.md
│       ├── pr-close-signal.yaml
│       ├── pr-comment.yaml
│       ├── pr-post-remove-branch.yaml
│       ├── pr-preflight.yaml
│       ├── pr-receive.yaml
│       ├── sandpaper-main.yaml
│       ├── sandpaper-version.txt
│       ├── update-cache.yaml
│       └── update-workflows.yaml
├── .gitignore
├── .zenodo.json
├── AUTHORS
├── CITATION
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── config.yaml
├── episodes
│   ├── 00-intro.md
│   ├── 01-format-data.md
│   ├── 02-common-mistakes.md
│   ├── 03-dates-as-data.md
│   ├── 04-quality-assurance.md
│   ├── 05-exporting-data.md
│   └── fig
│       ├── bad-formatting.png
│       ├── better-formatting.png
│       ├── comments-in-cells.png
│       ├── data-validation-numbers-LibreOffice-new.png
│       ├── data-validation-numbers-new.png
│       ├── data-validation-tab-LibreOffice-new.png
│       ├── data-validation-tab-new.png
│       ├── error-invalid-data-LibreOffice-new.png
│       ├── error-invalid-data-new.png
│       ├── error_alert-new.png
│       ├── error_alert_LibreOffice-new.png
│       ├── excel-to-csv.png
│       ├── excel_dates_1.jpg
│       ├── filled-range-of-values-LibreOffice-new.png
│       ├── input_message-new.png
│       ├── input_message_LibreOffice-new.png
│       ├── load-csv-calc.png
│       ├── load-csv-excel.png
│       ├── multiple-info.png
│       ├── multiple-tables-example.png
│       ├── multiple-tables-example2.png
│       ├── select-range-of-values-LibreOffice-new.png
│       ├── select-range-of-values-new.png
│       ├── single-info.png
│       ├── solution_exercise_1_dates.png
│       ├── spreadsheet_simple_data_01.png
│       ├── spreadsheets_Data_validation_05.png
│       ├── white_table_1.jpg
│       └── zeros-example.png
├── index.md
├── instructors
│   └── instructor-notes.md
├── learners
│   ├── reference.md
│   └── setup.md
├── profiles
│   └── learner-profiles.md
└── site
    └── README.md
/.editorconfig:
--------------------------------------------------------------------------------
1 | root = true
2 |
3 | [*]
4 | charset = utf-8
5 | insert_final_newline = true
6 | trim_trailing_whitespace = true
7 |
8 | [*.md]
9 | indent_size = 2
10 | indent_style = space
11 | max_line_length = 100 # Please keep this in sync with bin/lesson_check.py!
12 | trim_trailing_whitespace = false # keep trailing spaces in markdown - 2+ spaces are translated to a hard break (<br/>)
13 |
14 | [*.r]
15 | max_line_length = 80
16 |
17 | [*.py]
18 | indent_size = 4
19 | indent_style = space
20 | max_line_length = 79
21 |
22 | [*.sh]
23 | end_of_line = lf
24 |
25 | [Makefile]
26 | indent_style = tab
27 |
--------------------------------------------------------------------------------
/.github/workflows/README.md:
--------------------------------------------------------------------------------
1 | # Carpentries Workflows
2 |
3 | This directory contains workflows to be used for Lessons using the {sandpaper}
4 | lesson infrastructure. Two of these workflows require R (`sandpaper-main.yaml`
5 | and `pr-receive.yaml`) and the rest are bots to handle pull request management.
6 |
7 | These workflows will likely change as {sandpaper} evolves, so it is important to
8 | keep them up-to-date. To do this in your lesson you can do the following in your
9 | R console:
10 |
11 | ```r
12 | # Install/Update sandpaper
13 | options(repos = c(carpentries = "https://carpentries.r-universe.dev/",
14 | CRAN = "https://cloud.r-project.org"))
15 | install.packages("sandpaper")
16 |
17 | # update the workflows in your lesson
18 | library("sandpaper")
19 | update_github_workflows()
20 | ```
21 |
22 | Inside this folder, you will find a file called `sandpaper-version.txt`, which
23 | will contain a version number for sandpaper. This will be used in the future to
24 | alert you if a workflow update is needed.
25 |
26 | What follows are the descriptions of the workflow files:
27 |
28 | ## Deployment
29 |
30 | ### 01 Build and Deploy (sandpaper-main.yaml)
31 |
32 | This is the main driver that will only act on the main branch of the repository.
33 | This workflow does the following:
34 |
35 | 1. checks out the lesson
36 | 2. provisions the following resources
37 | - R
38 | - pandoc
39 | - lesson infrastructure (stored in a cache)
40 | - lesson dependencies if needed (stored in a cache)
41 | 3. builds the lesson via `sandpaper:::ci_deploy()`
42 |
43 | #### Caching
44 |
45 | This workflow has two caches; one cache is for the lesson infrastructure and
46 | the other is for the lesson dependencies if the lesson contains rendered
47 | content. These caches are invalidated by new versions of the infrastructure and
48 | the `renv.lock` file, respectively. If there is a problem with the cache,
49 | manual invalidation is necessary. You will need maintain access to the repository.
50 | You can either go to the actions tab and [click on the caches button to find
51 | and invalidate the failing cache](https://github.blog/changelog/2022-10-20-manage-caches-in-your-actions-workflows-from-web-interface/)
52 | or set the `CACHE_VERSION` secret to the current date (which will
53 | invalidate all of the caches).
54 |
55 | ## Updates
56 |
57 | ### Setup Information
58 |
59 | These workflows run on a schedule and at the maintainer's request. Because they
60 | create pull requests that update workflows/require the downstream actions to run,
61 | they need a special repository/organization secret token called
62 | `SANDPAPER_WORKFLOW` and it must have the `public_repo` and `workflow` scope.
63 |
64 | This can be an individual user token, OR it can be a trusted bot account. If you
65 | have a repository in one of the official Carpentries accounts, then you do not
66 | need to worry about this token being present because the Carpentries Core Team
67 | will take care of supplying this token.
68 |
69 | If you want to use your personal account: you can go to
70 |
71 | to create a token. Once you have created your token, you should copy it to your
72 | clipboard and then go to your repository's settings > secrets > actions and
73 | create or edit the `SANDPAPER_WORKFLOW` secret, pasting in the generated token.
74 |
75 | If you do not specify your token correctly, the runs will not fail and they will
76 | give you instructions to provide the token for your repository.
77 |
78 | ### 02 Maintain: Update Workflow Files (update-workflows.yaml)
79 |
80 | The {sandpaper} repository was designed to do as much as possible to separate
81 | the tools from the content. For local builds, this is absolutely true, but
82 | there is a minor issue when it comes to workflow files: they must live inside
83 | the repository.
84 |
85 | This workflow ensures that the workflow files are up-to-date. It works by
86 | downloading the update-workflows.sh script from GitHub and running it. The script
87 | will do the following:
88 |
89 | 1. check the recorded version of sandpaper against the current version on github
90 | 2. update the files if there is a difference in versions
91 |
92 | After the files are updated, if there are any changes, they are pushed to a
93 | branch called `update/workflows` and a pull request is created. Maintainers are
94 | encouraged to review the changes and accept the pull request if the outputs
95 | are okay.
96 |
97 | This update is run weekly or on demand.
98 |
99 | ### 03 Maintain: Update Package Cache (update-cache.yaml)
100 |
101 | For lessons that have generated content, we use {renv} to ensure that the output
102 | is stable. This is controlled by a single lockfile which documents the packages
103 | needed for the lesson and the version numbers. This workflow is skipped in
104 | lessons that do not have generated content.
105 |
106 | Because the lessons need to remain current with the package ecosystem, it's a
107 | good idea to make sure these packages can be updated periodically. The
108 | update cache workflow will do this by checking for updates, applying them in a
109 | branch called `update/packages` and creating a pull request with _only the
110 | lockfile changed_.
111 |
112 | From here, the markdown documents will be rebuilt and you can inspect what has
113 | changed based on how the packages have updated.
114 |
115 | ## Pull Request and Review Management
116 |
117 | Because our lessons execute code, pull requests are a security risk for any
118 | lesson and thus have security measures associated with them. **Do not merge any
119 | pull requests that do not pass checks or that the bots have not commented on.**
120 |
121 | This series of workflows all go together and are described in the following
122 | diagram and the below sections:
123 |
124 | 
125 |
126 | ### Pre Flight Pull Request Validation (pr-preflight.yaml)
127 |
128 | This workflow runs every time a pull request is created and its purpose is to
129 | validate that the pull request is okay to run. This means the following things:
130 |
131 | 1. The pull request does not contain modified workflow files
132 | 2. If the pull request contains modified workflow files, it does not contain
133 | modified content files (such as a situation where @carpentries-bot will
134 | make an automated pull request)
135 | 3. The pull request does not contain an invalid commit hash (e.g. from a fork
136 | that was made before a lesson was transitioned from styles to use the
137 | workbench).
138 |
139 | Once the checks are finished, a comment is issued to the pull request, which
140 | will allow maintainers to determine if it is safe to run the
141 | "Receive Pull Request" workflow from new contributors.
142 |
143 | ### Receive Pull Request (pr-receive.yaml)
144 |
145 | **Note of caution:** This workflow runs arbitrary code by anyone who creates a
146 | pull request. GitHub has safeguarded the token used in this workflow to have no
147 | privileges in the repository, but we have taken precautions to protect against
148 | spoofing.
149 |
150 | This workflow is triggered with every push to a pull request. If this workflow
151 | is already running and a new push is sent to the pull request, the workflow
152 | running from the previous push will be cancelled and a new workflow run will be
153 | started.
154 |
155 | The first step of this workflow is to check if it is valid (e.g. that no
156 | workflow files have been modified). If there are workflow files that have been
157 | modified, a comment is made that indicates that the workflow is not run. If
158 | both a workflow file and lesson content are modified, an error will occur.
159 |
160 | The second step (if valid) is to build the generated content from the pull
161 | request. This builds the content and uploads three artifacts:
162 |
163 | 1. The pull request number (pr)
164 | 2. A summary of changes after the rendering process (diff)
165 | 3. The rendered files (build)
166 |
167 | Because this workflow builds generated content, it follows the same general
168 | process as the `sandpaper-main` workflow with the same caching mechanisms.
169 |
170 | The artifacts produced are used by the next workflow.
171 |
172 | ### Comment on Pull Request (pr-comment.yaml)
173 |
174 | This workflow is triggered if the `pr-receive.yaml` workflow is successful.
175 | The steps in this workflow are:
176 |
177 | 1. Test if the workflow is valid and comment the validity of the workflow to the
178 | pull request.
179 | 2. If it is valid: create an orphan branch with two commits: the current state
180 | of the repository and the proposed changes.
181 | 3. If it is valid: update the pull request comment with the summary of changes
182 |
183 | Importantly: if the pull request is invalid, the branch is not created so any
184 | malicious code is not published.
185 |
186 | From here, the maintainer can request changes from the author and eventually
187 | either merge or reject the PR. When this happens, if the PR was valid, the
188 | preview branch needs to be deleted.
189 |
190 | ### Send Close PR Signal (pr-close-signal.yaml)
191 |
192 | Triggered any time a pull request is closed. This emits an artifact that is the
193 | pull request number for the next action.
194 |
195 | ### Remove Pull Request Branch (pr-post-remove-branch.yaml)
196 |
197 | Triggered by `pr-close-signal.yaml`. This removes the temporary branch associated with
198 | the pull request (if it was created).
199 |
--------------------------------------------------------------------------------
/.github/workflows/pr-close-signal.yaml:
--------------------------------------------------------------------------------
1 | name: "Bot: Send Close Pull Request Signal"
2 |
3 | on:
4 | pull_request:
5 | types:
6 | [closed]
7 |
8 | jobs:
9 | send-close-signal:
10 | name: "Send closing signal"
11 | runs-on: ubuntu-22.04
12 | if: ${{ github.event.action == 'closed' }}
13 | steps:
14 | - name: "Create PRtifact"
15 | run: |
16 | mkdir -p ./pr
17 | printf ${{ github.event.number }} > ./pr/NUM
18 | - name: Upload Diff
19 | uses: actions/upload-artifact@v4
20 | with:
21 | name: pr
22 | path: ./pr
23 |
--------------------------------------------------------------------------------
/.github/workflows/pr-comment.yaml:
--------------------------------------------------------------------------------
1 | name: "Bot: Comment on the Pull Request"
2 |
3 | # read-write repo token
4 | # access to secrets
5 | on:
6 | workflow_run:
7 | workflows: ["Receive Pull Request"]
8 | types:
9 | - completed
10 |
11 | concurrency:
12 | group: pr-${{ github.event.workflow_run.pull_requests[0].number }}
13 | cancel-in-progress: true
14 |
15 |
16 | jobs:
17 | # Pull requests are valid if:
18 | # - they match the sha of the workflow run head commit
19 | # - they are open
20 | # - no .github files were committed
21 | test-pr:
22 | name: "Test if pull request is valid"
23 | runs-on: ubuntu-22.04
24 | if: >
25 | github.event.workflow_run.event == 'pull_request' &&
26 | github.event.workflow_run.conclusion == 'success'
27 | outputs:
28 | is_valid: ${{ steps.check-pr.outputs.VALID }}
29 | payload: ${{ steps.check-pr.outputs.payload }}
30 | number: ${{ steps.get-pr.outputs.NUM }}
31 | msg: ${{ steps.check-pr.outputs.MSG }}
32 | steps:
33 | - name: 'Download PR artifact'
34 | id: dl
35 | uses: carpentries/actions/download-workflow-artifact@main
36 | with:
37 | run: ${{ github.event.workflow_run.id }}
38 | name: 'pr'
39 |
40 | - name: "Get PR Number"
41 | if: ${{ steps.dl.outputs.success == 'true' }}
42 | id: get-pr
43 | run: |
44 | unzip pr.zip
45 | echo "NUM=$(<./NR)" >> $GITHUB_OUTPUT
46 |
47 | - name: "Fail if PR number was not present"
48 | id: bad-pr
49 | if: ${{ steps.dl.outputs.success != 'true' }}
50 | run: |
51 | echo '::error::A pull request number was not recorded. The pull request that triggered this workflow is likely malicious.'
52 | exit 1
53 | - name: "Get Invalid Hashes File"
54 | id: hash
55 | run: |
56 | echo "json<> $GITHUB_OUTPUT
59 | - name: "Check PR"
60 | id: check-pr
61 | if: ${{ steps.dl.outputs.success == 'true' }}
62 | uses: carpentries/actions/check-valid-pr@main
63 | with:
64 | pr: ${{ steps.get-pr.outputs.NUM }}
65 | sha: ${{ github.event.workflow_run.head_sha }}
66 | headroom: 3 # if it's within the last three commits, we can keep going, because it's likely rapid-fire
67 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
68 | fail_on_error: true
69 |
70 | # Create an orphan branch on this repository with two commits
71 | # - the current HEAD of the md-outputs branch
72 | # - the output from running the current HEAD of the pull request through
73 | # the md generator
74 | create-branch:
75 | name: "Create Git Branch"
76 | needs: test-pr
77 | runs-on: ubuntu-22.04
78 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
79 | env:
80 | NR: ${{ needs.test-pr.outputs.number }}
81 | permissions:
82 | contents: write
83 | steps:
84 | - name: 'Checkout md outputs'
85 | uses: actions/checkout@v4
86 | with:
87 | ref: md-outputs
88 | path: built
89 | fetch-depth: 1
90 |
91 | - name: 'Download built markdown'
92 | id: dl
93 | uses: carpentries/actions/download-workflow-artifact@main
94 | with:
95 | run: ${{ github.event.workflow_run.id }}
96 | name: 'built'
97 |
98 | - if: ${{ steps.dl.outputs.success == 'true' }}
99 | run: unzip built.zip
100 |
101 | - name: "Create orphan and push"
102 | if: ${{ steps.dl.outputs.success == 'true' }}
103 | run: |
104 | cd built/
105 | git config --local user.email "actions@github.com"
106 | git config --local user.name "GitHub Actions"
107 | CURR_HEAD=$(git rev-parse HEAD)
108 | git checkout --orphan md-outputs-PR-${NR}
109 | git add -A
110 | git commit -m "source commit: ${CURR_HEAD}"
111 | ls -A | grep -v '^.git$' | xargs -I _ rm -r '_'
112 | cd ..
113 | unzip -o -d built built.zip
114 | cd built
115 | git add -A
116 | git commit --allow-empty -m "differences for PR #${NR}"
117 | git push -u --force --set-upstream origin md-outputs-PR-${NR}
118 |
119 | # Comment on the Pull Request with a link to the branch and the diff
120 | comment-pr:
121 | name: "Comment on Pull Request"
122 | needs: [test-pr, create-branch]
123 | runs-on: ubuntu-22.04
124 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
125 | env:
126 | NR: ${{ needs.test-pr.outputs.number }}
127 | permissions:
128 | pull-requests: write
129 | steps:
130 | - name: 'Download comment artifact'
131 | id: dl
132 | uses: carpentries/actions/download-workflow-artifact@main
133 | with:
134 | run: ${{ github.event.workflow_run.id }}
135 | name: 'diff'
136 |
137 | - if: ${{ steps.dl.outputs.success == 'true' }}
138 | run: unzip ${{ github.workspace }}/diff.zip
139 |
140 | - name: "Comment on PR"
141 | id: comment-diff
142 | if: ${{ steps.dl.outputs.success == 'true' }}
143 | uses: carpentries/actions/comment-diff@main
144 | with:
145 | pr: ${{ env.NR }}
146 | path: ${{ github.workspace }}/diff.md
147 |
148 | # Comment if the PR is open and matches the SHA, but the workflow files have
149 | # changed
150 | comment-changed-workflow:
151 | name: "Comment if workflow files have changed"
152 | needs: test-pr
153 | runs-on: ubuntu-22.04
154 | if: ${{ always() && needs.test-pr.outputs.is_valid == 'false' }}
155 | env:
156 | NR: ${{ github.event.workflow_run.pull_requests[0].number }}
157 | body: ${{ needs.test-pr.outputs.msg }}
158 | permissions:
159 | pull-requests: write
160 | steps:
161 | - name: 'Check for spoofing'
162 | id: dl
163 | uses: carpentries/actions/download-workflow-artifact@main
164 | with:
165 | run: ${{ github.event.workflow_run.id }}
166 | name: 'built'
167 |
168 | - name: 'Alert if spoofed'
169 | id: spoof
170 | if: ${{ steps.dl.outputs.success == 'true' }}
171 | run: |
172 | echo 'body<<EOF' >> $GITHUB_ENV
173 | echo '' >> $GITHUB_ENV
174 | echo '## :x: DANGER :x:' >> $GITHUB_ENV
175 | echo 'This pull request has modified workflows that created output. Close this now.' >> $GITHUB_ENV
176 | echo '' >> $GITHUB_ENV
177 | echo 'EOF' >> $GITHUB_ENV
178 |
179 | - name: "Comment on PR"
180 | id: comment-diff
181 | uses: carpentries/actions/comment-diff@main
182 | with:
183 | pr: ${{ env.NR }}
184 | body: ${{ env.body }}
185 |
--------------------------------------------------------------------------------
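The "Alert if spoofed" step above builds a multi-line value for `$GITHUB_ENV` that is closed by an `EOF` line. GitHub Actions accepts multi-line values in the form `name<<DELIMITER`, the value lines, then the delimiter on its own line. The following is a minimal sketch of that pattern, not the workflow's own code: the helper `write_multiline` and the temporary file standing in for `$GITHUB_ENV` are assumptions made so the example can run outside of a workflow.

```bash
#!/usr/bin/env bash
# Stand-in for the file that GitHub Actions exposes as $GITHUB_ENV
# (assumption: a temporary file, so this runs outside of a workflow).
env_file="$(mktemp)"

# Hypothetical helper: append a multi-line variable to an environment file
# using the name<<EOF ... EOF form.
write_multiline() {
  local name="$1" value="$2" file="$3"
  {
    printf '%s<<EOF\n' "$name"   # opening line: variable name and delimiter
    printf '%s\n' "$value"       # the value, which may span several lines
    printf 'EOF\n'               # closing delimiter on its own line
  } >> "$file"
}

write_multiline body "$(printf '## :x: DANGER :x:\nClose this now.')" "$env_file"
cat "$env_file"
```

The delimiter must not appear on a line by itself inside the value, otherwise the value is cut short at that line.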
/.github/workflows/pr-post-remove-branch.yaml:
--------------------------------------------------------------------------------
1 | name: "Bot: Remove Temporary PR Branch"
2 |
3 | on:
4 | workflow_run:
5 | workflows: ["Bot: Send Close Pull Request Signal"]
6 | types:
7 | - completed
8 |
9 | jobs:
10 | delete:
11 | name: "Delete branch from Pull Request"
12 | runs-on: ubuntu-22.04
13 | if: >
14 | github.event.workflow_run.event == 'pull_request' &&
15 | github.event.workflow_run.conclusion == 'success'
16 | permissions:
17 | contents: write
18 | steps:
19 | - name: 'Download artifact'
20 | uses: carpentries/actions/download-workflow-artifact@main
21 | with:
22 | run: ${{ github.event.workflow_run.id }}
23 | name: pr
24 | - name: "Get PR Number"
25 | id: get-pr
26 | run: |
27 | unzip pr.zip
28 | echo "NUM=$(<./NUM)" >> $GITHUB_OUTPUT
29 | - name: 'Remove branch'
30 | uses: carpentries/actions/remove-branch@main
31 | with:
32 | pr: ${{ steps.get-pr.outputs.NUM }}
33 |
--------------------------------------------------------------------------------
/.github/workflows/pr-preflight.yaml:
--------------------------------------------------------------------------------
1 | name: "Pull Request Preflight Check"
2 |
3 | on:
4 | pull_request_target:
5 | branches:
6 | ["main"]
7 | types:
8 | ["opened", "synchronize", "reopened"]
9 |
10 | jobs:
11 | test-pr:
12 | name: "Test if pull request is valid"
13 | if: ${{ github.event.action != 'closed' }}
14 | runs-on: ubuntu-22.04
15 | outputs:
16 | is_valid: ${{ steps.check-pr.outputs.VALID }}
17 | permissions:
18 | pull-requests: write
19 | steps:
20 | - name: "Get Invalid Hashes File"
21 | id: hash
22 | run: |
23 | echo "json<> $GITHUB_OUTPUT
26 | - name: "Check PR"
27 | id: check-pr
28 | uses: carpentries/actions/check-valid-pr@main
29 | with:
30 | pr: ${{ github.event.number }}
31 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
32 | fail_on_error: true
33 | - name: "Comment result of validation"
34 | id: comment-diff
35 | if: ${{ always() }}
36 | uses: carpentries/actions/comment-diff@main
37 | with:
38 | pr: ${{ github.event.number }}
39 | body: ${{ steps.check-pr.outputs.MSG }}
40 |
--------------------------------------------------------------------------------
/.github/workflows/pr-receive.yaml:
--------------------------------------------------------------------------------
1 | name: "Receive Pull Request"
2 |
3 | on:
4 | pull_request:
5 | types:
6 | [opened, synchronize, reopened]
7 |
8 | concurrency:
9 | group: ${{ github.ref }}
10 | cancel-in-progress: true
11 |
12 | jobs:
13 | test-pr:
14 | name: "Record PR number"
15 | if: ${{ github.event.action != 'closed' }}
16 | runs-on: ubuntu-22.04
17 | outputs:
18 | is_valid: ${{ steps.check-pr.outputs.VALID }}
19 | steps:
20 | - name: "Record PR number"
21 | id: record
22 | if: ${{ always() }}
23 | run: |
24 | echo ${{ github.event.number }} > ${{ github.workspace }}/NR # 2022-03-02: artifact name fixed to be NR
25 | - name: "Upload PR number"
26 | id: upload
27 | if: ${{ always() }}
28 | uses: actions/upload-artifact@v4
29 | with:
30 | name: pr
31 | path: ${{ github.workspace }}/NR
32 | - name: "Get Invalid Hashes File"
33 | id: hash
34 | run: |
35 | echo "json<> $GITHUB_OUTPUT
38 | - name: "echo output"
39 | run: |
40 | echo "${{ steps.hash.outputs.json }}"
41 | - name: "Check PR"
42 | id: check-pr
43 | uses: carpentries/actions/check-valid-pr@main
44 | with:
45 | pr: ${{ github.event.number }}
46 | invalid: ${{ fromJSON(steps.hash.outputs.json)[github.repository] }}
47 |
48 | build-md-source:
49 | name: "Build markdown source files if valid"
50 | needs: test-pr
51 | runs-on: ubuntu-22.04
52 | if: ${{ needs.test-pr.outputs.is_valid == 'true' }}
53 | env:
54 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
55 | RENV_PATHS_ROOT: ~/.local/share/renv/
56 | CHIVE: ${{ github.workspace }}/site/chive
57 | PR: ${{ github.workspace }}/site/pr
58 | MD: ${{ github.workspace }}/site/built
59 | steps:
60 | - name: "Check Out Main Branch"
61 | uses: actions/checkout@v4
62 |
63 | - name: "Check Out Staging Branch"
64 | uses: actions/checkout@v4
65 | with:
66 | ref: md-outputs
67 | path: ${{ env.MD }}
68 |
69 | - name: "Set up R"
70 | uses: r-lib/actions/setup-r@v2
71 | with:
72 | use-public-rspm: true
73 | install-r: false
74 |
75 | - name: "Set up Pandoc"
76 | uses: r-lib/actions/setup-pandoc@v2
77 |
78 | - name: "Setup Lesson Engine"
79 | uses: carpentries/actions/setup-sandpaper@main
80 | with:
81 | cache-version: ${{ secrets.CACHE_VERSION }}
82 |
83 | - name: "Setup Package Cache"
84 | uses: carpentries/actions/setup-lesson-deps@main
85 | with:
86 | cache-version: ${{ secrets.CACHE_VERSION }}
87 |
88 | - name: "Validate and Build Markdown"
89 | id: build-site
90 | run: |
91 | sandpaper::package_cache_trigger(TRUE)
92 | sandpaper::validate_lesson(path = '${{ github.workspace }}')
93 | sandpaper:::build_markdown(path = '${{ github.workspace }}', quiet = FALSE)
94 | shell: Rscript {0}
95 |
96 | - name: "Generate Artifacts"
97 | id: generate-artifacts
98 | run: |
99 | sandpaper:::ci_bundle_pr_artifacts(
100 | repo = '${{ github.repository }}',
101 | pr_number = '${{ github.event.number }}',
102 | path_md = '${{ env.MD }}',
103 | path_pr = '${{ env.PR }}',
104 | path_archive = '${{ env.CHIVE }}',
105 | branch = 'md-outputs'
106 | )
107 | shell: Rscript {0}
108 |
109 | - name: "Upload PR"
110 | uses: actions/upload-artifact@v4
111 | with:
112 | name: pr
113 | path: ${{ env.PR }}
114 | overwrite: true
115 |
116 | - name: "Upload Diff"
117 | uses: actions/upload-artifact@v4
118 | with:
119 | name: diff
120 | path: ${{ env.CHIVE }}
121 | retention-days: 1
122 |
123 | - name: "Upload Build"
124 | uses: actions/upload-artifact@v4
125 | with:
126 | name: built
127 | path: ${{ env.MD }}
128 | retention-days: 1
129 |
130 | - name: "Teardown"
131 | run: sandpaper::reset_site()
132 | shell: Rscript {0}
133 |
--------------------------------------------------------------------------------
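The "Record PR number" step above writes the pull request number to a file named `NR`, and the comment workflow later reads it back with `echo "NUM=$(<./NR)"`. The following is a small sketch of that hand-off under stated assumptions: the function names and the temporary directory are hypothetical, the number 42 is an arbitrary example, and `cat` is used in place of bash's `$(<file)` shorthand, which reads the file the same way.

```bash
#!/usr/bin/env bash
# Temporary directory standing in for the artifact contents (assumption).
workdir="$(mktemp -d)"

# Hypothetical helper mirroring the producer side: store the PR number in NR.
record_pr_number() {
  echo "$1" > "$2/NR"
}

# Hypothetical helper mirroring the consumer side: emit a NUM=<number> line,
# the key=value shape that is appended to $GITHUB_OUTPUT in the workflow.
read_pr_number() {
  echo "NUM=$(cat "$1/NR")"
}

record_pr_number 42 "$workdir"
read_pr_number "$workdir"   # prints NUM=42
```

Command substitution strips the trailing newline from the file, so the resulting line carries only the number.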
/.github/workflows/sandpaper-main.yaml:
--------------------------------------------------------------------------------
1 | name: "01 Build and Deploy Site"
2 |
3 | on:
4 | push:
5 | branches:
6 | - main
7 | - master
8 | schedule:
9 | - cron: '0 0 * * 2'
10 | workflow_dispatch:
11 | inputs:
12 | name:
13 | description: 'Who triggered this build?'
14 | required: true
15 | default: 'Maintainer (via GitHub)'
16 | reset:
17 | description: 'Reset cached markdown files'
18 | required: false
19 | default: false
20 | type: boolean
21 | jobs:
22 | full-build:
23 | name: "Build Full Site"
24 |
25 | # 2024-10-01: ubuntu-latest is now 24.04 and R is not installed by default in the runner image
26 | # pin to 22.04 for now
27 | runs-on: ubuntu-22.04
28 | permissions:
29 | checks: write
30 | contents: write
31 | pages: write
32 | env:
33 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
34 | RENV_PATHS_ROOT: ~/.local/share/renv/
35 | steps:
36 |
37 | - name: "Checkout Lesson"
38 | uses: actions/checkout@v4
39 |
40 | - name: "Set up R"
41 | uses: r-lib/actions/setup-r@v2
42 | with:
43 | use-public-rspm: true
44 | install-r: false
45 |
46 | - name: "Set up Pandoc"
47 | uses: r-lib/actions/setup-pandoc@v2
48 |
49 | - name: "Setup Lesson Engine"
50 | uses: carpentries/actions/setup-sandpaper@main
51 | with:
52 | cache-version: ${{ secrets.CACHE_VERSION }}
53 |
54 | - name: "Setup Package Cache"
55 | uses: carpentries/actions/setup-lesson-deps@main
56 | with:
57 | cache-version: ${{ secrets.CACHE_VERSION }}
58 |
59 | - name: "Deploy Site"
60 | run: |
61 | reset <- "${{ github.event.inputs.reset }}" == "true"
62 | sandpaper::package_cache_trigger(TRUE)
63 | sandpaper:::ci_deploy(reset = reset)
64 | shell: Rscript {0}
65 |
--------------------------------------------------------------------------------
/.github/workflows/sandpaper-version.txt:
--------------------------------------------------------------------------------
1 | 0.16.12
2 |
--------------------------------------------------------------------------------
/.github/workflows/update-cache.yaml:
--------------------------------------------------------------------------------
1 | name: "03 Maintain: Update Package Cache"
2 |
3 | on:
4 | workflow_dispatch:
5 | inputs:
6 | name:
7 | description: 'Who triggered this build (enter github username to tag yourself)?'
8 | required: true
9 | default: 'monthly run'
10 | schedule:
11 | # Run every Tuesday
12 | - cron: '0 0 * * 2'
13 |
14 | jobs:
15 | preflight:
16 | name: "Preflight Check"
17 | runs-on: ubuntu-22.04
18 | outputs:
19 | ok: ${{ steps.check.outputs.ok }}
20 | steps:
21 | - id: check
22 | run: |
23 | if [[ ${{ github.event_name }} == 'workflow_dispatch' ]]; then
24 | echo "ok=true" >> $GITHUB_OUTPUT
25 | echo "Running on request"
26 | # using single brackets here to avoid 08 being interpreted as octal
27 | # https://github.com/carpentries/sandpaper/issues/250
28 | elif [ `date +%d` -le 7 ]; then
29 | # If the Tuesday lands in the first week of the month, run it
30 | echo "ok=true" >> $GITHUB_OUTPUT
31 | echo "Running on schedule"
32 | else
33 | echo "ok=false" >> $GITHUB_OUTPUT
34 | echo "Not Running Today"
35 | fi
36 |
37 | check_renv:
38 | name: "Check if We Need {renv}"
39 | runs-on: ubuntu-22.04
40 | needs: preflight
41 | if: ${{ needs.preflight.outputs.ok == 'true'}}
42 | outputs:
43 | needed: ${{ steps.renv.outputs.exists }}
44 | steps:
45 | - name: "Checkout Lesson"
46 | uses: actions/checkout@v4
47 | - id: renv
48 | run: |
49 | if [[ -d renv ]]; then
50 | echo "exists=true" >> $GITHUB_OUTPUT
51 | fi
52 |
53 | check_token:
54 | name: "Check SANDPAPER_WORKFLOW token"
55 | runs-on: ubuntu-22.04
56 | needs: check_renv
57 | if: ${{ needs.check_renv.outputs.needed == 'true' }}
58 | outputs:
59 | workflow: ${{ steps.validate.outputs.wf }}
60 | repo: ${{ steps.validate.outputs.repo }}
61 | steps:
62 | - name: "validate token"
63 | id: validate
64 | uses: carpentries/actions/check-valid-credentials@main
65 | with:
66 | token: ${{ secrets.SANDPAPER_WORKFLOW }}
67 |
68 | update_cache:
69 | name: "Update Package Cache"
70 | needs: check_token
71 | if: ${{ needs.check_token.outputs.repo== 'true' }}
72 | runs-on: ubuntu-22.04
73 | env:
74 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
75 | RENV_PATHS_ROOT: ~/.local/share/renv/
76 | steps:
77 |
78 | - name: "Checkout Lesson"
79 | uses: actions/checkout@v4
80 |
81 | - name: "Set up R"
82 | uses: r-lib/actions/setup-r@v2
83 | with:
84 | use-public-rspm: true
85 | install-r: false
86 |
87 | - name: "Update {renv} deps and determine if a PR is needed"
88 | id: update
89 | uses: carpentries/actions/update-lockfile@main
90 | with:
91 | cache-version: ${{ secrets.CACHE_VERSION }}
92 |
93 | - name: Create Pull Request
94 | id: cpr
95 | if: ${{ steps.update.outputs.n > 0 }}
96 | uses: carpentries/create-pull-request@main
97 | with:
98 | token: ${{ secrets.SANDPAPER_WORKFLOW }}
99 | delete-branch: true
100 | branch: "update/packages"
101 | commit-message: "[actions] update ${{ steps.update.outputs.n }} packages"
102 | title: "Update ${{ steps.update.outputs.n }} packages"
103 | body: |
104 | :robot: This is an automated build
105 |
106 | This will update ${{ steps.update.outputs.n }} packages in your lesson with the following versions:
107 |
108 | ```
109 | ${{ steps.update.outputs.report }}
110 | ```
111 |
112 | :stopwatch: In a few minutes, a comment will appear that will show you how the output has changed based on these updates.
113 |
114 | If you want to inspect these changes locally, you can use the following code to check out a new branch:
115 |
116 | ```bash
117 | git fetch origin update/packages
118 | git checkout update/packages
119 | ```
120 |
121 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
122 |
123 | [1]: https://github.com/carpentries/create-pull-request/tree/main
124 | labels: "type: package cache"
125 | draft: false
126 |
--------------------------------------------------------------------------------
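The preflight job above runs on every Tuesday but only continues when the day of the month is 7 or less, which selects the first Tuesday of the month. Its comment notes that single brackets are used so that a zero-padded day such as `08` is not interpreted as octal. The following is a minimal sketch of that check; the function `is_first_week` is a hypothetical name introduced for illustration and does not appear in the workflow.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a zero-padded day of the month
# (as printed by `date +%d`) falls within the first seven days.
is_first_week() {
  local day="$1"
  # Single brackets compare the operands as decimal integers, so "08" and
  # "09" are valid. Inside [[ ... ]] bash evaluates the operands as
  # arithmetic expressions, where a leading zero marks an octal literal
  # and 08 is rejected as an invalid number.
  if [ "$day" -le 7 ]; then
    echo "true"
  else
    echo "false"
  fi
}

is_first_week "$(date +%d)"   # result depends on today's date
is_first_week 07              # prints true
is_first_week 08              # prints false
```

Because the schedule fires once a week, exactly one scheduled run per month lands on a day from 1 to 7, which is why the manual dispatch default is labelled a monthly run.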
/.github/workflows/update-workflows.yaml:
--------------------------------------------------------------------------------
1 | name: "02 Maintain: Update Workflow Files"
2 |
3 | on:
4 | workflow_dispatch:
5 | inputs:
6 | name:
7 | description: 'Who triggered this build (enter github username to tag yourself)?'
8 | required: true
9 | default: 'weekly run'
10 | clean:
11 | description: 'Workflow files/file extensions to clean (no wildcards, enter "" for none)'
12 | required: false
13 | default: '.yaml'
14 | schedule:
15 | # Run every Tuesday
16 | - cron: '0 0 * * 2'
17 |
18 | jobs:
19 | check_token:
20 | name: "Check SANDPAPER_WORKFLOW token"
21 | runs-on: ubuntu-22.04
22 | outputs:
23 | workflow: ${{ steps.validate.outputs.wf }}
24 | repo: ${{ steps.validate.outputs.repo }}
25 | steps:
26 | - name: "validate token"
27 | id: validate
28 | uses: carpentries/actions/check-valid-credentials@main
29 | with:
30 | token: ${{ secrets.SANDPAPER_WORKFLOW }}
31 |
32 | update_workflow:
33 | name: "Update Workflow"
34 | runs-on: ubuntu-22.04
35 | needs: check_token
36 | if: ${{ needs.check_token.outputs.workflow == 'true' }}
37 | steps:
38 | - name: "Checkout Repository"
39 | uses: actions/checkout@v4
40 |
41 | - name: Update Workflows
42 | id: update
43 | uses: carpentries/actions/update-workflows@main
44 | with:
45 | clean: ${{ github.event.inputs.clean }}
46 |
47 | - name: Create Pull Request
48 | id: cpr
49 | if: "${{ steps.update.outputs.new }}"
50 | uses: carpentries/create-pull-request@main
51 | with:
52 | token: ${{ secrets.SANDPAPER_WORKFLOW }}
53 | delete-branch: true
54 | branch: "update/workflows"
55 | commit-message: "[actions] update sandpaper workflow to version ${{ steps.update.outputs.new }}"
56 | title: "Update Workflows to Version ${{ steps.update.outputs.new }}"
57 | body: |
58 | :robot: This is an automated build
59 |
60 | Update Workflows from sandpaper version ${{ steps.update.outputs.old }} -> ${{ steps.update.outputs.new }}
61 |
62 | - Auto-generated by [create-pull-request][1] on ${{ steps.update.outputs.date }}
63 |
64 | [1]: https://github.com/carpentries/create-pull-request/tree/main
65 | labels: "type: template and tools"
66 | draft: false
67 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # sandpaper files
2 | episodes/*html
3 | site/*
4 | !site/README.md
5 | *.Rproj
6 |
7 | # History files
8 | .Rhistory
9 | .Rapp.history
10 | # Session Data files
11 | .RData
12 | # User-specific files
13 | .Ruserdata
14 | # Example code in package build process
15 | *-Ex.R
16 | # Output files from R CMD build
17 | /*.tar.gz
18 | # Output files from R CMD check
19 | /*.Rcheck/
20 | # RStudio files
21 | .Rproj.user/
22 | # produced vignettes
23 | vignettes/*.html
24 | vignettes/*.pdf
25 | # OAuth2 token, see https://github.com/hadley/httr/releases/tag/v0.3
26 | .httr-oauth
27 | # knitr and R markdown default cache directories
28 | *_cache/
29 | /cache/
30 | # Temporary files created by R markdown
31 | *.utf8.md
32 | *.knit.md
33 | # R Environment Variables
34 | .Renviron
35 | # pkgdown site
36 | docs/
37 | # translation temp files
38 | po/*~
39 | # renv detritus
40 | renv/sandbox/
41 | *.pyc
42 | *~
43 | .DS_Store
44 | .ipynb_checkpoints
45 | .sass-cache
46 | .jekyll-cache/
47 | .jekyll-metadata
48 | __pycache__
49 | _site
50 | .Rproj.user
51 | .bundle/
52 | .vendor/
53 | vendor/
54 | .docker-vendor/
55 | Gemfile.lock
56 | .*history
57 |
--------------------------------------------------------------------------------
/.zenodo.json:
--------------------------------------------------------------------------------
1 | {
2 | "contributors": [
3 | {
4 | "type": "Editor",
5 | "name": "Trevor Burrows"
6 | }
7 | ],
8 | "creators": [
9 | {
10 | "name": "Christopher Prener"
11 | },
12 | {
13 | "name": "Trevor Burrows"
14 | },
15 | {
16 | "name": "bkmgit"
17 | },
18 | {
19 | "name": "Mara Sedlins"
20 | },
21 | {
22 | "name": "AliNite"
23 | },
24 | {
25 | "name": "Scott Carl Peterson",
26 | "orcid": "0000-0002-1920-616X"
27 | },
28 | {
29 | "name": "Angelique Trusler",
30 | "orcid": "0000-0003-2340-8538"
31 | },
32 | {
33 | "name": "Annajiat Alim Rasel",
34 | "orcid": "0000-0003-0198-3734"
35 | },
36 | {
37 | "name": "Maneesha Sane"
38 | },
39 | {
40 | "name": "Peter Bugeia"
41 | },
42 | {
43 | "name": "Phil Reed",
44 | "orcid": "0000-0002-4479-715X"
45 | },
46 | {
47 | "name": "Angela Li",
48 | "orcid": "0000-0002-8956-419X"
49 | },
50 | {
51 | "name": "Claudiu Forgaci",
52 | "orcid": "0000-0003-3218-5102"
53 | },
54 | {
55 | "name": "Dafne Erica van Kuppevelt",
56 | "orcid": "0000-0002-2662-1994"
57 | },
58 | {
59 | "name": "Emily Ferrier"
60 | },
61 | {
62 | "name": "Fran Baseby"
63 | },
64 | {
65 | "name": "Katherine E. Koziar",
66 | "orcid": "0000-0003-0505-7973"
67 | },
68 | {
69 | "name": "Kunal Marwaha",
70 | "orcid": "0000-0001-9084-6971"
71 | },
72 | {
73 | "name": "Naoe Tatara",
74 | "orcid": "0000-0002-0049-1634"
75 | },
76 | {
77 | "name": "Nathaniel Porter"
78 | },
79 | {
80 | "name": "Sarah M Brown",
81 | "orcid": "0000-0001-5728-0822"
82 | },
83 | {
84 | "name": "Shiobhan Smith",
85 | "orcid": "0000-0003-1738-9836"
86 | },
87 | {
88 | "name": "Tugba Ozturk"
89 | },
90 | {
91 | "name": "ecparke-utm"
92 | },
93 | {
94 | "name": "geyslein"
95 | },
96 | {
97 | "name": "marksnyders"
98 | },
99 | {
100 | "name": "mlbecher"
101 | }
102 | ],
103 | "license": {
104 | "id": "CC-BY-4.0"
105 | }
106 | }
107 |
--------------------------------------------------------------------------------
/AUTHORS:
--------------------------------------------------------------------------------
1 | FIXME: list authors' names and email addresses.
2 |
--------------------------------------------------------------------------------
/CITATION:
--------------------------------------------------------------------------------
1 | FIXME: describe how to cite this lesson.
2 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Contributor Code of Conduct"
3 | ---
4 |
5 | As contributors and maintainers of this project,
6 | we pledge to follow [The Carpentries Code of Conduct][coc].
7 |
8 | Instances of abusive, harassing, or otherwise unacceptable behavior
9 | may be reported by following our [reporting guidelines][coc-reporting].
10 |
11 |
12 | [coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html
13 | [coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html
14 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | ## Contributing
2 |
3 | [The Carpentries][cp-site] ([Software Carpentry][swc-site], [Data
4 | Carpentry][dc-site], and [Library Carpentry][lc-site]) are open source
5 | projects, and we welcome contributions of all kinds: new lessons, fixes to
6 | existing material, bug reports, and reviews of proposed changes are all
7 | welcome.
8 |
9 | ### Contributor Agreement
10 |
11 | By contributing, you agree that we may redistribute your work under [our
12 | license](LICENSE.md). In exchange, we will address your issues and/or assess
13 | your change proposal as promptly as we can, and help you become a member of our
14 | community. Everyone involved in [The Carpentries][cp-site] agrees to abide by
15 | our [code of conduct](CODE_OF_CONDUCT.md).
16 |
17 | ### How to Contribute
18 |
19 | The easiest way to get started is to file an issue to tell us about a spelling
20 | mistake, some awkward wording, or a factual error. This is a good way to
21 | introduce yourself and to meet some of our community members.
22 |
23 | 1. If you do not have a [GitHub][github] account, you can [send us comments by
24 | email][contact]. However, we will be able to respond more quickly if you use
25 | one of the other methods described below.
26 |
27 | 2. If you have a [GitHub][github] account, or are willing to [create
28 | one][github-join], but do not know how to use Git, you can report problems
29 | or suggest improvements by [creating an issue][issues]. This allows us to
30 | assign the item to someone and to respond to it in a threaded discussion.
31 |
32 | 3. If you are comfortable with Git, and would like to add or change material,
33 | you can submit a pull request (PR). Instructions for doing this are
34 | [included below](#using-github).
35 |
36 | Note: if you want to build the website locally, please refer to [The Workbench
37 | documentation][template-doc].
38 |
39 | ### Where to Contribute
40 |
41 | 1. If you wish to change this lesson, add issues and pull requests here.
42 | 2. If you wish to change the template used for workshop websites, please refer
43 | to [The Workbench documentation][template-doc].
44 |
45 |
46 | ### What to Contribute
47 |
48 | There are many ways to contribute, from writing new exercises and improving
49 | existing ones to updating or filling in the documentation and submitting [bug
50 | reports][issues] about things that do not work, are not clear, or are missing.
51 | If you are looking for ideas, please see [the list of issues for this
52 | repository][repo], or the issues for [Data Carpentry][dc-issues], [Library
53 | Carpentry][lc-issues], and [Software Carpentry][swc-issues] projects.
54 |
55 | Comments on issues and reviews of pull requests are just as welcome: we are
56 | smarter together than we are on our own. **Reviews from novices and newcomers
57 | are particularly valuable**: it's easy for people who have been using these
58 | lessons for a while to forget how impenetrable some of this material can be, so
59 | fresh eyes are always welcome.
60 |
61 | ### What *Not* to Contribute
62 |
63 | Our lessons already contain more material than we can cover in a typical
64 | workshop, so we are usually *not* looking for more concepts or tools to add to
65 | them. As a rule, if you want to introduce a new idea, you must (a) estimate how
66 | long it will take to teach and (b) explain what you would take out to make room
67 | for it. The first encourages contributors to be honest about requirements; the
68 | second, to think hard about priorities.
69 |
70 | We are also not looking for exercises or other material that only run on one
71 | platform. Our workshops typically contain a mixture of Windows, macOS, and
72 | Linux users; in order to be usable, our lessons must run equally well on all
73 | three.
74 |
75 | ### Using GitHub
76 |
77 | If you choose to contribute via GitHub, you may want to look at [How to
78 | Contribute to an Open Source Project on GitHub](https://egghead.io/courses/how-to-contribute-to-an-open-source-project-on-github). In brief, we
79 | use [GitHub flow][github-flow] to manage changes:
80 |
81 | 1. Create a new branch in your desktop copy of this repository for each
82 | significant change.
83 | 2. Commit the change in that branch.
84 | 3. Push that branch to your fork of this repository on GitHub.
85 | 4. Submit a pull request from that branch to the [upstream repository][repo].
86 | 5. If you receive feedback, make changes on your desktop and push to your
87 | branch on GitHub: the pull request will update automatically.
88 |
89 | NB: The published copy of the lesson is usually in the `main` branch.
90 |
91 | Each lesson has a team of maintainers who review issues and pull requests or
92 | encourage others to do so. The maintainers are community volunteers, and have
93 | final say over what gets merged into the lesson.
94 |
95 | ### Other Resources
96 |
97 | The Carpentries is a global organisation with volunteers and learners all over
98 | the world. We share values of inclusivity and a passion for sharing knowledge,
99 | teaching and learning. There are several ways to connect with The Carpentries
100 | community listed at including via social
101 | media, Slack, newsletters, and email lists. You can also [reach us by
102 | email][contact].
103 |
104 | [repo]: https://github.com/datacarpentry/spreadsheets-socialsci
105 | [contact]: mailto:team@carpentries.org
106 | [cp-site]: https://carpentries.org/
107 | [dc-issues]: https://github.com/issues?q=user%3Adatacarpentry
108 | [dc-lessons]: https://datacarpentry.org/lessons/
109 | [dc-site]: https://datacarpentry.org/
110 | [discuss-list]: https://lists.software-carpentry.org/listinfo/discuss
111 | [github]: https://github.com
112 | [github-flow]: https://guides.github.com/introduction/flow/
113 | [github-join]: https://github.com/join
114 | [how-contribute]: https://egghead.io/series/how-to-contribute-to-an-open-source-project-on-github
115 | [issues]: https://carpentries.org/help-wanted-issues/
116 | [lc-issues]: https://github.com/issues?q=user%3ALibraryCarpentry
117 | [swc-issues]: https://github.com/issues?q=user%3Aswcarpentry
118 | [swc-lessons]: https://software-carpentry.org/lessons/
119 | [swc-site]: https://software-carpentry.org/
120 | [lc-site]: https://librarycarpentry.org/
121 | [template-doc]: https://carpentries.github.io/workbench/
122 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Licenses"
3 | ---
4 |
5 | ## Instructional Material
6 |
7 | All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry)
8 | instructional material is made available under the [Creative Commons
9 | Attribution license][cc-by-human]. The following is a human-readable summary of
10 | (and not a substitute for) the [full legal text of the CC BY 4.0
11 | license][cc-by-legal].
12 |
13 | You are free:
14 |
15 | - to **Share**---copy and redistribute the material in any medium or format
16 | - to **Adapt**---remix, transform, and build upon the material
17 |
18 | for any purpose, even commercially.
19 |
20 | The licensor cannot revoke these freedoms as long as you follow the license
21 | terms.
22 |
23 | Under the following terms:
24 |
25 | - **Attribution**---You must give appropriate credit (mentioning that your work
26 | is derived from work that is Copyright (c) The Carpentries and, where
27 | practical, linking to ), provide a [link to the
28 | license][cc-by-human], and indicate if changes were made. You may do so in
29 | any reasonable manner, but not in any way that suggests the licensor endorses
30 | you or your use.
31 |
32 | - **No additional restrictions**---You may not apply legal terms or
33 | technological measures that legally restrict others from doing anything the
34 | license permits. With the understanding that:
35 |
36 | Notices:
37 |
38 | * You do not have to comply with the license for elements of the material in
39 | the public domain or where your use is permitted by an applicable exception
40 | or limitation.
41 | * No warranties are given. The license may not give you all of the permissions
42 | necessary for your intended use. For example, other rights such as publicity,
43 | privacy, or moral rights may limit how you use the material.
44 |
45 | ## Software
46 |
47 | Except where otherwise noted, the example programs and other software provided
48 | by The Carpentries are made available under the [OSI][osi]-approved [MIT
49 | license][mit-license].
50 |
51 | Permission is hereby granted, free of charge, to any person obtaining a copy of
52 | this software and associated documentation files (the "Software"), to deal in
53 | the Software without restriction, including without limitation the rights to
54 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
55 | of the Software, and to permit persons to whom the Software is furnished to do
56 | so, subject to the following conditions:
57 |
58 | The above copyright notice and this permission notice shall be included in all
59 | copies or substantial portions of the Software.
60 |
61 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
62 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
63 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
64 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
65 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
66 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
67 | SOFTWARE.
68 |
69 | ## Trademark
70 |
71 | "The Carpentries", "Software Carpentry", "Data Carpentry", and "Library
72 | Carpentry" and their respective logos are registered trademarks of
73 | [The Carpentries, Inc.][carpentries].
74 |
75 | [cc-by-human]: https://creativecommons.org/licenses/by/4.0/
76 | [cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode
77 | [mit-license]: https://opensource.org/licenses/mit-license.html
78 | [carpentries]: https://carpentries.org
79 | [osi]: https://opensource.org
80 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | [](https://slack-invite.carpentries.org/)
2 | [](https://carpentries.slack.com/messages/C9X34DJ9Z)
3 | [](https://zenodo.org/badge/latestdoi/92422634)
4 |
5 | # spreadsheets-socialsci
6 |
7 | Lesson on spreadsheets and data organization for social scientists.
8 |
9 | Readme file for The SAFI Teaching Database
10 | Generated on 2019-09-19 for teaching purposes.
11 |
12 | Recommended citation for the dataset: Woodhouse, Philip; Veldwisch, Gert Jan; Brockington, Daniel; Komakech,
13 | Hans C.; Manjichi, Angela; Venot, Jean-Philippe (2018): SAFI Survey Results. doi:10.6084/m9.figshare.6262019.v1
14 |
15 | ***
16 |
17 | PROJECT INFORMATION
18 |
19 | ***
20 |
21 | 1. Title of dataset: The SAFI (Studying African Farmer-led Irrigation) Teaching Database
22 |
23 | 2. Author information:
24 |
25 | Principal Investigator
26 | Name: Philip Woodhouse
27 | Address: University of Manchester
28 | Email: [phil.woodhouse@manchester.ac.uk](mailto:phil.woodhouse@manchester.ac.uk)
29 |
30 | Co-Investigators:
31 | Name: Gert Jan Veldwisch
32 | Name: Daniel Brockington
33 | Name: Hans C Komakech
34 | Name: Angela Manjichi
35 | Name: Jean-Philippe Venot
36 |
37 | 3. Date of data collection: November 2016 - June 2017
38 |
39 | 4. Funder Name: DFID-ESRC Growth Research Programme (DEGRP) grant ES/L01239/1
40 |
41 | 5. Publications:
42 | Farmer-led irrigation development and investment strategies for food security, growth and employment in
43 | Africa. Policy Brief. [www.safi-research.org/resources](https://www.safi-research.org/resources)
44 |
45 | ***
46 |
47 | DATA ACCESS INFORMATION
48 |
49 | ***
50 |
51 | 1. Licences / restrictions placed on access to the dataset: CC0
52 | 2. Access through figshare: doi:10.6084/m9.figshare.6262019.v1
53 |
54 | ***
55 |
56 | METHODS OF DATA COLLECTION
57 |
58 | ***
59 |
60 | 1. Describe the methods for data collection and / or provide links to papers describing data collection methods:
61 | This is survey data relating to households and agriculture in Tanzania and Mozambique. The survey data was collected
62 | through interviews conducted between November 2016 and June 2017. This is a teaching version of the dataset,
63 | not the full version.
64 |
65 | 2. Instrument information:
66 | The survey was split into several sections:
67 | A - General questions about when and where the survey was conducted;
68 | B - Information about the household and how long they have been living in the area;
69 | C - Details about the accommodation and other buildings on the farm;
70 | D - Details about different plots of land they grow crops on;
71 | E - Details about how they irrigate the land and availability of water;
72 | F - Financial details including assets owned and sources of income;
73 | G - Details of financial hardships;
74 | X - Information collected directly from the smartphone (GPS) or automatically included in the form (InstanceID).
75 |
76 | 3. Data processing:
77 | The survey data was collected through interviews using forms downloaded to Android Smartphones. The survey forms were
78 | created using the ODK (Open Data Kit) software via an Excel spreadsheet. The collected data were then sent back to the server.
79 | The server can be used to download the collected data in both JSON and CSV formats.
80 |
81 | 4. Analysis methods:
82 | Descriptive and summary statistics were calculated using SPSS.
83 |
84 | ***
85 |
86 | SUMMARY OF DATA FILES
87 |
88 | ***
89 |
90 | 1. List of data files:
91 | Filename: SAFI\_clean.csv
92 | Short description: CSV file containing the combined teaching data on one worksheet.
93 |
94 | Filename: SAFI\_messy.xlsx
95 | Short description: Excel file containing data for Tanzania and Mozambique recorded on separate worksheets
96 | and requiring data cleaning prior to analysis.
97 |
98 | Filename: SAFI\_dates.xlsx
99 | Short description: Excel file containing date data for understanding how to format dates in spreadsheets.
100 |
101 | 2. Relationships between files:
102 | No official linkages between files.
103 |
104 | ***
105 |
106 | DATA-SPECIFIC INFORMATION FOR SAFI\_clean.csv
107 |
108 | ***
109 |
110 | 1. Number of variables: 14
111 |
112 | 2. Number of cases: 131
113 |
114 | 3. Missing data codes: NULL
115 |
116 | 4. Variable list
117 | Variable name: key\_ID
118 | Variable description: Added to provide a unique ID for each observation (the InstanceID field does this as well)
119 | Variable coding/values: Numeric values
120 | Range of values: 1-202
121 |
122 | Variable name: village
123 | Variable description: Village name
124 | Variable coding / values: Text
125 | Range of values: God, Chirodzo, Ruaca
126 |
127 | Variable name: interview\_date
128 | Variable description: Date of interview
129 | Variable coding: Date YYYY-MM-DDTime
130 | Range of values: 2016-11-16 - 2017-06-04
131 |
132 | Variable name: no\_membrs
133 | Variable description: How many members live in the household?
134 | Variable coding: Numeric value (continuous)
135 | Range of values: 2 - 19
136 |
137 | Variable name: years\_liv
138 | Variable description: How many years have you lived in this, or a neighbouring village?
139 | Variable coding: Numeric value (years, continuous)
140 | Range of values: 1-96
141 |
142 | Variable name: respondent\_wall\_type
143 | Variable description: What type of walls are in the house?
144 | Variable coding: Text (categories)
145 | Range of values: burntbricks, muddaub, sunbricks, cement
146 |
147 | Variable name: rooms
148 | Variable description: How many rooms in the house are used for sleeping?
149 | Variable coding: Numeric value (continuous)
150 | Range of values: 1-8
151 |
152 | Variable name: memb\_assoc
153 | Variable description: Is the participant a member of an irrigation association?
154 | Variable coding: Yes / No / NULL
155 |
156 | Variable name: affect\_conflicts
157 | Variable description: Has the person been affected by conflicts with other irrigators in the area?
158 | Variable coding: Text (category)
159 | Range of values: once, more\_once, frequently, never, NULL
160 |
161 | Variable name: liv\_count
162 | Variable description: Livestock count
163 | Variable coding: Numeric value (continuous)
164 | Range of values: 1-5
165 |
166 | Variable name: items\_owned
167 | Variable description: Which of the following items are owned by the household (list provided)
168 | Variable coding: Text (string separated by semicolon)
169 |
170 | Variable name: no\_meals
171 | Variable description: How many meals do people in your household normally eat in a day?
172 | Variable coding: Numeric value (continuous)
173 | Range of values: 2-3
174 |
175 | Variable name: months\_lack\_food
176 | Variable description: Indicate which months, in the last 12 months where you have faced a situation when you did not have enough food to feed the household?
177 | Variable coding: Text (string separated by semicolon)
178 | Range of values: Month given in abbreviation or none
179 |
180 | Variable name: InstanceID
181 | Variable description: Unique identifier for the form data submission
182 | Variable coding: unique ID alpha-numeric string
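
The codebook above can be applied when reading SAFI\_clean.csv outside a spreadsheet. The following is a minimal Python sketch, not part of the original dataset documentation: it treats `NULL` as missing and splits the semicolon-separated `items_owned` string into a list. The two sample rows and their item names are invented for illustration, and `parse_row` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows that follow the codebook; not real SAFI records.
SAMPLE = """key_ID,village,memb_assoc,items_owned
1,God,NULL,bicycle;television;solar_panel
2,Ruaca,yes,NULL
"""

MISSING = "NULL"  # missing data code stated in the codebook


def parse_row(row):
    """Convert one CSV row (a dict of strings) using the codebook rules."""
    # Replace the missing data code with None in every column.
    cleaned = {k: (None if v == MISSING else v) for k, v in row.items()}
    cleaned["key_ID"] = int(cleaned["key_ID"])
    # items_owned is a single string with values separated by semicolons.
    items = cleaned["items_owned"]
    cleaned["items_owned"] = items.split(";") if items else []
    return cleaned


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r["key_ID"], r["memb_assoc"], r["items_owned"])
```

Keeping the missing data code in one named constant makes it easy to change if a file uses a different code.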
183 |
184 | ***
185 |
186 | DATA-SPECIFIC INFORMATION FOR SAFI\_messy.xlsx
187 |
188 | ***
189 |
190 | 1. Number of variables: 14 across two worksheets (Tanzania and Mozambique)
191 |
192 | 2. Variable list
193 | Variable name: key\_ID
194 | Variable description: Added to provide a unique ID for each observation (the InstanceID field does this as well)
195 | Variable coding/values: Numeric values
196 | Range of values: 1-202
197 |
198 | Variable name: roof\_type
199 | Variable description: Type of roof on accommodation
200 | Variable coding / values: Text (categories)
201 | Range of values: grass, mabatisloping
202 |
203 | Variable name: wall\_type
204 | Variable description: Type of wall in accommodation
205 | Variable coding: Text (categories)
206 | Range of values: muddaub, burntbricks
207 |
208 | Variable name: floor\_type
209 | Variable description: Type of floor in accommodation
210 | Variable coding: Text (categories)
211 | Range of values: earth, cement
212 |
213 | Variable name: live\_stock\_owned\_and\_numbers
214 | Variable description: Type of livestock owned and total number owned
215 | Variable coding: Alpha numeric
216 | Range of values: 1-4, poultry, oxen, cows, goats
217 |
218 | Variable name: plots
219 | Variable description: Number of plots cultivated in the last 12 months
220 | Variable coding: Numeric (categories)
221 | Range of values: 1-4 and -999
222 |
223 | Variable name: water use
224 | Variable description: Do you bring water to your fields, stop water leaving your fields or drain water out of any of your fields?
225 | Variable coding: text (categories)
226 | Range of values: no, yes, Y, N, 1, 1, no (only in summer)
227 |
228 | Variable name: rooms
229 | Variable description: Number of rooms in the house used for sleeping
230 | Variable coding: Numeric
231 | Range of values: 1 - 4
232 |
233 | Variable name: oxen
234 | Variable description: Do you own oxen?
235 | Variable coding: Numeric binary
236 | Range of values: 0, 1
237 |
238 | Variable name: poultry
239 | Variable description: Do you own poultry?
240 | Variable coding: Text
241 | Range of values: 1,2 Yes
242 |
243 | Variable name: goats
244 | Variable description: Do you own goats?
245 | Variable coding: Text
246 | Range of values: 1, 0, No
247 |
248 | Variable name: cows
249 | Variable description: Do you own cows?
250 | Variable coding: Text
251 | Range of values: 1,0, Yes
252 |
253 | Variable name: total
254 | Variable description: Total number of livestock owned
255 | Variable coding: Numeric (continuous)
256 | Range of values: 1-4
257 |
258 | Variable name: look after cows
259 | Variable description: Does the participant look after cows?
260 | Variable coding: Yes / No
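
The mixed codings listed above (for example `yes`, `Y`, `1` and `no (only in summer)` in the water use column, or `-999` in plots) are the kind of inconsistency that must be resolved before analysis. The Python sketch below shows one possible approach under stated assumptions: that `1` means "yes" and that `-999` is a missing-value code. Neither is confirmed by this readme, so both should be checked against the survey documentation. The function names are hypothetical.

```python
# Assumed mapping from recorded codes to a consistent yes/no value.
# Treating "1" as "yes" is a guess that should be confirmed before use.
WATER_USE_CODES = {"yes": "yes", "y": "yes", "1": "yes", "no": "no", "n": "no"}
PLOTS_MISSING = -999  # assumed to be a missing-value code


def clean_water_use(value):
    """Return (code, note); a trailing comment is kept as a note, not dropped."""
    text = str(value).strip()
    # Split off a trailing comment such as "(only in summer)".
    base, _, note = text.partition("(")
    code = WATER_USE_CODES.get(base.strip().lower())
    return code, note.rstrip(") ").strip() or None


def clean_plots(value):
    """Return the number of plots, or None for the assumed missing code."""
    number = int(value)
    return None if number == PLOTS_MISSING else number


print(clean_water_use("no (only in summer)"))
print(clean_plots("-999"))
```

Values that are not in the mapping come back as `None` rather than being guessed, so they can be reviewed by hand.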
261 |
262 | ***
263 |
264 | DATA-SPECIFIC INFORMATION FOR SAFI\_dates.xlsx
265 |
266 | ***
267 |
268 | 1. Number of variables: 14 across two worksheets (DD\_MM\_YEAR and MM\_DD\_YEAR)
269 |
270 | 2. Variable list
271 | Variable name: Interview dates
272 | Variable description: Date that interview took place
273 | Variable coding/values: Date
274 | Range of values: DD-MM-YYYY or MM\_DD\_YYYY depending on spreadsheet
275 |
276 | Variable name: years\_farm
277 | Variable description: Number of years the household have been farming in this area
278 | Variable coding / values:
279 | Range of values:
280 |
281 | Variable name: parents\_live
282 | Variable description: Did your parents live in this village or neighbouring village?
283 | Variable coding: Yes / No
284 | Range of values: Yes / No
285 |
286 | Variable name: no\_membrs
287 | Variable description: How many members live in your household?
288 | Variable coding: Numeric value
289 | Range of values: 2 - 19
290 |
291 | Variable name: roof\_type
292 | Variable description: Type of roof on the accommodation
293 | Variable coding: Text (categories)
294 | Range of values: grass, mabatisloping
295 |
296 | Variable name: respondent\_wall\_type
297 | Variable description: Type of wall in the accommodation
298 | Variable coding: Text (categories)
299 | Range of values: burntbricks, muddaub, sunbricks, cement
300 |
301 | Variable name: floor\_type
302 | Variable description: Type of floor in the accommodation
303 | Variable coding: Text (categories)
304 | Range of values: earth, cement
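
Because the two worksheets store interview dates in different orders (day first or month first), the same text can stand for two different dates. The short Python sketch below illustrates the ambiguity with an invented value. It assumes hyphen-separated dates, and `to_iso` is a hypothetical helper rather than anything shipped with the dataset.

```python
from datetime import datetime


def to_iso(raw, day_first):
    """Parse a DD-MM-YYYY or MM-DD-YYYY string and return YYYY-MM-DD."""
    fmt = "%d-%m-%Y" if day_first else "%m-%d-%Y"
    return datetime.strptime(raw, fmt).date().isoformat()


raw = "03-04-2017"  # invented example value, not taken from the dataset
print(to_iso(raw, day_first=True))   # 2017-04-03 (3 April)
print(to_iso(raw, day_first=False))  # 2017-03-04 (4 March)
```

Writing dates as YYYY-MM-DD, as the interview\_date variable in SAFI\_clean.csv does, avoids this ambiguity.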
305 |
306 |
307 |
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | #------------------------------------------------------------
2 | # Values for this lesson.
3 | #------------------------------------------------------------
4 |
5 | # Which carpentry is this (swc, dc, lc, or cp)?
6 | # swc: Software Carpentry
7 | # dc: Data Carpentry
8 | # lc: Library Carpentry
9 | # cp: Carpentries (to use for instructor training for instance)
10 | # incubator: The Carpentries Incubator
11 | carpentry: 'dc'
12 |
13 | # Overall title for pages.
14 | title: 'Data Organization in Spreadsheets for Social Scientists'
15 |
16 | # Date the lesson was created (YYYY-MM-DD, this is empty by default)
17 | created: '2017-05-25'
18 |
19 | # Comma-separated list of keywords for the lesson
20 | keywords: 'software, data, lesson, The Carpentries'
21 |
22 | # Life cycle stage of the lesson
23 | # possible values: pre-alpha, alpha, beta, stable
24 | life_cycle: 'stable'
25 |
26 | # License of the lesson materials (recommended CC-BY 4.0)
27 | license: 'CC-BY 4.0'
28 |
29 | # Link to the source repository for this lesson
30 | source: 'https://github.com/datacarpentry/spreadsheets-socialsci'
31 |
32 | # Default branch of your lesson
33 | branch: 'main'
34 |
35 | # Who to contact if there are any issues
36 | contact: 'team@carpentries.org'
37 |
38 | # Navigation ------------------------------------------------
39 | #
40 | # Use the following menu items to specify the order of
41 | # individual pages in each dropdown section. Leave blank to
42 | # include all pages in the folder.
43 | #
44 | # Example -------------
45 | #
46 | # episodes:
47 | # - introduction.md
48 | # - first-steps.md
49 | #
50 | # learners:
51 | # - setup.md
52 | #
53 | # instructors:
54 | # - instructor-notes.md
55 | #
56 | # profiles:
57 | # - one-learner.md
58 | # - another-learner.md
59 |
60 | # Order of episodes in your lesson
61 | episodes:
62 | - 00-intro.md
63 | - 01-format-data.md
64 | - 02-common-mistakes.md
65 | - 03-dates-as-data.md
66 | - 04-quality-assurance.md
67 | - 05-exporting-data.md
68 |
69 | # Information for Learners
70 | learners:
71 |
72 | # Information for Instructors
73 | instructors:
74 |
75 | # Learner Profiles
76 | profiles:
77 |
78 | # Customisation ---------------------------------------------
79 | #
80 | # This space below is where custom yaml items (e.g. pinning
81 | # sandpaper and varnish versions) should live
82 |
83 |
84 | url: 'https://datacarpentry.github.io/spreadsheets-socialsci'
85 | analytics: carpentries
86 | lang: en
87 |
--------------------------------------------------------------------------------
/episodes/00-intro.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Introduction
3 | teaching: 15
4 | exercises: 3
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Define the scope of this lesson
10 | - Describe some drawbacks and advantages of using spreadsheet programs
11 |
12 | ::::::::::::::::::::::::::::::::::::::::::::::::::
13 |
14 | :::::::::::::::::::::::::::::::::::::::: questions
15 |
16 | - What are spreadsheets useful for in a research project?
17 |
18 | ::::::::::::::::::::::::::::::::::::::::::::::::::
19 |
20 |
21 | Good data organization is the foundation of your research
22 | project. Most researchers have data or do data entry in
23 | spreadsheets. Spreadsheet programs are very useful graphical
24 | interfaces for designing data tables and handling very basic data
25 | quality control functions.
26 |
27 | ### Spreadsheet outline
28 |
29 | In this lesson, we're going to talk about:
30 |
31 | - Good data entry practices - formatting data tables in spreadsheets
32 | - How to avoid common formatting mistakes
33 | - Recognising and reformatting dates in spreadsheets
34 | - Basic quality control and data manipulation in spreadsheets
35 | - Exporting data from spreadsheets
36 |
37 | ### Spreadsheet programs
38 |
39 | Many spreadsheet programs are available. We will use Microsoft Excel in our examples.
40 | Although it is not open source software, it is very widely available and used.
41 |
42 | Free spreadsheet programs such as LibreOffice are available.
43 | The functionality of these may differ from Excel, but in general they can be used to perform similar tasks.
44 |
45 | ## Problems with Spreadsheets
46 |
47 | Spreadsheets are good for data entry,
48 | but in reality we tend to use spreadsheet programs for much more than data entry.
49 | We use them to create data tables for publications,
50 | to generate summary statistics,
51 | and make figures.
52 | Laying out spreadsheets in this way often adds some difficulty when we want
53 | to take our data from the spreadsheet and use it in another program.
54 | Additional white space, merged cells, colour and grids
55 | may aid readability but are not easily handled by other programs
56 | that take our spreadsheet as an input to further analysis.
57 |
58 | Generating statistics and figures in spreadsheets should be done with caution.
59 | The graphical, drag-and-drop nature of spreadsheet programs means that it can be very difficult, if not impossible, to replicate your steps (much less retrace anyone else's).
60 | This is particularly true if your stats or figures require complex calculations.
61 | Furthermore, when performing calculations in a spreadsheet, it's easy to accidentally apply a slightly different formula to multiple adjacent cells.
62 | This often makes it difficult to demonstrate data quality and consistency in our analysis.
63 |
64 | Even when we are aware of some of the limitations that data in spreadsheets present,
65 | we have often inherited spreadsheets from a colleague or data provider.
66 | In these situations we have no control over how the spreadsheet was constructed
67 | or how the data were entered.
68 | Nevertheless, it is important to be aware of the limitations these data may present, and to know how to assess whether any problems are present and how to overcome them.
69 |
70 | ::::::::::::::::::::::::::::::::::::::::: callout
71 |
72 | ## What this lesson will not teach you
73 |
74 | - How to do *statistics* in a spreadsheet
75 | - How to do *plotting* in a spreadsheet
76 | - How to *write code* in spreadsheet programs
77 |
78 | If you're looking to do this, a couple of good references are the
79 | [Excel Cookbook](https://search.worldcat.org/title/1419271899), published by O'Reilly, and the [Microsoft Excel 365 bible](https://search.worldcat.org/en/title/1263023438).
80 |
81 |
82 | ::::::::::::::::::::::::::::::::::::::::::::::::::
83 |
84 | ::::::::::::::::::::::::::::::::::::::: challenge
85 |
86 | ## Exercise
87 |
88 | - How many people have used spreadsheets in their research?
89 | - How many people have accidentally done something that made them
90 | frustrated or sad?
91 |
92 |
93 | ::::::::::::::::::::::::::::::::::::::::::::::::::
94 |
95 | ### Using Spreadsheets for Data Entry and Cleaning
96 |
97 | However, there are circumstances where you might want to use a spreadsheet
98 | program to produce "quick and dirty" calculations or figures, and some of
99 | these features can be used in data cleaning, prior to importation into a
100 | statistical analysis program. We will show you how to use some features of
101 | spreadsheet programs to check your data quality along the way and produce
102 | preliminary summary statistics.
103 |
104 | In this lesson, we will assume that you are most likely using Excel as
105 | your primary spreadsheet program - there are other programs with similar functionality but Excel seems
106 | to be the most commonly used.
107 |
108 |
109 |
110 | :::::::::::::::::::::::::::::::::::::::: keypoints
111 |
112 | - Good data organization is the foundation of any research project.
113 | - Spreadsheets are good for data entry, but when doing data cleaning or analysis, it's not easy to show or replicate what you did.
114 |
115 | ::::::::::::::::::::::::::::::::::::::::::::::::::
116 |
117 |
118 |
--------------------------------------------------------------------------------
/episodes/01-format-data.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Formatting Data Tables in Spreadsheets
3 | teaching: 15
4 | exercises: 15
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Recognise and resolve common spreadsheet formatting problems.
10 | - Describe the importance of metadata.
11 | - Identify metadata that should be included with a dataset.
12 |
13 | ::::::::::::::::::::::::::::::::::::::::::::::::::
14 |
15 | :::::::::::::::::::::::::::::::::::::::: questions
16 |
17 | - How do we format data in spreadsheets for effective data use?
18 |
19 | ::::::::::::::::::::::::::::::::::::::::::::::::::
20 |
21 | ## Data formatting problems
22 |
23 | The most common mistake made is treating spreadsheet programs like lab notebooks, that is,
24 | relying on context, notes in the margin,
25 | spatial layout of data and fields to convey information. As humans, we
26 | can (usually) interpret these things, but computers don't view information the same way, and
27 | unless we explain to the computer what every single thing means (and
28 | that can be hard!), it will not be able to see how our data fit
29 | together.
30 |
31 | Using the power of computers, we can manage and analyze data in much more
32 | effective and faster ways, but to use that power, we have to set up
33 | our data for the computer to be able to understand it (and computers are very
34 | literal).
35 |
36 | This is why it's extremely important to set up well-formatted
37 | tables from the outset - before you even start entering data from
38 | your very first preliminary experiment. Data organization is the
39 | foundation of your research project. It can make it easier or harder
40 | to work with your data throughout your analysis, so it's worth
41 | thinking about when you're doing your data entry or setting up your
42 | experiment. You can set things up in different ways in spreadsheets,
43 | but some of these choices can limit your ability to work with the data in other programs, or
44 | make it harder for the you-of-6-months-from-now or your collaborator to work with the
45 | data.
46 |
47 | ::::::::::::::::::::::::::::::::::::::::: callout
48 |
49 | ## Tip
50 |
51 | The best layouts/formats (as well as software and
52 | interfaces) for data entry and data analysis might be
53 | different. It is important to take this into account, and ideally
54 | automate the conversion from one to another.
55 |
56 |
57 | ::::::::::::::::::::::::::::::::::::::::::::::::::
58 |
59 | ### Keeping track of your analyses
60 |
61 | When you're working with spreadsheets, during data clean up or analyses, it's
62 | very easy to end up with a spreadsheet that looks very different from the one
63 | you started with. In order to be able to reproduce your analyses or figure out
64 | what you did when Reviewer #3 asks for a different analysis, you should
65 |
66 | - create a new file or tab with your cleaned or analyzed data. Don't modify
67 | the original dataset, or you will never know where you started!
68 | - keep track of the steps you took in your clean up or analysis. You should track
69 | these steps as you would any step in an experiment. You can
70 | do this in another text file, or a good option is to create a new tab in your spreadsheet
71 | with your notes. This way the notes and data stay together.
72 |
73 | Put these principles into practice today during the exercises.
74 |
75 | ### Tidy data in spreadsheets
76 |
77 | The tidy data principles when structuring data in spreadsheets are:
78 |
79 | 1. Put all your variables in columns - the thing you're measuring,
80 | like 'weight' or 'temperature'.
81 | 2. Put each observation in its own row.
82 | 3. Don't combine multiple pieces of information in one
83 | cell. Sometimes it just seems like one thing, but think about whether that's
84 | the only way you'll want to be able to use or sort that data.
85 | 4. Leave the raw data raw - don't change it!
86 | 5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This
87 | ensures that anyone can use the data, and is required by
88 | most data repositories.
89 |
90 | You can understand these principles more easily with the illustrations in the [Tidy Data Series by Lowndes & Horst](https://allisonhorst.com/other-r-fun).
91 |
92 | For instance, we're going to be working with data from a study of
93 | agricultural practices among farmers in two countries in eastern
94 | sub-Saharan Africa (Mozambique and Tanzania). Researchers conducted
95 | interviews with farmers in these countries to collect data on
96 | household statistics (e.g., number of household members,
97 | number of meals eaten per day, availability of water),
98 | farming practices (e.g., water usage), and assets (e.g., number of farm plots,
99 | number of livestock). They also recorded the dates and locations of
100 | each interview.
101 |
102 | If they were to keep track of the data like this:
103 |
104 | {alt='multiple-info example'}
105 |
106 | the problem is that number of livestock and type of livestock are in
107 | the same field. So, if they wanted to
108 | look at the average number of livestock owned, or the average number of each type
109 | of livestock,
110 | it would be hard to do this using this data setup. If instead we put the count
111 | of each type of livestock in its own column, this would make analysis
112 | much easier. The rule of thumb, when setting up a datasheet, is that each
113 | variable (in this case, each type of livestock) should have its own column,
114 | each observation should have its own row, and each cell should contain only a
115 | single value. Thus, the example above should look like this:
116 |
117 | {alt='single-info example'}
118 |
119 | Notice that this now allows us to make statements about the number of each type of
120 | animal that a farmer owns, while still allowing us to say things about the
121 | total number of livestock. All we need to do is sum the values in each row to
122 | find a total. We'll be learning how to do this computationally and reproducibly
123 | later in this workshop.
124 |
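As a rough illustration of why one column per livestock type makes analysis easier, here is a minimal Python sketch. The column names (`oxen`, `cows`, `goats`) and the counts are made up for this example and are not taken from the workshop dataset; the helper functions are hypothetical too.

```python
import csv
import io

# Hypothetical tidy table: one row per household, one column per livestock type.
TIDY_CSV = """household_id,oxen,cows,goats
1,2,1,0
2,0,0,4
3,1,3,2
"""

def livestock_totals(csv_text):
    """Return {household_id: total livestock} by summing each row's count columns."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        household = row.pop("household_id")
        totals[household] = sum(int(count) for count in row.values())
    return totals

def average_per_type(csv_text):
    """Return {livestock type: mean count across households}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    types = [name for name in rows[0] if name != "household_id"]
    return {t: sum(int(r[t]) for r in rows) / len(rows) for t in types}

print(livestock_totals(TIDY_CSV))
print(average_per_type(TIDY_CSV))
```

With the type and count mixed in one cell, neither calculation would be possible without first picking the text apart by hand.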
125 | ::::::::::::::::::::::::::::::::::::::::: callout
126 |
127 | ## Workshop Data
128 |
129 | > The data used in these lessons are taken from interviews of farmers in two
130 | > countries in eastern sub-Saharan Africa (Mozambique and Tanzania). These
131 | > interviews were conducted between November 2016 and June 2017 and probed
132 | > household features (e.g., construction materials used, number of household
133 | > members), agricultural practices (e.g., water usage), and assets (e.g., number
134 | > and types of livestock).
135 |
136 | This is a real dataset; however, it has been simplified for this workshop. If
137 | you're interested in exploring the full dataset further, you can download
138 | it from Figshare and work with it using exactly the same tools we'll learn
139 | about today.
140 |
141 | For more information about the dataset and to download it from Figshare, check
142 | out the [Social Sciences workshop data
143 | page](https://www.datacarpentry.org/socialsci-workshop/data).
144 |
145 |
146 | ::::::::::::::::::::::::::::::::::::::::::::::::::
147 |
148 | ::::::::::::::::::::::::::::::::::::::::: callout
149 |
150 | ## LibreOffice Users
151 |
152 | The default for LibreOffice is to treat tabs, commas, and semicolons as delimiters.
153 | This behavior can cause problems with both the data for this lesson and other data
154 | you might want to use. This can be fixed when opening LibreOffice by deselecting
155 | the "semicolons" and "tabs" checkboxes.
156 |
157 |
158 | ::::::::::::::::::::::::::::::::::::::::::::::::::
159 |
160 | ::::::::::::::::::::::::::::::::::::::: challenge
161 |
162 | ## Exercise
163 |
164 | We're going to take a messy version of the SAFI data and describe how we would clean it up.
165 |
166 | 1. Download the [messy data](https://ndownloader.figshare.com/files/11502824).
167 | 2. Open up the data in a spreadsheet program.
168 | 3. Notice that there are two tabs. Two researchers conducted the interviews,
169 | one in Mozambique and the other in Tanzania. They both structured their
170 | data tables in a different way. Now, you're the person in charge of this
171 | project and you want to be able to start analyzing the data.
172 | 4. With the person next to you, identify what is wrong with this spreadsheet.
173 | Discuss the steps you would need to take to clean up the two tabs, and to
174 | put them all together in one spreadsheet.
175 |
176 | **Important:** Do not forget our first piece of advice: create a new file
177 | (or tab) for the cleaned data, and never modify your original (raw) data.
178 |
179 | After you go through this exercise, we'll discuss as a group what was wrong
180 | with this data and how you would fix it.
181 |
182 | ::::::::::::::: solution
183 |
184 | ## Solution
185 |
186 | - Take about 10 minutes to work on this exercise.
187 | - All the mistakes listed in [the next episode](02-common-mistakes.md) are
188 | present in the messy dataset. If this
189 | exercise is done during a workshop, ask people what they saw as wrong with
190 | the data. As they bring up different points, you can refer to [the next episode](02-common-mistakes.md)
191 | or expand a bit on the point they brought up.
192 |
193 |
194 |
195 | :::::::::::::::::::::::::
196 |
197 | ::::::::::::::::::::::::::::::::::::::::::::::::::
198 |
199 | ::::::::::::::::::::::::::::::::::::::::: callout
200 |
201 | ## Handy References
202 |
203 | Three excellent references on spreadsheet organization are:
204 |
205 | - Hadley Wickham, *Tidy Data*, Vol. 59, Issue 10, Sep 2014, Journal of
206 | Statistical Software. [http://www.jstatsoft.org/v59/i10](https://www.jstatsoft.org/v59/i10)
207 |
208 | - Julia Lowndes \& Allison Horst, *Tidy Data Series by Lowndes & Horst*. [https://allisonhorst.com/other-r-fun](https://allisonhorst.com/other-r-fun)
209 |
210 | - Karl W. Broman \& Kara H. Woo, *Data Organization in Spreadsheets*, Vol. 72,
211 | Issue 1, 2018, The American Statistician.
212 | [https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989)
213 |
214 |
215 | ::::::::::::::::::::::::::::::::::::::::::::::::::
216 |
217 | ### Metadata
218 |
219 | Recording data about your data ("metadata") is essential. You may be on intimate
220 | terms with your dataset while you are
221 | collecting and analysing it, but the chances that you will still remember
222 | the exact wording of the question you asked about your
223 | informants' water use (the data recorded in the column `water use`), for
224 | example, are slim.
225 |
226 | As well, there are many reasons other people may want to examine or use your data - to understand your findings, to verify your findings,
227 | to review your submitted publication, to replicate your results, to design a
228 | similar study, or even to archive your data for access and
229 | re-use by others. While digital data by definition are machine-readable,
230 | understanding their meaning is a job for human beings. The
231 | importance of documenting your data during the collection and analysis phase of
232 | your research cannot be overestimated, especially if your
233 | research is going to be part of the scholarly record.
234 |
235 | However, metadata should not be contained in the data file itself. Unlike a table
236 | in a paper or a supplemental file, metadata (in the
237 | form of legends) should not be included in a data file since this information is
238 | not data, and including it can disrupt how computer
239 | programs interpret your data file. Rather, metadata should be stored as a
240 | separate file in the same directory as your data file,
241 | preferably in plain text format with a name that clearly associates it with your
242 | data file. Because metadata files are free text format,
243 | they also allow you to encode comments, units, information about how null values
244 | are encoded, etc. that are important to document but can
245 | disrupt the formatting of your data file.
246 |
247 | Some of this information may be familiar to learners who conduct analyses on
248 | survey data or other data sets that come with codebooks. Codebooks will often
249 | describe the way a variable has been constructed, what prompt was associated with
250 | it in a survey or interview, and what the various values mean. For example,
251 | the [General Social Survey](https://gss.norc.org) maintains their entire codebook online.
252 | Looking at an entry for a particular variable, such as
253 | [the variable `SEX`](https://gssdataexplorer.norc.org/variables/81/vshow), provides
254 | valuable information about what survey waves the variable covers, and the meaning
255 | of particular values.
256 |
257 | Additionally, file or database level metadata describes how files that make up
258 | the dataset relate to each other; what format they are
259 | in; and whether they supersede or are superseded by previous files. A
260 | folder-level readme.txt file is the classic way of accounting for
261 | all the files and folders in a project.
262 |
263 | Metadata are most useful when they follow a standard. For example, the
264 | [Data Documentation Initiative (DDI)](https://www.ddialliance.org) provides a
265 | standardized way to document metadata at various points in the research cycle.
266 | Research librarians may have specific expertise in this area, and can be
267 | helpful resources for thinking about ways to purposefully document metadata
268 | as part of your research.
269 |
270 | (Text on metadata adapted from the online course [MANTRA - Research Data Management Training](https://mantra.ed.ac.uk/) by Research Data Service and the Institute for Academic Development, University of Edinburgh. MANTRA is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).)
271 |
272 | ::::::::::::::::::::::::::::::::::::: instructor
273 | For the next exercise, learners will open a CSV file.
274 | Instructions on how to load the file correctly (ensuring the delimiter is
275 | interpreted properly regardless of the learner's computer settings)
276 | are provided in the
277 | [Quality Assurance episode](04-quality-assurance.md#restricting-data-to-a-numeric-range).
278 |
279 |
280 | :::::::::::::::::::::::::::::::::::::::::::::::::
281 |
282 | ::::::::::::::::::::::::::::::::::::::: challenge
283 |
284 | ## Exercise
285 |
286 | Download a [clean version of this
287 | dataset](https://ndownloader.figshare.com/files/11492171) and open the file
288 | with your spreadsheet program. This data has many more variables that were not
289 | included in the messy spreadsheet and is formatted according to tidy data
290 | principles.
291 |
292 | Discuss this data with a partner and make a list of some of the types of
293 | metadata that should be recorded about this dataset. It may be helpful to
294 | start by asking yourself, "What is not immediately obvious to me about this
295 | data? What questions would I need to know the answers to in order to analyze
296 | and interpret this data?"
297 |
298 | ::::::::::::::: solution
299 |
300 | ## Solution
301 |
302 | Some types of metadata that should be recorded and made available with the
303 | data are:
304 |
305 | - the exact wording of questions used in the interviews (if interviews were
306 | structured) or general prompts used (if interviews were semi-structured)
307 | - a description of the type of data allowed in each column (e.g., the allowed
308 | range for numerical data with a restricted range, a list of allowed options
309 | for categorical variables, whether data in a numerical column should be
310 | continuous or discrete)
311 | - definitions of any categorical variables (e.g., definitions of
312 | "burntbricks" and "sunbricks")
313 | - definitions of what was counted as a "room", a "plot", etc. (e.g., was
314 | there a minimum size)
315 | - learners may come up with additional questions to add to this list
316 |
317 |
318 |
319 | :::::::::::::::::::::::::
320 |
321 | ::::::::::::::::::::::::::::::::::::::::::::::::::
322 |
323 |
324 |
325 | :::::::::::::::::::::::::::::::::::::::: keypoints
326 |
327 | - Never modify your raw data. Always make a copy before making any changes.
328 | - Keep track of all of the steps you take to clean your data.
329 | - Organize your data according to tidy data principles.
330 | - Record metadata in a separate plain text file.
331 |
332 | ::::::::::::::::::::::::::::::::::::::::::::::::::
333 |
334 |
335 |
--------------------------------------------------------------------------------
/episodes/02-common-mistakes.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Formatting Problems
3 | teaching: 20
4 | exercises: 0
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Recognize and resolve common spreadsheet formatting problems.
10 |
11 | ::::::::::::::::::::::::::::::::::::::::::::::::::
12 |
13 | :::::::::::::::::::::::::::::::::::::::: questions
14 |
15 | - What common mistakes are made when formatting spreadsheets?
16 |
17 | ::::::::::::::::::::::::::::::::::::::::::::::::::
18 |
19 | ## Common Spreadsheet Errors
20 |
21 | This lesson is meant to be used as a reference for discussion as learners identify issues with the messy dataset discussed in the
22 | previous lesson. Instructors: don't go through this lesson except to refer to responses to the exercise in the previous lesson.
23 |
24 | There are a few potential errors to be on the lookout for in your own data as well as data from collaborators or the Internet. If you are aware of the errors and the possible negative effect on downstream data analysis and result interpretation, it might motivate you and your project members to try and avoid them. Making small changes to the way you format your data in spreadsheets can have a great impact on efficiency and reliability when it comes to data cleaning and analysis.
25 |
26 | - [Using multiple tables](#tables)
27 | - [Using multiple tabs](#tabs)
28 | - [Not filling in zeros](#zeros)
29 | - [Using problematic null values](#null)
30 | - [Using formatting to convey information](#formatting)
31 | - [Using formatting to make the data sheet look pretty](#formatting-pretty)
32 | - [Placing comments or units in cells](#units)
33 | - [Entering more than one piece of information in a cell](#info)
34 | - [Using problematic field names](#field-name)
35 | - [Using special characters in data](#special)
36 |
37 | ## Using multiple tables {#tables}
38 |
39 | A common strategy is creating multiple data tables within
40 | one spreadsheet. This confuses the computer, so try to avoid doing this!
41 | When you create multiple tables within one
42 | spreadsheet, you're drawing false associations between things for the computer,
43 | which sees each row as an observation. You're also potentially using the same
44 | field name in multiple places, which will make it harder to clean your data up
45 | into a usable form. The example below depicts the problem:
46 |
47 | {alt='multiple tables'}
48 |
49 | In the example above, the computer will see row 24 and assume that all columns A-J
50 | refer to the same sample. This row actually represents two distinct samples
51 | (information about livestock for informant 1 and information about plots for informant 2). Other rows are similarly problematic.
52 |
53 | ## Using multiple tabs {#tabs}
54 |
55 | But what about workbook tabs? That seems like an easy way to organize data, right? Well, yes and no. When you create extra tabs, you fail
56 | to allow the computer to see connections in the data that are there (you have to introduce spreadsheet application-specific functions or
57 | scripting to ensure this connection).
58 |
59 | Say you make a separate tab for each day you take a measurement. This isn't good practice for two reasons:
60 |
61 | 1) you are more likely to accidentally add inconsistencies to your data if each time you take a measurement, you start recording data in a new tab, and
62 | 2) even if you manage to prevent all inconsistencies from creeping in, you will add an extra step for yourself before you analyze the
63 | data because you will have to combine these data into a single datatable. You will have to explicitly tell the computer how to combine
64 | tabs - and if the tabs are inconsistently formatted, you might even have to do it manually.
65 |
66 | For these and other reasons, it is good practice to avoid creating new tabs to organize your spreadsheet data. The next time you're entering data, and you go to create another tab or table, ask yourself if you could avoid adding this tab by adding another column to your original spreadsheet. You may, however, use a new tab to store notes about your data, such as steps you've taken to clean or manipulate your data.
67 |
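To show what "adding another column instead of another tab" can look like, here is a small Python sketch that stacks per-day tables into one table with a `date` column. The workbook contents, the dates, and the `combine_tabs` helper are all hypothetical and only illustrate the idea.

```python
def combine_tabs(tabs):
    """Stack per-day tables into one table, recording the tab name in a 'date' column."""
    combined = []
    for date, rows in tabs.items():
        for row in rows:
            combined.append({"date": date, **row})
    return combined

# Hypothetical workbook: one tab per interview day, each tab a list of rows.
tabs = {
    "2016-11-17": [{"household_id": 1, "goats": 3}],
    "2016-11-18": [{"household_id": 2, "goats": 0}, {"household_id": 3, "goats": 5}],
}

for row in combine_tabs(tabs):
    print(row)
```

This only works this smoothly because every tab uses the same column names. If the data had been entered in a single sheet with a `date` column from the start, the combining step would not be needed at all.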
68 | Your data sheet might get very long over the course of the experiment. This makes it harder to enter data if you can't see your headers
69 | at the top of the spreadsheet. But don't repeat your header row. Repeated headers can easily get mixed into the data,
70 | leading to problems down the road.
71 |
72 | Instead you can freeze the column headers so that they remain visible even when you have a spreadsheet with many rows.
73 |
74 | [Documentation on how to freeze column headers](https://support.office.com/en-ca/article/Freeze-column-headings-for-easy-scrolling-57ccce0c-cf85-4725-9579-c5d13106ca6a)
75 |
76 | ## Not filling in zeros {#zeros}
77 |
78 | It might be that when you're measuring something, it's
79 | usually a zero, say the number of cows that an informant has, in a
80 | region where most farmers have goats and no cows. Why bother
81 | writing in the number zero in that column, when it's mostly zeros?
82 |
83 | {alt='filling in zeros'}
84 |
85 | However, there's a difference between a zero and a blank cell in a spreadsheet. To the computer, a zero is actually data. You measured
86 | or counted it. A blank cell means that it wasn't measured and the computer will interpret it as an unknown value (otherwise known as a
87 | null value).
88 |
89 | Spreadsheet or statistical programs will likely misinterpret blank cells that you intend to be zeros. By not entering the value of
90 | your observation, you are telling your computer to represent that data as unknown or missing (null). This can cause problems with
91 | subsequent calculations or analyses. For example, in many statistical programs the average of a set of numbers that includes a single null value is itself null
92 | (because the computer can't guess the value of the missing observations). Because of this, it's very important to record zeros as zeros and truly missing data as nulls.
93 |
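The difference between a zero and a blank can be sketched in a few lines of Python. This is only an illustration: the `cows` values are invented, and `parse_count` is a hypothetical helper that treats a blank cell as missing (NaN), which is how many analysis tools behave but not necessarily all of them.

```python
import math

def parse_count(cell):
    """Convert a spreadsheet cell to a number; a blank cell becomes NaN (missing)."""
    text = cell.strip()
    return float(text) if text else math.nan

def mean(values):
    """Plain arithmetic mean; any NaN in the input makes the result NaN."""
    return sum(values) / len(values)

# Hypothetical 'cows' column for four households.
zeros_recorded = ["0", "2", "0", "1"]   # zeros entered explicitly
zeros_left_blank = ["", "2", "", "1"]   # zeros left as blank cells

print(mean([parse_count(c) for c in zeros_recorded]))    # 0.75
print(mean([parse_count(c) for c in zeros_left_blank]))  # nan

# Dropping the "missing" values instead gives a different, misleading answer.
known = [v for v in map(parse_count, zeros_left_blank) if not math.isnan(v)]
print(mean(known))  # 1.5
```

Either way, the blanks lose information: the result is unknown, or it is computed from only the households that happened to have cows.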
94 | ## Using problematic null values {#null}
95 |
96 | **Example**: using -999 or other numerical values (or zero) to represent missing data.
97 |
98 | **Solution**: One common practice is to record unknown or missing data as -999, 999, or 0. Many statistical programs will not recognize
99 | that these are intended to represent missing (null) values. How these values are interpreted will depend on the software you use to
100 | analyze your data. It is essential to use a clearly defined and consistent null indicator.
101 | Blanks (most applications) and NA (for R) are good choices. White et al., 2013, explain good choices for indicating null values for different software applications in their article:
102 | [Nine simple ways to make it easier to (re)use your data.](https://ojs.library.queensu.ca/index.php/IEE/article/view/4608) Ideas in Ecology and Evolution.
103 |
104 | | Null Values | Problems | Compatibility | Recommendation |
105 | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------- | -------------- |
106 | | 0 | Indistinguishable from a true zero | | NEVER use |
107 | | Blank | Hard to distinguish values that are missing from those overlooked on entry. Hard to distinguish blanks from spaces, which behave differently. | R, Python, SQL, Excel | Best option |
108 | | \-999, 999 | Not recognized as null by many programs without user input. Can be inadvertently entered into calculations. | | Avoid |
109 | | NA, na | Can also be an abbreviation (e.g., North America), can cause problems with data type (turn a numerical column into a text column). NA is more commonly recognized than na. | R | Good option |
110 | | N/A | An alternate form of NA, but often not compatible with software. | | Avoid |
111 | | NULL | Can cause problems with data type. | SQL | Good option |
112 | | None | Uncommon. Can cause problems with data type. | Python | Avoid |
113 | | No data | Uncommon. Can cause problems with data type, contains a space. | | Avoid |
114 | | Missing | Uncommon. Can cause problems with data type. | | Avoid |
115 | | \-, +, . | Uncommon. Can cause problems with data type. | | Avoid |
116 |
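To see how a numeric placeholder such as -999 can be "inadvertently entered into calculations", consider this minimal Python sketch. The `members` values and the `mean_ignoring` helper are hypothetical; the point is only that the software has no way to know -999 means "missing" unless it is told.

```python
def mean_ignoring(values, null_markers=()):
    """Mean of the numeric cells, skipping any cell listed in null_markers."""
    numbers = [float(v) for v in values if v not in null_markers]
    return sum(numbers) / len(numbers)

# Hypothetical 'household members' column; -999 marks a missing answer.
members = ["4", "7", "-999", "5"]

print(mean_ignoring(members))                          # -245.75
print(mean_ignoring(members, null_markers=("-999",)))  # about 5.33
```

The first result is obviously wrong here, but with a less extreme placeholder, or a larger dataset, the distortion could easily go unnoticed.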
117 | ## Using formatting to convey information {#formatting}
118 |
119 | **Example**: highlighting cells, rows or columns that should be excluded from an analysis, and leaving blank rows to indicate separations in data.
120 |
121 | {alt='formatting'}
122 |
123 | **Solution**: create a new field to encode which data should be excluded.
124 |
125 | {alt='good formatting'}
126 |
127 | ## Using formatting to make the data sheet look pretty {#formatting-pretty}
128 |
129 | **Example**: merging cells.
130 |
131 | **Solution**: If you're not careful, formatting a worksheet to be more aesthetically pleasing can compromise your computer's ability to
132 | see associations in the data. Merged cells will make your data unreadable by statistics software. Consider restructuring your data in
133 | such a way that you will not need to merge cells to organize your data.
134 |
135 | ## Placing comments or units in cells {#units}
136 |
137 | **Example**: Some of your informants only irrigate their plots at certain times of the year. You've added this information as notes directly into the cell with the data.
138 |
139 | **Solution**: Most analysis software can't see Excel or LibreOffice comments, and would be confused by comments placed within your data
140 | cells. As described above for formatting, create another field if you need to add notes to cells. Similarly, don't include units in
141 | cells: ideally, all the measurements you place in one column should be in the same unit, but if for some reason they aren't, create
142 | another field and specify the units the cell is in.
143 |
144 | {alt='comments in cells'}
145 |
146 | ## Entering more than one piece of information in a cell {#info}
147 |
148 | **Example**: Your informant has multiple livestock of different types. You record this information as "3, (oxen , cows)" to indicate that there are three total livestock, which is a mixture of oxen and cows.
149 |
150 | **Solution**: Don't include more than one piece of information in a cell. This will limit the ways in which you can analyze your data.
151 | If you need both these types of information (the total number of animals and the types), design your data sheet to include this information. For example, include a separate column for each type of livestock.
152 |
153 | ## Using problematic field names {#field-name}
154 |
155 | Choose descriptive field names, but be careful not to include spaces or special characters of any kind, and don't begin a name with a number. Spaces can be
156 | misinterpreted by parsers that use whitespace as delimiters and some programs don't like field names that are text strings that start
157 | with numbers.
158 |
159 | Underscores (`_`) are a good alternative to spaces. Consider writing names in camel case (like this: ExampleFileName) to improve
160 | readability. Remember that abbreviations that make sense at the moment may not be so obvious in 6 months, but don't overdo it with names
161 | that are excessively long. Including the units in the field names avoids confusion and enables others to readily interpret your variable names. Avoid starting variable names with numbers, as this may cause problems with some analysis software.
162 |
163 | **Examples**
164 |
165 | | Good Name | Good Alternative | Avoid |
166 | |--------------|-------------------|----------------|
167 | | wall_type | WallType | wall type |
168 | | longitude | GpsLongitude | gps:Longitude |
169 | | gender | gender | M/F |
170 | | Informant_01 | first_informant | 1st Inf |
171 | | age_18 | years18 | 18years |
172 |
173 | ## Using special characters in data {#special}
174 |
175 | **Example**: You treat your spreadsheet program as a word processor when writing notes, for example copying data directly from Word or
176 | other applications.
177 |
178 | **Solution**: This is a common strategy. For example, when writing longer text in a cell, people often include line breaks, em-dashes,
179 | etc. in their spreadsheet. Also, when copying data in from applications such as Word, formatting and fancy non-standard characters (such
180 | as left- and right-aligned quotation marks) are included. When exporting this data into a coding/statistical environment or into a
181 | relational database, dangerous things may occur, such as lines being cut in half and encoding errors being thrown.
182 |
183 | General best practice is to avoid adding characters such as newlines, tabs, and vertical tabs. In other words, treat a text cell as if
184 | it were a simple web form that can only contain text and spaces.
185 |
186 |
187 |
188 | :::::::::::::::::::::::::::::::::::::::: keypoints
189 |
190 | - Avoid using multiple tables within one spreadsheet.
191 | - Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).
192 | - Record zeros as zeros.
193 | - Use an appropriate null value to record missing data.
194 | - Don't use formatting to convey information or to make your spreadsheet look pretty.
195 | - Place comments in a separate column.
196 | - Record units in column headers.
197 | - Include only one piece of information in a cell.
198 | - Avoid spaces, leading numbers and special characters in column headers.
199 | - Avoid special characters in your data.
200 |
201 | ::::::::::::::::::::::::::::::::::::::::::::::::::
202 |
203 |
204 |
--------------------------------------------------------------------------------
/episodes/03-dates-as-data.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Dates as Data
3 | teaching: 10
4 | exercises: 10
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Recognise problematic or suspicious date formats.
10 | - Use formulas to separate dates into their component values (e.g., Year, Month, Day).
11 |
12 | ::::::::::::::::::::::::::::::::::::::::::::::::::
13 |
14 | :::::::::::::::::::::::::::::::::::::::: questions
15 |
16 | - What are good approaches for handling dates in spreadsheets?
17 |
18 | ::::::::::::::::::::::::::::::::::::::::::::::::::
19 |
20 | ## Date formats in spreadsheets
21 |
22 | Dates in spreadsheets are often stored in a single column.
23 |
24 | While this seems like a logical way to record dates when you are entering them, or visually reviewing data, it's not actually a best practice for preparing data for analysis.
25 |
26 | When working with data, your goal is to have as little ambiguity as possible. Ambiguity can creep into your dates through regional variations in how observations are recorded, or when you or your team work with different versions or suites of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric).
27 |
28 | To avoid ambiguity between regional differences in date formatting and compatibility across spreadsheet software programs, a good practice is to divide dates into components in different columns - YEAR, MONTH, and DAY.
29 |
30 | When working with dates it's also important to remember that functions are guaranteed to be compatible only within the same family of software products (e.g., LibreOffice, Microsoft Excel, Gnumeric). If you need to export your data and conserve the timestamps, you are better off handling dates using one of the solutions discussed below than the single column method.
31 |
32 | One of the other reasons dates can be tricky is that most spreadsheet programs have "useful features" which can change the way dates are displayed - but not stored. The image below demonstrates some of the many date formatting options in Excel.
33 |
34 | {alt='Many formats, many ambiguities'}
35 |
36 | ## Dates stored as integers
37 |
38 | The first thing you need to know is that Excel stores dates as numbers - see the last column in the above figure. This serial number represents the number of days from December 31, 1899. In the example, July 2, 2014 is stored as the serial number 41822.
39 |
40 | Using functions we can add days, months or years to a given date.
41 | Say you had a research plan where you needed to conduct interviews with a
42 | set of informants every ninety days for a year.
43 |
44 | In our example above, in a new cell you can type:
45 |
46 | \=B2+90
47 |
48 | And it would return
49 |
50 | 30-Sep
51 |
52 | because it understands the date as a number `41822`, and `41822 + 90 = 41912`
53 | which Excel interprets as the 30th day of September, 2014. In most cases, it retains the format of the cell that is being operated upon. Month and year rollovers are internally tracked and applied.
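
The arithmetic above can be reproduced outside the spreadsheet. The following is a minimal Python sketch, not part of the lesson data. It assumes the workbook uses Excel's default 1900 date system. Because Excel counts a 29 February 1900 that never existed, the sketch uses 30 December 1899 as the effective day zero for modern dates. The name `serial_to_date` is hypothetical.

```python
from datetime import date, timedelta

# Assumption: Excel's default "1900" date system. Excel counts a
# nonexistent 29 February 1900, so for dates after February 1900 the
# effective day zero is 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial: int) -> date:
    """Convert an Excel serial number (1900 date system) to a date."""
    return EXCEL_EPOCH + timedelta(days=serial)


start = serial_to_date(41822)
follow_up = serial_to_date(41822 + 90)  # the same step as =B2+90

print(start.isoformat())      # 2014-07-02
print(follow_up.isoformat())  # 2014-09-30
```

Month and year rollovers fall out of the day count, which is why adding 90 to the serial number lands on the correct calendar date.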
54 |
55 | ## Regional date formatting
56 |
57 | When you enter a date into a spreadsheet it looks like a date although the spreadsheet program may
58 | display different text from what you input. It does this to be 'helpful' but it often is not.
59 |
60 | For example if you enter '7/12/88' into your
61 | Excel spreadsheet it may display as '07/12/1988' (depending on your version of Excel). These
62 | are different ways of formatting the same date.
63 |
64 | Different countries also write dates differently. If you are in the UK, for example, you will interpret
65 | the date above as the 7th day of December; however, a researcher from the US will interpret the same entry as the 12th day of July. This regional variation is handled automatically by your
66 | spreadsheet program so that when you are typing in dates they appear as you would expect. If you
67 | try to type in a US format date into a UK version of Excel, it may or may not be treated as a
68 | date.
69 |
70 | This regional variation is one good reason to treat dates, not as a single data point, but as
71 | three distinct pieces of data (year, month, and day). Separating dates into their component parts
72 | will avoid this confusion, while also giving the added benefit of allowing you to compare, for
73 | example data collected in January of multiple years with data collected in February of multiple years.
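
To make the ambiguity concrete, here is a small Python sketch (an illustration only, using the standard `datetime` module) that reads the same text entry under the two regional conventions and then splits one reading into separate components.

```python
from datetime import datetime

entry = "7/12/88"

# The same text read day-first (e.g. UK) and month-first (e.g. US).
day_first = datetime.strptime(entry, "%d/%m/%y").date()
month_first = datetime.strptime(entry, "%m/%d/%y").date()

print(day_first.isoformat())    # 1988-12-07
print(month_first.isoformat())  # 1988-07-12

# Storing year, month and day separately removes the ambiguity.
year, month, day = day_first.year, day_first.month, day_first.day
print(year, month, day)  # 1988 12 7
```

Once the components are stored in their own columns, no reader or program has to guess which convention was intended.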
74 |
75 | ::::::::::::::::::::::::::::::::::::::: challenge
76 |
77 | ## Separating dates into components
78 |
79 | Download and open the [SAFI\_dates.xlsx](https://ndownloader.figshare.com/files/11502827) file. This file
80 | contains a subset of the data from the SAFI interviews, including the dates on which the
81 | interviews were conducted.
82 |
83 | Choose the tab of the spreadsheet that corresponds to the way you format dates in your
84 | location (either day first `DD_MM_YEAR`, or month first `MM_DD_YEAR`).
85 |
86 | Extract the components of the date to new columns. For this we
87 | can use the built-in Excel functions:
88 |
89 | `=YEAR()`
90 | `=MONTH()`
91 | `=DAY()`
92 |
93 | Apply each of these formulas to its entire column.
94 | Make sure the new column is formatted as a number and not as a date.
95 |
96 | We now have each component of our date isolated in its own column. This will allow us
97 | to group our data with respect to year, month, or day of month for our analyses and will
98 | also prevent problems when passing data between different versions of spreadsheet
99 | software (as for example when sharing data with collaborators in different countries).
100 |
101 | ::::::::::::::: solution
102 |
103 | ## Solution
104 |
105 | {alt='dates exercise 1'}
106 |
107 | Note that this solution shows the dates in `MM_DD_YEAR` format.
108 |
109 |
110 |
111 | :::::::::::::::::::::::::
112 |
113 | ::::::::::::::::::::::::::::::::::::::::::::::::::
114 |
115 | ::::::::::::::::::::::::::::::::::::::: challenge
116 |
117 | ## Default year
118 |
119 | Using the same spreadsheet you used for the previous exercise, add another data point
120 | in the `interview_date` column by typing either `11/17` (if your location uses `MM/DD` formatting)
121 | or `17/11` (if your location uses `DD/MM` formatting). The `Year`, `Month`, and `Day` columns
122 | should populate for this new data point. What year is shown in the `Year` column?
123 |
124 | ::::::::::::::: solution
125 |
126 | ## Solution
127 |
128 | If no year is specified, the spreadsheet program will assume you mean the current year
129 | and will insert that value. This may be incorrect if you are working with historical data so
130 | be very cautious when working with data that does not have a year specified within its date
131 | variable.
132 |
133 |
134 |
135 | :::::::::::::::::::::::::
136 |
137 | ::::::::::::::::::::::::::::::::::::::::::::::::::
138 |
139 | ## Historical data
140 |
141 | Excel is unable to parse dates from before 1899-12-31, and will thus leave these untouched. If you're mixing historic data
142 | from before and after this date, Excel will translate only the post-1900 dates into its internal format, thus resulting in mixed data. If you're working with historic data, be extremely careful with your dates!
143 |
144 |
145 |
146 | :::::::::::::::::::::::::::::::::::::::: keypoints
147 |
148 | - Use extreme caution when working with date data.
149 | - Splitting dates into their component values can make them easier to handle.
150 |
151 | ::::::::::::::::::::::::::::::::::::::::::::::::::
152 |
153 |
154 |
--------------------------------------------------------------------------------
/episodes/04-quality-assurance.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Quality Assurance
3 | teaching: 15
4 | exercises: 10
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Apply quality assurance techniques to limit incorrect data entry.
10 |
11 | ::::::::::::::::::::::::::::::::::::::::::::::::::
12 |
13 | :::::::::::::::::::::::::::::::::::::::: questions
14 |
15 | - How can we carry out basic quality assurance in spreadsheets?
16 |
17 | ::::::::::::::::::::::::::::::::::::::::::::::::::
18 |
19 | When you have a well-structured data table, you can use several simple
20 | techniques within your spreadsheet to ensure the data you enter is
21 | free of errors.
22 |
23 | ## Validating data on input
24 |
25 | When we input data into a cell of a spreadsheet we are typically not constrained in the type of data we enter.
26 | In any one column, the spreadsheet software will not warn us if we start to enter a mix of text, numbers or dates in different rows.
27 | Even if we are not facing constraints from the software, as a researcher we often anticipate that all data in one column will be of a certain type.
28 | It is also possible that the nature of the data contained in the table allows us to place additional restrictions on the acceptable values for cells in a column.
29 | For example, a column recording age in years should be numeric, greater than 0, and is unlikely to be greater than 120.
30 |
31 | Excel allows us to specify a variety of data validations to be applied to cell contents.
32 | If the validation fails, an error is raised and the data we entered does not go into the particular cell.
33 |
34 | We will be working with a couple of examples of data validation
35 | rules but many others exist. For an overview of data validation rules
36 | available, check out the [Excel support page on data validation](https://support.office.com/en-us/article/Apply-data-validation-to-cells-29FECBCC-D1B9-42C1-9D76-EFF3CE5F7249) or the [Validating cell contents section of the LibreOffice Calc Guide](https://books.libreoffice.org/en/CG24/CG2402-EnteringandEditingData.html#toc28).
37 |
38 | We will look at two examples:
39 |
40 | 1. Restricting data to a numeric range
41 | 2. Restricting data to entries from a list
42 |
43 | ### Restricting data to a numeric range
44 |
45 | First, we'll open the [clean version of the SAFI dataset](https://ndownloader.figshare.com/files/11492171),
46 | which is a CSV file. CSV files are plain text files where the columns are separated
47 | by commas, hence 'comma separated values' or CSV. CSV is a format commonly used for tabular data,
48 | which we will discuss further in the next episode.
49 |
50 | To open this CSV file, one option is to double-click the file once it's in your Downloads folder.
51 | However, doing this can lead to different results depending on your computer's configuration. To avoid this,
52 | the following box shows a more reliable method to load the data in Excel or Calc.
53 |
54 | ::: group-tab
55 |
56 | ### Excel
57 |
58 | 1\. Open Excel and start a blank workbook.
59 |
60 | 2\. Select the `Data` tab. In the `Get & Transform Data` group, choose `Get Data` > `From File` > `From Text/CSV`
61 |
62 | 3\. In the pop-up window, navigate to the folder that contains your file, select the file, and click `Open`.
63 |
64 | 4\. In the new window, make sure the Delimiter is set to `Comma` at the top. Review the data preview. If everything looks correct, click `Load`.
65 |
66 | {alt='Load CSV in Excel'}
67 |
68 | ### Calc
69 |
70 | 1\. Open LibreOffice Calc.
71 |
72 | 2\. Click `File` > `Open...`
73 |
74 | 3\. In the pop-up window, navigate to the folder that contains your file, select the file, and click `Open`.
75 |
76 | 4\. In the new window, you'll see several options. For now, make sure that under `Separator Options`, only `Separated by` > `Comma` is selected. Review the data preview. If everything looks correct, click `OK`.
77 |
78 | {alt='Load CSV in Calc'}
79 |
80 | :::
81 |
82 | When we open the file, we see that there are
83 | several columns with numeric data. One example of this is the column `no_membrs`
84 | representing the number of people in the household. We would expect this always
85 | to be a positive integer, and so we should reject values like `1.5` and `-8` as
86 | entry errors. We would also reject values over a certain maximum - for example
87 | an entry like `90` is probably the result of the researcher inputting `9` and
88 | their finger slipping and also hitting the `0` key. It is up to you as the
89 | researcher to decide what a reasonable maximum value would be for your data;
90 | here we will assume that there are no families with more than 30 members.
91 |
92 | Let's start by opening the data validation feature using the `no_membrs` column.
93 |
94 | ::: group-tab
95 |
96 | ### Excel
97 |
98 | 1\. Select the `no_membrs` column.
99 |
100 | 2\. Select the `Data` tab, and in the `Data Tools` group, select `Data Validation` or `Validation Tools` (depending on your version of Excel). The following pop-up will appear:
101 |
102 | {alt='Image of data validation tab in Excel'}
103 |
104 | 3\. Select 'Whole number' from the `Allow` drop down options.
105 |
106 | 4\. The window content will change.
107 | In the `Data` drop down box, check that 'between' is selected. `Minimum` and `Maximum` boxes will be provided for you to specify an allowed range. You will see this:
108 |
109 | {alt='Image of data validation tab for number rules in Excel'}
110 |
111 | 5\. Fill in the minimum and maximum values that make sense for your data and click `OK`. Here we will choose a minimum of 1 and a maximum of 30.
112 |
113 | ### Calc
114 |
115 | 1\. Select the `no_membrs` column.
116 |
117 | 2\. On the `Data` tab select `Validity...`. The following pop-up will appear:
118 |
119 | {alt='Image of data validation tab in LibreOffice'}
120 |
121 | 3\. Select 'Whole Numbers' from the `Allow` drop down options.
122 |
123 | 4\. The window content will change.
124 | In the `Data` drop down box, check that 'valid range' is selected. `Minimum` and `Maximum` boxes will be provided for you to specify an allowed range. You will see this:
125 |
126 | {alt='Image of data validation tab in LibreOffice'}
127 |
128 | 5\. Fill in the minimum and maximum values that make sense for your data and click `OK`. Here we will choose a minimum of 1 and a maximum of 30.
129 |
130 | :::
131 |
132 |
133 | Now your data table will not allow you to enter a value that violates
134 | the data validation rule you have created. To test this out, try
135 | to enter a new value into the `no_membrs` column that is not valid.
136 | The following error box will appear:
137 |
138 | ::: group-tab
139 |
140 | ### Excel
141 |
142 | {alt='Image of error message for inputing invalid data in Excel'}
143 |
144 | ### Calc
145 |
146 | {alt='Image of error message for inputing invalid data in LibreOffice'}
147 |
148 | :::
149 |
150 |
151 | You can also customize the resulting message to be more informative by entering
152 | your own message in the `Error Alert` tab when creating a data validation rule.
153 |
154 | ::: group-tab
155 |
156 | ### Excel
157 |
158 | Check that the `Style` is 'Stop'. You can write 'Invalid number' as the `Title`
159 | and 'Number of household members must be a whole number between 1 and 30' as the `Error Message`.
160 |
161 | {alt='Image of Error Alert tab in Excel'}
162 |
163 | Now check what happens if you try to enter an invalid value.
164 |
165 | ### Calc
166 |
167 | Check that the `Action` is 'Stop'. You can write 'Invalid number' as the `Title`
168 | and 'Number of household members must be a whole number between 1 and 30' as the `Error Message`.
169 |
170 | {alt='Image of Error Alert tab in LibreOffice'}
171 |
172 |
173 | Now check what happens if you try to enter an invalid value.
174 |
175 | :::
176 |
177 |
178 | You can also have an `Input message` that tells users of the spreadsheet which values
179 | are accepted in a cell that has data validation.
180 |
181 | ::: group-tab
182 |
183 | ### Excel
184 |
185 | Select the `Input Message` tab. Add the title and input message that is convenient for
186 | your task. In this example, we will write 'Household members' and 'Please enter a whole number between 1 and 30'.
187 |
188 | {alt='Image of Input Message tab in Excel'}
189 |
190 | Now check what happens when you select a cell that has data validation.
191 |
192 | ### Calc
193 |
194 | Select the `Input Help` tab. Add the title and input message that is convenient for
195 | your task. In this example, we will write 'Household members' and 'Please enter a whole number between 1 and 30'.
196 |
197 | {alt='Image of Input Message tab in LibreOffice'}
198 |
199 | Now check what happens when you select a cell that has data validation.
200 |
201 | :::
202 |
203 |
204 | ::::::::::::::::::::::::::::::::::::::: challenge
205 |
206 | ## Exercise
207 |
208 | Apply a new data validation rule to one of the other numeric
209 | columns in this data table. Discuss with the person sitting next
210 | to you what a reasonable rule would be for the column you've selected. Be sure to create an informative error alert and input message.
211 |
212 |
213 | ::::::::::::::::::::::::::::::::::::::::::::::::::
214 |
215 | ### Restricting data to entries from a list
216 |
217 | Quality assurance can make data entry easier as well as more robust. For
218 | example, if you use a list of options to restrict data entry, the spreadsheet
219 | will provide you with a drop-down list of the available items. So, instead of
220 | trying to remember how to spell "mabatisloping", or whether or not you capitalized "cement", you can select the
221 | right option from the list.
222 |
223 | ::: group-tab
224 |
225 | ### Excel
226 |
227 | 1\. Select the `respondent_wall_type` column.
228 |
229 | 2\. Select the `Data` tab, and in the `Data Tools` group, select `Data Validation` or `Validation Tools` (depending on your version of Excel).
230 |
231 | 3\. Select `List` from the `Allow` drop-down menu.
232 |
233 | 4\. The window will change to include a `Source` box, you will see:
234 |
235 | {alt='Image of selecting a range of values to allow in Excel'}
236 |
237 | 5\. Type a list of all the values that you want to be accepted in this column, separated by a comma (with no spaces). For us this will be "grass,muddaub,burntbricks,sunbricks,cement".
238 |
239 | 6\. Create a meaningful error alert and input message, then click 'OK'. In LibreOffice, there is no need to create an input message.
240 |
241 |
242 | ### Calc
243 |
244 | 1\. Select the `respondent_wall_type` column.
245 |
246 | 2\. On the `Data` tab select `Validity...`.
247 |
248 | 3\. Select `List` from the `Allow` drop-down menu.
249 |
250 | 4\. The window will change to include an `Entries` box, you will see:
251 |
252 | {alt='Image of selecting a range of values to allow in LibreOffice'}
253 |
254 | 5\. Type a list of all the values that you want to be accepted in this column, and insert a new line by clicking enter after each value. Make sure not to include spaces before or after the values. Your entries of grass, muddaub, burntbricks, sunbricks and cement should look like this:
255 |
256 | {alt='Image of filled in range of values to allow in LibreOffice'}
257 |
258 | 6\. Create a meaningful error alert and input message, then click 'OK'.
259 |
260 | :::
261 |
262 | We have now provided a restriction that will be validated each time we try and
263 | enter data into the selected cells. When a cell in this column is selected, a drop-down arrow will appear.
264 | When you click the arrow, you will be able to select a value from your list.
265 | If you type a value which is not on the list, you will get an error message. This not only prevents data input errors, but also makes it easier and faster to enter data.
266 |
267 | ::::::::::::::::::::::::::::::::::::::: challenge
268 |
269 | ## Exercise
270 |
271 | Apply a new data validation rule to one of the other categorical
272 | columns in this data table. Discuss with the person sitting next
273 | to you what a reasonable rule would be for the column you've selected. Be sure to create an informative input message.
274 |
275 |
276 | ::::::::::::::::::::::::::::::::::::::::::::::::::
277 |
278 | ::::::::::::::::::::::::::::::::::::::::: callout
279 |
280 | ## Tip
281 |
282 | Typing a list of values where only a few possible values exist (like "grass, muddaub, burntbricks, sunbricks, cement") might be convenient, but if the list is longer it makes sense to create it as a small table (in a separate tab of the workbook).
283 | We can give the table a name and then reference the table name as the source of acceptable inputs when the source box appears in the Data Validation pop-out.
284 |
285 | Using a table in this way makes the data entry process more flexible.
286 | If you add or remove contents from the table, then these are immediately reflected in any new cell entries based on this source.
287 | You can also have different cells refer to the same table of acceptable inputs.
288 |
289 |
290 | ::::::::::::::::::::::::::::::::::::::::::::::::::
291 |
292 | ::::::::::::::::::::::::::::::::::::::::: callout
293 |
294 | ## Tip
295 |
296 | In the examples above we have applied data validation rules to
297 | an existing spreadsheet to demonstrate how they work. However,
298 | you may have noticed that data validation rules are not applied
299 | retroactively to data that is already present in the cell.
300 | This means, for example, that if we had already entered `150`
301 | in the `no_membrs` column before applying our data validation
302 | rule, that cell would not be flagged with a warning.
303 |
304 | In some versions of Excel, you can go to the `Data` tab and, in the `Data Tools` group,
305 | click the small drop-down arrow next to `Data Validation`, then select `Circle invalid data`. This will put red circles around invalid data entries. Note that it can be a bit slow with large data files. You can do the same in LibreOffice Calc by going to the `Tools` menu, then `Detective`, and
306 | selecting `Mark invalid data`.
307 |
308 | When using spreadsheets for data entry, it is a good idea to set up
309 | data validation rules for each column when you set up your
310 | spreadsheet (i.e. before you enter any data).
311 |
312 |
313 | ::::::::::::::::::::::::::::::::::::::::::::::::::
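
Because validation rules are not applied retroactively, values that were entered earlier may need a separate check. As a rough illustration of the same rule outside the spreadsheet, the Python sketch below tests whether each entry is a whole number between 1 and 30. The function name `is_valid_household_size` and the sample entries are hypothetical and are not taken from the SAFI dataset.

```python
def is_valid_household_size(value: str, minimum: int = 1, maximum: int = 30) -> bool:
    """Return True if value is a whole number within [minimum, maximum]."""
    try:
        number = int(value)
    except ValueError:
        # Text, blanks and decimals such as "1.5" are not whole numbers.
        return False
    return minimum <= number <= maximum


# Hypothetical entries for a column like no_membrs.
entries = ["3", "1.5", "-8", "90", "150", "12"]
invalid = [value for value in entries if not is_valid_household_size(value)]

print(invalid)  # ['1.5', '-8', '90', '150']
```

The bounds mirror the minimum of 1 and maximum of 30 chosen earlier; adjust them to whatever is reasonable for your own data.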
314 |
315 |
316 |
317 | :::::::::::::::::::::::::::::::::::::::: keypoints
318 |
319 | - Always copy your original spreadsheet file and work with a copy so you don't affect the raw data.
320 | - Use data validation to prevent accidentally entering invalid data.
321 |
322 | ::::::::::::::::::::::::::::::::::::::::::::::::::
323 |
324 |
325 |
--------------------------------------------------------------------------------
/episodes/05-exporting-data.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Exporting Data
3 | teaching: 10
4 | exercises: 5
5 | ---
6 |
7 | ::::::::::::::::::::::::::::::::::::::: objectives
8 |
9 | - Store spreadsheet data in universal file formats.
10 | - Export data from a spreadsheet to a CSV file.
11 |
12 | ::::::::::::::::::::::::::::::::::::::::::::::::::
13 |
14 | :::::::::::::::::::::::::::::::::::::::: questions
15 |
16 | - How can we export data from spreadsheets in a way that is useful for downstream applications?
17 |
18 | ::::::::::::::::::::::::::::::::::::::::::::::::::
19 |
20 | Storing the data you're going to work with for your analyses in Excel's
21 | default file format (`*.xls` or `*.xlsx`, depending on the Excel
22 | version) isn't a good idea. Why?
23 |
24 | - Because it is a proprietary format, and it is possible that in
25 |   the future the technology needed to open it won't exist (or will
26 |   become sufficiently rare), making it inconvenient, if not impossible, to open the file.
27 |
28 | - Other spreadsheet software may not be able to open files
29 | saved in a proprietary Excel format.
30 |
31 | - Different versions of Excel may handle data
32 | differently, leading to inconsistencies.
33 |
34 | - Finally, more journals and grant agencies are requiring you
35 | to deposit your data in a data repository, and most of them don't
36 | accept Excel format. It needs to be in one of the formats
37 | discussed below.
38 |
39 | - The above points also apply to other formats such as open data formats used by LibreOffice. These formats are not static and do not get parsed the same way by different software packages.
40 |
41 | As an example of inconsistencies in data storage, do you remember our earlier discussion about how Excel stores dates? It turns out that
42 | there are multiple defaults for different versions of the software, and you can switch between them all. So, say you're
43 | compiling Excel-stored data from multiple sources. There are dates in each file, and Excel interprets them as their own internally consistent
44 | serial numbers. When you combine the data, Excel will take the serial number from the place you're importing it from, and interpret it
45 | using the rule set for the version of Excel you're using. Essentially, you could be adding errors to your data, and it wouldn't
46 | necessarily be flagged by any data cleaning methods if your ranges overlap.
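
One concrete case of this is the pair of date systems Excel supports: the default 1900 system and the 1904 system that was the default in older Mac versions. The Python sketch below is an illustration under stated assumptions, not a description of how Excel is implemented. It shows how the same serial number maps to two different calendar dates depending on the system, using 30 December 1899 as the effective day zero of the 1900 system for modern dates.

```python
from datetime import date, timedelta

# Assumed day-zero values for Excel's two date systems.
EPOCH_1900 = date(1899, 12, 30)  # effective day zero of the 1900 system
EPOCH_1904 = date(1904, 1, 1)    # day zero of the 1904 system

serial = 41822  # the serial number from the dates episode

as_1900 = EPOCH_1900 + timedelta(days=serial)
as_1904 = EPOCH_1904 + timedelta(days=serial)

print(as_1900.isoformat())       # 2014-07-02
print(as_1904.isoformat())       # 2018-07-03
print((as_1904 - as_1900).days)  # 1462
```

A shift of about four years can still leave every date inside a plausible range, which is why this kind of error is easy to miss during data cleaning.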
47 |
48 | Storing data in a universal, open, and static format will help deal with this problem. Try tab-delimited (tab separated values
49 | or TSV) or comma-delimited (comma separated values or CSV). The advantage of a CSV file over an Excel/SPSS/etc. file is that we can open and read a CSV file
50 | using just about any software, including plain text editors like TextEdit or Notepad.
51 | Data in a CSV file can also be easily imported into other formats and
52 | environments, such as SQLite and R. We're not tied to a certain version of a certain expensive program when we work with CSV files, so
53 | it's a
54 | good format to work with for maximum portability and endurance. Most spreadsheet programs can save to delimited text formats like CSV
55 | easily, although they may give you a warning during the file export.
56 |
57 | To save a file you have opened in Excel in CSV format:
58 |
59 | 1. From the top menu select `File` and `Save as`.
60 | 2. In the `Format` field, from the list, select `Comma Separated Values` (`*.csv`).
61 | 3. Double check the file name and the location where you want to save it and hit `Save`.
62 |
63 | An important note for backwards compatibility: you can open CSV files in Excel!
64 |
65 | {alt='Saving an Excel file to CSV'}
66 |
67 | ::::::::::::::::::::::::::::::::::::::::: callout
68 |
69 | ## A note on R and `xls`
70 |
71 | There are R packages that can read `xls` files (as well as
72 | Google spreadsheets). It is even possible to access different
73 | worksheets in the `xls` documents. However, because these
74 | packages parse data tables from proprietary and non-static
75 | software, there is no guarantee that they will continue to
76 | work on new versions of Excel. Exporting your data to CSV or TSV
77 | format is much safer and more reproducible.
78 |
79 |
80 | ::::::::::::::::::::::::::::::::::::::::::::::::::
81 |
82 | ::::::::::::::::::::::::::::::::::::::::: callout
83 |
84 | ## What to do when your data contain commas
85 |
86 | In some datasets, the data values themselves may include commas (,). In that
87 | case, you need to make sure that the commas are properly escaped when saving
88 | the file. Otherwise, the software which you use (including Excel) will most
89 | likely incorrectly display the data in columns. This is because the commas
90 | which are a part of the data values will be interpreted as delimiters.
91 |
92 | If you are working with data that contains commas, the fields should be
93 | enclosed with double quotes. The spreadsheet software should do the right
94 | thing ([LibreOffice](https://www.libreoffice.org/download/download/) provides
95 | comprehensive options to import and export CSV files). However, it is always a
96 | good idea to double check that the file you are exporting can be read in
97 | correctly. For more of a discussion on data formats and potential issues with
98 | commas within datasets see [the Ecology Spreadsheets lesson discussion
99 | page](https://www.datacarpentry.org/spreadsheet-ecology-lesson/discuss).
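
As a quick way to see what properly quoted output looks like, the sketch below uses Python's standard `csv` module to write a value containing a comma and then read it back. The column names and values are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical rows; the second value in the data row contains a comma.
rows = [["key_id", "items_owned"], ["1", "bicycle, radio"]]

# Write to an in-memory buffer instead of a file to keep the sketch small.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back yields the original two fields per row.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)
```

With its default settings, the writer encloses only the fields that need it in double quotes, so the comma inside `bicycle, radio` is not treated as a delimiter when the file is read again.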
100 |
101 |
102 | ::::::::::::::::::::::::::::::::::::::::::::::::::
103 |
104 |
105 |
106 | :::::::::::::::::::::::::::::::::::::::: keypoints
107 |
108 | - Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.
109 | - Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
110 |
111 | ::::::::::::::::::::::::::::::::::::::::::::::::::
112 |
113 |
114 |
--------------------------------------------------------------------------------
/episodes/fig/bad-formatting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/bad-formatting.png
--------------------------------------------------------------------------------
/episodes/fig/better-formatting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/better-formatting.png
--------------------------------------------------------------------------------
/episodes/fig/comments-in-cells.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/comments-in-cells.png
--------------------------------------------------------------------------------
/episodes/fig/data-validation-numbers-LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/data-validation-numbers-LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/data-validation-numbers-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/data-validation-numbers-new.png
--------------------------------------------------------------------------------
/episodes/fig/data-validation-tab-LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/data-validation-tab-LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/data-validation-tab-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/data-validation-tab-new.png
--------------------------------------------------------------------------------
/episodes/fig/error-invalid-data-LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/error-invalid-data-LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/error-invalid-data-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/error-invalid-data-new.png
--------------------------------------------------------------------------------
/episodes/fig/error_alert-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/error_alert-new.png
--------------------------------------------------------------------------------
/episodes/fig/error_alert_LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/error_alert_LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/excel-to-csv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/excel-to-csv.png
--------------------------------------------------------------------------------
/episodes/fig/excel_dates_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/excel_dates_1.jpg
--------------------------------------------------------------------------------
/episodes/fig/filled-range-of-values-LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/filled-range-of-values-LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/input_message-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/input_message-new.png
--------------------------------------------------------------------------------
/episodes/fig/input_message_LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/input_message_LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/load-csv-calc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/load-csv-calc.png
--------------------------------------------------------------------------------
/episodes/fig/load-csv-excel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/load-csv-excel.png
--------------------------------------------------------------------------------
/episodes/fig/multiple-info.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/multiple-info.png
--------------------------------------------------------------------------------
/episodes/fig/multiple-tables-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/multiple-tables-example.png
--------------------------------------------------------------------------------
/episodes/fig/multiple-tables-example2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/multiple-tables-example2.png
--------------------------------------------------------------------------------
/episodes/fig/select-range-of-values-LibreOffice-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/select-range-of-values-LibreOffice-new.png
--------------------------------------------------------------------------------
/episodes/fig/select-range-of-values-new.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/select-range-of-values-new.png
--------------------------------------------------------------------------------
/episodes/fig/single-info.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/single-info.png
--------------------------------------------------------------------------------
/episodes/fig/solution_exercise_1_dates.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/solution_exercise_1_dates.png
--------------------------------------------------------------------------------
/episodes/fig/spreadsheet_simple_data_01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/spreadsheet_simple_data_01.png
--------------------------------------------------------------------------------
/episodes/fig/spreadsheets_Data_validation_05.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/spreadsheets_Data_validation_05.png
--------------------------------------------------------------------------------
/episodes/fig/white_table_1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/white_table_1.jpg
--------------------------------------------------------------------------------
/episodes/fig/zeros-example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/datacarpentry/spreadsheets-socialsci/264dbfaacdbefbe72f9d62db393e730813aeef52/episodes/fig/zeros-example.png
--------------------------------------------------------------------------------
/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | permalink: index.html
3 | site: sandpaper::sandpaper_site
4 | ---
5 |
6 | Good data organization is the foundation of any research project. Most
7 | researchers have data in spreadsheets, so it's the place that many research
8 | projects start.
9 |
10 | Typically we organize data in spreadsheets in ways that we as humans want to work with the data. However,
11 | computers require data to be organized in particular ways. In order
12 | to use tools that make computation more efficient, such as programming
13 | languages like R or Python, we need to structure our data the way that
14 | computers need the data. Since this is where most research projects start,
15 | this is where we want to start too!
16 |
17 | In this lesson, you will learn:
18 |
19 | - Good data entry practices - formatting data tables in spreadsheets
20 | - How to avoid common formatting mistakes
21 | - Approaches for handling dates in spreadsheets
22 | - Basic quality control and data manipulation in spreadsheets
23 | - Exporting data from spreadsheets
24 |
25 | In this lesson, however, you will *not* learn about data analysis with spreadsheets.
26 | Much of your time as a researcher will be spent in the initial 'data wrangling'
27 | stage, where you need to organize the data to perform a proper analysis later.
28 | It's not the most fun, but it is necessary. In this lesson you will
29 | learn how to think about data organization and some practices for more
30 | effective data wrangling. With this approach you can better format current data
31 | and plan new data collection so less data wrangling is needed.
32 |
33 | :::::::::::::::::::::::::::::::::::::::::: prereq
34 |
35 | ## Getting Started
36 |
37 | Data Carpentry's teaching is hands-on, so participants are encouraged to use
38 | their own computers to ensure the proper setup of tools for an efficient
39 | workflow. **These lessons assume no prior knowledge of the skills or tools.**
40 |
41 | To get started, follow the directions in the "[Setup](learners/setup.md)" tab to
42 | download data to your computer and follow any installation instructions.
43 |
44 | #### Prerequisites
45 |
46 | This lesson requires a working copy of spreadsheet software, such as Microsoft
47 | Excel or LibreOffice or OpenOffice.org (see more details in "[Setup](learners/setup.md)").
48 | To most effectively use these materials, please make sure to install
49 | everything *before* working through this lesson.
50 |
51 |
52 | ::::::::::::::::::::::::::::::::::::::::::::::::::
53 |
54 | :::::::::::::::::::::::::::::::::::::::::: prereq
55 |
56 | ## For Instructors
57 |
58 | If you are teaching this lesson in a workshop, please see the
59 | [Instructor notes](instructors/instructor-notes.md).
60 |
61 |
62 | ::::::::::::::::::::::::::::::::::::::::::::::::::
63 |
64 |
65 |
--------------------------------------------------------------------------------
/instructors/instructor-notes.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Instructor Notes
3 | ---
4 |
5 | ## Instructor notes
6 |
7 | ## Lesson motivation and learning objectives
8 |
9 | The purpose of this lesson is not to teach how to do data analysis in spreadsheets,
10 | but to teach good data organization and how to do some data cleaning and
11 | quality control in a spreadsheet program.
12 |
13 | ## Lesson design
14 |
15 | #### [Introduction](../episodes/00-intro.md)
16 |
17 | - Introduce that we're teaching data organization, and that we're using
18 | spreadsheets, because most people do data entry in spreadsheets or
19 | have data in spreadsheets.
20 | - Emphasize that we are teaching good practice in data organization and that
21 | this is the foundation of their research practice. Without organized and clean
22 | data, it will be difficult for them to apply the things we're teaching in the
23 | rest of the workshop to their data.
24 | - Much of their lives as researchers will be spent on this 'data wrangling' stage, but
25 | some of it can be prevented with good strategies for data collection up front.
26 | - Tell learners that we're not teaching data analysis or plotting in spreadsheets, because it's
27 | very manual and also not reproducible. That's why we're teaching SQL, R, Python!
28 | - Now let's talk about spreadsheets, and when we say spreadsheets, we mean any program that
29 | does spreadsheets like Excel or LibreOffice. Most learners are probably using Excel.
30 | - Ask the audience about things they've accidentally done in spreadsheets. Talk about an example of your own, such as accidentally sorting only a single column and not the rest
31 | of the data in the spreadsheet. What are the pain points?
32 | - As people answer, highlight some of these issues with spreadsheets.
33 |
34 | #### [Formatting data](../episodes/01-format-data.md)
35 |
36 | - Introduce the dataset that will be used in this lesson, and in the other Social Sciences lessons, the [Studying African Farmer-led Irrigation (SAFI) Dataset](https://www.datacarpentry.org/socialsci-workshop/data).
37 | - Go through the point about keeping track of your steps and keeping raw data raw
38 | - Go through the cardinal rule of spreadsheets about columns, rows and cells
39 | - Hand them a messy data file and have them pair up and work together to clean up the data.
40 | *Give them 15 minutes to do this.*
41 | - Learners who are using LibreOffice for the workshop will have problems with the dataset
42 | as the default for LibreOffice is to treat tabs, commas, and semicolons as delimiters. This
43 | can be fixed when opening LibreOffice by deselecting the "semicolons" and "tabs" checkboxes.
44 | - Ask what people did to clean the data. As they bring up different points you can
45 | refer to them in the [Common formatting problems](../episodes/02-common-mistakes.md) file, or expand a bit on the point they brought up.
46 | All these mistakes are present in the messy
47 | dataset.
48 | - If you get a response where they've fixed the date, you can pause and go to the
49 | [dates](../episodes/03-dates-as-data.md) lesson. Or you can say you'll come back to dates at the end.
50 | There's an exercise in that file about how to change the
51 | date into three columns using Excel's built-in MONTH, DAY, and YEAR functions. Have them
52 | run through that exercise.
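
For instructors who want to show the same idea outside the spreadsheet, the split into three columns can be sketched in Python. This is an illustration only, not part of the exercise; the sample values and the `split_date` helper are hypothetical, and the `DD/MM/YYYY` layout is an assumption about how the dates are written.

```python
from datetime import datetime


def split_date(text, fmt="%d/%m/%Y"):
    """Parse a date string and return (day, month, year) as integers."""
    parsed = datetime.strptime(text, fmt)
    return parsed.day, parsed.month, parsed.year


# Hypothetical sample values written day first.
for value in ["17/11/2016", "21/11/2016"]:
    day, month, year = split_date(value)
    print(value, "->", day, month, year)
```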
53 |
54 | #### [Common formatting problems](../episodes/02-common-mistakes.md)
55 |
56 | - **Don't go through this chapter** except to refer to as responses to the exercise in
57 | the previous chapter.
58 |
59 | #### [Dates as data](../episodes/03-dates-as-data.md)
60 |
61 | - Do the exercise and make the point about dates either in response to a learner bringing
62 | up date as an issue during the responses, or at the end of the response time.
63 | - If learners are using a non-English language version of Excel, the `=MONTH()`, `=DAY()`, and other date
64 | functions won't work for them. They will need to type in their language's equivalent of that word in the formula.
65 | - The spreadsheet for this episode has two tabs. The first tab is data stored as `DD-MM-YYYY`,
66 | the second is `MM-DD-YYYY`. If learners use the wrong tab for their location, they will get a `#VALUE!` error.
67 | - When using LibreOffice, it is helpful to first save the file in ods format. Then be sure to convert
68 | the date column to type date by right-clicking on the cell, choosing "Format Cells...", then choosing Date and
69 | selecting a date type that uses `DD/MM/YYYY`, such as English (Botswana). Once you click OK, you will find that
70 | the date has been prepended with an apostrophe. For example, 21/11/2016 becomes '21/11/2016. Edit the cell to
71 | remove the apostrophe. You will then find that the day(), month() and year() functions work.
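
The reason the two tabs matter can be shown with a short Python sketch. It is not part of the lesson files; the sample strings and the `parse_both_ways` helper are hypothetical. The same text is read under a day-first and a month-first layout: some values parse under both layouts but mean different dates, while others are only valid under one.

```python
from datetime import datetime


def parse_both_ways(text):
    """Try day-first and month-first layouts; map each layout to an ISO date or None."""
    results = {}
    for label, fmt in [("DD-MM-YYYY", "%d-%m-%Y"), ("MM-DD-YYYY", "%m-%d-%Y")]:
        try:
            results[label] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            results[label] = None  # not a valid date under this layout
    return results


# Ambiguous: valid under both layouts, but as two different dates.
print(parse_both_ways("05-04-2017"))
# Unambiguous: there is no month 21, so only the day-first layout works.
print(parse_both_ways("21-11-2016"))
```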
72 |
73 | #### [Quality assurance](../episodes/04-quality-assurance.md)
74 |
75 | The challenge with this lesson is that the instructor's version of the spreadsheet software is going to look different from that of about half the room. This makes
76 | it challenging to show where you can find menu options and navigate through.
77 |
78 | Instead discuss the concepts of quality control, and how things like sorting can help you find outliers in your data.
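
The sorting idea can be illustrated independently of any particular menu layout. The sketch below uses Python with made-up household sizes; the values and the `extremes` helper are hypothetical. Once the values are sorted, suspicious entries collect at one end of the list.

```python
def extremes(values, n=3):
    """Return the n smallest and n largest values after sorting."""
    ordered = sorted(values)
    return ordered[:n], ordered[-n:]


# Made-up household sizes; 70 is probably a typo for 7.
household_sizes = [3, 7, 4, 70, 5, 2, 6, 4]
lowest, highest = extremes(household_sizes)
print("lowest:", lowest)
print("highest:", highest)
```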
79 |
80 | #### [Exporting data](../episodes/05-exporting-data.md)
81 |
82 | - Have the students export their cleaned data as CSV. Reiterate again the need for
83 | data in this format for the other tools we'll be using.
84 |
85 | #### Concluding points
86 |
87 | - Now your data is organized so that a computer can read and understand it. This
88 | lets you use the full power of the computer for your analyses as we'll see in the
89 | rest of the workshop.
90 | - While your data is now neatly organized, it still might have errors or missing data
91 | or other problems. It's like you put all your data in the right drawers, but the
92 | drawers might still be messy. The next lesson is going to teach you OpenRefine which
93 | is great for data cleaning and for some of the quality control that we touched on
94 | in this lesson. It also has the advantage that it automatically keeps track of the
95 | steps you take.
96 |
97 | ## Technical tips and tricks
98 |
99 | Provide information on setting up your environment for learners to view your
100 | live coding (increasing text size, changing text color, etc), as well as
101 | general recommendations for working with coding tools to best suit the
102 | learning environment.
103 |
104 | ## Common problems
105 |
106 | #### Excel looks and acts different on different operating systems
107 |
108 | The main challenge with this lesson is that Excel looks very different, and even how you
109 | do things differs, between Mac and PC and between different versions of
110 | Excel. So the presenter's environment will only be the same as that of some of the learners.
111 |
112 | We need better notes and screenshots of how things work on both Mac and PC. But we
113 | likely won't be able to cover all the different versions of Excel.
114 |
115 | If you have a helper who has experience with a different OS than yours, it would be good
116 | to prep them to help with this lesson and tell people how to do things in the other OS.
117 |
118 | #### Apple Numbers
119 |
120 | Apple Numbers does not have data validation, which is needed for part of this lesson. A note
121 | is included in the setup instructions pointing Numbers users to either Microsoft Excel
122 | or LibreOffice.
123 |
124 | #### People are not interactive or responsive on the Exercise
125 |
126 | This lesson depends on people working on the exercise and responding with things
127 | that are fixed. If your audience is reluctant to participate, start out with
128 | some things on your own, or ask a helper for their answers. This generally gets
129 | even a reluctant audience started.
130 |
131 | ## Common questions raised by participants
132 |
133 | ### How do you extract date components from the interview\_date field in SAFI\_clean.csv?
134 |
135 | The interview\_date field in SAFI\_clean.csv when saved to SAFI\_clean.xlsx is difficult to
136 | manage because there isn't a way to format the column as a date field, even using the
137 | custom field formats. The easiest solution to this question is to show the student how to
138 | extract the date information from the field. Make a new column and format it as a date.
139 | In the first cell of the new column type =LEFT(C2,10) and then apply this to the column.
140 | This function extracts the first 10 characters from the left side of the interview\_date
141 | field and inserts them into a new column.
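
The same extraction can be sketched in Python for comparison. This is a minimal illustration, not part of the lesson; the sample value is an assumed example of a timestamp whose first 10 characters are a `YYYY-MM-DD` date, and the `extract_date` helper is hypothetical. Check the actual contents of the column before relying on it.

```python
from datetime import date


def extract_date(timestamp):
    """Keep the first 10 characters, like LEFT(C2,10), and parse them as a date."""
    return date.fromisoformat(timestamp[:10])


# Assumed example value, not copied from the dataset.
sample = "2016-11-17T00:00:00Z"
interview_day = extract_date(sample)
print(interview_day, interview_day.year, interview_day.month, interview_day.day)
```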
142 |
143 | ### How would you automatically transform the items\_owned field into a usable format?
144 |
145 | If you are not following the course immediately with the OpenRefine lesson it is important
146 | to make it clear that in the current format SAFI\_clean.csv is not ready for analysis.
147 | The items\_owned column ideally needs to be split into separate yes / no / null columns.
148 | Example: set up a new column 'bicycle' and format it as a number. You then need to extract
149 | information from the items\_owned column about whether the word 'bicycle' is in the column.
150 | One way of doing this is to use an IF statement: =IF(ISNUMBER(SEARCH("bicycle",K2)),1,0).
151 | The search term in the IF statement can include a wildcard character, e.g. "bicy\*".
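
For comparison, the yes / no / null split can be sketched in Python. This is an illustration under stated assumptions, not the lesson's method: it assumes the items in the field are separated by semicolons and that a missing value is recorded as `NULL`, and the `owns` helper and sample values are hypothetical. Unlike a substring search, it matches whole item names only.

```python
def owns(items_owned, item):
    """Return 1 if item is listed, 0 if it is not, None if the field is missing."""
    if items_owned is None or items_owned.strip() in ("", "NULL"):
        return None
    listed = [entry.strip() for entry in items_owned.split(";")]
    return 1 if item in listed else 0


# Hypothetical field values, assuming ';' separators and 'NULL' for missing data.
samples = ["bicycle;television;solar_panel", "cow_cart;radio", "NULL"]
for value in samples:
    print(value, "->", owns(value, "bicycle"))
```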
152 |
153 |
154 |
--------------------------------------------------------------------------------
/learners/reference.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: 'Glossary'
3 | ---
4 |
5 | ## Glossary
6 |
7 | {:auto\_ids}
8 | cleaned data
9 | : data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
10 |
11 | conditional formatting
12 | : formatting that is applied to a specific cell or range of cells depending on a set of criteria
13 |
14 | CSV (comma separated values) format
15 | : a plain text file format in which values are separated by commas
16 |
17 | factor
18 | : a variable that takes on a limited number of possible values (i.e. categorical data)
19 |
20 | metadata
21 | : data which describes other data
22 |
23 | null value
24 | : a value used to record observations missing from a dataset
25 |
26 | observation
27 | : a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
28 |
29 | plain text
30 | : unformatted text
31 |
32 | quality assurance
33 | : any process which checks data for validity during entry
34 |
35 | quality control
36 | : any process which removes problematic data from a dataset
37 |
38 | raw data
39 | : data that has not been manipulated and represents actual recorded values
40 |
41 | rich text
42 | : formatted text (e.g. text that appears bolded, colored or italicized)
43 |
44 | string
45 | : a collection of characters (e.g. "thisisastring")
46 |
47 | TSV (tab separated values) format
48 | : a plain text file format in which values are separated by tabs
49 |
50 | variable
51 | : a category of data being collected on the object being recorded (e.g. a mouse's weight)
52 |
53 |
54 |
--------------------------------------------------------------------------------
/learners/setup.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: Setup
3 | ---
4 |
5 | :::::::::::::::::::::::::::::::::::::::::: prereq
6 |
7 | ## Data
8 |
9 | You need to download some files to follow this lesson:
10 |
11 | 1. Download the following three files:
12 |
13 | - [SAFI\_clean.csv](https://ndownloader.figshare.com/files/11492171)
14 | - [SAFI\_messy.xlsx](https://ndownloader.figshare.com/files/11502824)
15 | - [SAFI\_dates.xlsx](https://ndownloader.figshare.com/files/11502827)
16 |
17 | 2. Place these 3 files in a folder you can easily find and access on your
18 | computer (for instance in a `datacarpentry-spreadsheets` folder on your
19 | Desktop or within your Home folder).
20 |
21 | #### About the data
22 |
23 | For more information about the dataset and to
24 | download it from Figshare, check out the [Social Sciences workshop data
25 | page](https://www.datacarpentry.org/socialsci-workshop/data).
26 |
27 |
28 | ::::::::::::::::::::::::::::::::::::::::::::::::::
29 |
30 | :::::::::::::::::::::::::::::::::::::::::: prereq
31 |
32 | ## Software
33 |
34 | To work through this tutorial you will need access to a spreadsheet program. For this you have many options: [Microsoft Excel](https://www.microsoft.com/en-us/microsoft-365/excel), [LibreOffice](https://www.libreoffice.org/), [Apple Numbers](https://support.apple.com/numbers), [Gnumeric](http://www.gnumeric.org/), [ONLYOFFICE](https://www.onlyoffice.com/), and [WPS Office](https://www.wps.com/), among others. Commands may differ a bit between programs, but
35 | the general ideas for thinking about spreadsheets are the same.
36 |
37 | For this lesson, we encourage you to use LibreOffice or Microsoft Excel, as the tasks we will
38 | be doing have been tested in these programs. If you don't have Microsoft Excel, you can use
39 | LibreOffice. It's a free, open source spreadsheet program. Here are the instructions to install it:
40 |
41 |
42 |
43 | #### Windows
44 |
45 | - **Download the Installer**
46 | Install LibreOffice by going to the [installation
47 | page](https://www.libreoffice.org/download/download-libreoffice/). The
48 | version for Windows should automatically be selected. Click
49 | **Download**. You will go to a page that asks about a
50 | donation, but you don't need to make one. Your download should begin
51 | automatically.
52 | - **Install LibreOffice**
53 | Once the installer is downloaded, double click on it and it should
54 | install.
55 |
56 | #### Mac OS X
57 |
58 | - **Download the Installer**
59 | Install LibreOffice by going to the [installation
60 | page](https://www.libreoffice.org/download/download-libreoffice/). The
61 | version for macOS should automatically be selected. Click
62 | **Download**. You will go to a page that asks about a
63 | donation, but you don't need to make one. Your download should begin
64 | automatically.
65 | - **Install LibreOffice**
66 | The file *LibreOffice\_X.X.X\_MacOS\_x86-64* (whichever version of LibreOffice you have selected) should have been
67 | downloaded. Double click on this file, and LibreOffice will be
68 | installed.
69 |
70 | #### Linux
71 |
72 | - **Download the Installer**
73 | Install LibreOffice by going to the [installation
74 | page](https://www.libreoffice.org/download/download-libreoffice/). The
75 | version for Linux should automatically be selected. Click **Download**. You will go to a page that asks about a donation,
76 | but you don't need to make one. Your download should begin
77 | automatically.
78 | - **Install LibreOffice**
79 | Once the installer is downloaded, double click on it and it should
80 | install.
81 |
82 |
83 | ::::::::::::::::::::::::::::::::::::::::::::::::::
84 |
85 |
86 |
--------------------------------------------------------------------------------
/profiles/learner-profiles.md:
--------------------------------------------------------------------------------
1 | ---
2 | title: FIXME
3 | ---
4 |
5 | This is a placeholder file. Please add content here.
6 |
--------------------------------------------------------------------------------
/site/README.md:
--------------------------------------------------------------------------------
1 | This directory contains rendered lesson materials. Please do not edit files
2 | here.
3 |
--------------------------------------------------------------------------------