├── .bumpversion.toml
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug_report.md
│   │   └── content-or-feature-request.md
│   └── workflows
│       └── book.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .pymarkdown.json
├── CHANGELOG.md
├── CODE_OF_CONDUCT.md
├── LICENCE
├── README.md
├── book
│   ├── _config.yml
│   ├── _static
│   │   ├── BPI_logo.jpg
│   │   ├── accessibility.css
│   │   ├── admonitions.css
│   │   ├── af_logo.png
│   │   ├── code_quality.png
│   │   ├── duck_book_logo.svg
│   │   ├── favicon.ico
│   │   ├── git_deeper_branching.png
│   │   ├── git_develop.png
│   │   ├── git_feature.png
│   │   ├── git_main.png
│   │   ├── git_multiple_features.png
│   │   ├── github_pr_changes.png
│   │   ├── logo.css
│   │   ├── markdown_within_tabs.css
│   │   ├── no_integration_tests.png
│   │   ├── qa_of_code_favicon.png
│   │   ├── repro_stack.png
│   │   ├── semantic_versioning.png
│   │   └── separation_of_concerns.png
│   ├── _toc.yml
│   ├── checklist_higher.md
│   ├── checklist_lower.md
│   ├── checklist_moderate.md
│   ├── checklists.md
│   ├── code_documentation.md
│   ├── configuration.md
│   ├── continuous_integration.md
│   ├── data.md
│   ├── glossary.md
│   ├── intro.md
│   ├── learning.md
│   ├── managers_guide.md
│   ├── modular_code.md
│   ├── peer_review.md
│   ├── principles.md
│   ├── project_documentation.md
│   ├── project_structure.md
│   ├── readable_code.md
│   ├── testing_code.md
│   ├── tools.md
│   └── version_control.md
├── dev-requirements.txt
├── early_development
│   ├── references.md
│   └── validation-errors-logs.md
└── requirements.txt

/.bumpversion.toml:
--------------------------------------------------------------------------------
1 | [tool.bumpversion]
2 | current_version = "2025.1"
3 | parse = "(?P<year>\\d+)\\.(?P<build>\\d+)"
4 | serialize = ["{year}.{build}"]
5 | tag = true
6 | sign_tags = false
7 | tag_name = "v{new_version}"
8 | tag_message = "Bump version: {current_version} → {new_version}"
9 | allow_dirty = false
10 | commit = true
11 | message = "Bump version: {current_version} → {new_version}"
12 | 
13 | [[tool.bumpversion.files]]
14 | filename = "book/_config.yml"
15 | search = "Book version {current_version}"
16 | replace = "Book version {new_version}"
17 | 
18 | [[tool.bumpversion.files]]
19 | filename = "book/intro.md"
20 | search = "(version {current_version})"
21 | replace = "(version {new_version})"
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | **Describe the bug**
11 | A clear and concise description of what the problem is, e.g. a broken link.
12 | 
13 | **To Reproduce**
14 | Steps to reproduce the problem, e.g.:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 | 
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 | 
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 | 
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 | 
31 | **Additional context**
32 | Add any other context about the problem here.
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/content-or-feature-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Content or feature request
3 | about: Suggest an idea for the Duck Book!
4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your content or feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. e.g. 'I don't understand the testing section.' 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. e.g. 'I'd like more examples of tests.' 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.github/workflows/book.yaml: -------------------------------------------------------------------------------- 1 | name: Build and deploy book 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | tags: 8 | - 'v*' 9 | pull_request: 10 | 11 | jobs: 12 | build: 13 | runs-on: ubuntu-latest 14 | steps: 15 | - uses: actions/checkout@v3 16 | 17 | - name: Set up Python 3.10 18 | uses: actions/setup-python@v4 19 | with: 20 | python-version: 3.10.4 21 | 22 | - name: Install dependencies 23 | run: | 24 | pip install -r requirements.txt 25 | 26 | - name: Check book build 27 | run: | 28 | jb build book -W -v --keep-going 29 | 30 | - name: "Deploy book to GitHub Pages" 31 | if: startsWith(github.event.ref, 'refs/tags/v') 32 | uses: peaceiris/actions-gh-pages@v3.6.1 33 | with: 34 | github_token: ${{ secrets.GITHUB_TOKEN }} 35 | publish_dir: ./book/_build/html 36 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | # C extensions 6 | *.so 7 | # Distribution / packaging 8 | .Python 9 | build/ 10 | develop-eggs/ 11 | dist/ 12 | downloads/ 13 | eggs/ 14 | .eggs/ 15 | lib/ 16 | lib64/ 17 | parts/ 18 | sdist/ 19 | var/ 20 | wheels/ 21 | pip-wheel-metadata/ 22 | share/python-wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | # Installer logs 33 | pip-log.txt 34 | pip-delete-this-directory.txt 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .nox/ 39 | .coverage 40 | .coverage.* 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | *.cover 45 | *.py,cover 46 | .hypothesis/ 47 | .pytest_cache/ 48 | # Translations 49 | *.mo 50 | *.pot 51 | # Django stuff: 52 | *.log 53 | local_settings.py 54 | db.sqlite3 55 | db.sqlite3-journal 56 | # Flask stuff: 57 | instance/ 58 | .webassets-cache 59 | # Scrapy stuff: 60 | .scrapy 61 | # Sphinx documentation 62 | docs/_build/ 63 | # PyBuilder 64 | target/ 65 | # Jupyter Notebook 66 | .ipynb_checkpoints 67 | # IPython 68 | profile_default/ 69 | ipython_config.py 70 | # pyenv 71 | .python-version 72 | # pipenv 73 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 74 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 75 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 76 | # install all needed dependencies. 77 | #Pipfile.lock 78 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 79 | __pypackages__/ 80 | # Celery stuff 81 | celerybeat-schedule 82 | celerybeat.pid 83 | # SageMath parsed files 84 | *.sage.py 85 | # Environments 86 | .env 87 | .venv 88 | env/ 89 | venv/ 90 | ENV/ 91 | env.bak/ 92 | venv.bak/ 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | # Rope project settings 97 | .ropeproject 98 | # mkdocs documentation 99 | /site 100 | # mypy 101 | .mypy_cache/ 102 | .dmypy.json 103 | dmypy.json 104 | # Pyre type checker 105 | .pyre/ 106 | # Docs 107 | _build 108 | .Rproj.user 109 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | default_stages: [commit, push] 2 | 3 | repos: 4 | - repo: https://github.com/jackdewinter/pymarkdown 5 | rev: v0.9.9 6 | hooks: 7 | - id: pymarkdown 8 | verbose: true 9 | args: 10 | - --config=.pymarkdown.json 11 | - scan 12 | -------------------------------------------------------------------------------- /.pymarkdown.json: -------------------------------------------------------------------------------- 1 | { 2 | "plugins" : { 3 | "md003" : { 4 | "style" : "atx" 5 | }, 6 | "md012" : { 7 | "maximum" : 2 8 | }, 9 | "md013" : { 10 | "line_length" : 160, 11 | "heading_line_length" : 160, 12 | "code_block_line_length" : 160 13 | }, 14 | "md022" : { 15 | "enabled" : false 16 | }, 17 | "md024" : { 18 | "allow_different_nesting" : true 19 | } 20 | 21 | } 22 | } 23 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | This changelog will be updated with changes to the Duck Book including additions, changes, deletions, and fixes. 3 | 4 | ## Next release: 5 | ### Last updated: 6 | ### Adds: 7 | ### Changes: 8 | ### Fixes: 9 | ### Removes: 10 | 11 | ## 2025.1 12 | ### Changes: 13 | - Updated instructions and link for signing up to the Learning Hub 14 | 15 | ## 2025.0: 16 | ### Adds: 17 | - Guidance on SQL, mocking, end-to-end testing guidance 18 | - Sections on what to test, structuring tests, using classes, risks in skipping tests 19 | - Modelling-relevant testing 20 | - Additional contribution guidance 21 | 22 | ### Changes: 23 | - Updates to parameterisation and test structuring 24 | - Book reviewed for consistency in tone 25 | 26 | ### Fixes: 27 | - Typos and grammar 28 | - Broken links 29 | - Header anchors configuration 30 | 31 | ## 2024.1: 32 | ### Fixes: 33 | - Typos 34 | - Broken links 35 | 36 | This was developed from [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). 37 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | 2 | # Duck Book Code of Conduct 3 | 4 | ## Our Pledge 5 | 6 | We as members, contributors, and leaders pledge to make participation in our 7 | community a harassment-free experience for everyone, regardless of age, body 8 | size, visible or invisible disability, ethnicity, sex characteristics, gender 9 | identity and expression, level of experience, education, socio-economic status, 10 | nationality, personal appearance, race, caste, color, religion, or sexual 11 | identity and orientation. 12 | 13 | We pledge to act and interact in ways that contribute to an open, welcoming, 14 | diverse, inclusive, and healthy community. 
15 | 16 | ## Our Standards 17 | 18 | As the QA of Code Guidance (or commonly, the Duck Book) is owned by the UK Statistics Authority, a non-ministerial department which reports directly to the UK parliament, we encourage contributors to act in accordance with the highest standards of professional conduct. 19 | In addition, we expect contributors to not engage in any political activity or discussion in forums relating to this project. 20 | Where contributors are also civil servants we remind you of your obligations under the [Civil Service Code](https://www.gov.uk/government/publications/civil-service-code/the-civil-service-code). 21 | 22 | Examples of behavior that contributes to a positive environment for our 23 | community include: 24 | 25 | * Demonstrating empathy and kindness toward other people 26 | * Being respectful of differing opinions, viewpoints, and experiences 27 | * Giving and gracefully accepting constructive feedback 28 | * Accepting responsibility and apologizing to those affected by our mistakes, 29 | and learning from the experience 30 | * Focusing on what is best not just for us as individuals, but for the overall 31 | community 32 | 33 | Examples of unacceptable behavior include: 34 | 35 | * The use of sexualized language or imagery, and sexual attention or advances of 36 | any kind 37 | * Trolling, insulting or derogatory comments, and personal or political attacks 38 | * Public or private harassment 39 | * Publishing others' private information, such as a physical or email address, 40 | without their explicit permission 41 | * Other conduct which could reasonably be considered inappropriate in a 42 | professional setting 43 | 44 | ## Enforcement Responsibilities 45 | 46 | Community leaders are responsible for clarifying and enforcing our standards of 47 | acceptable behavior and will take appropriate and fair corrective action in 48 | response to any behavior that they deem inappropriate, threatening, offensive, 49 | or harmful. 50 | 51 | Community leaders have the right and responsibility to remove, edit, or reject 52 | comments, commits, code, wiki edits, issues, and other contributions that are 53 | not aligned to this Code of Conduct, and will communicate reasons for moderation 54 | decisions when appropriate. 55 | 56 | ## Scope 57 | 58 | This Code of Conduct applies within all community spaces, and also applies when 59 | an individual is officially representing the community in public spaces. 60 | Examples of representing our community include using an official e-mail address, 61 | posting via an official social media account, or acting as an appointed 62 | representative at an online or offline event. 63 | 64 | ## Enforcement 65 | 66 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 67 | reported to the community leaders responsible for enforcement at 68 | analysis.function@ons.gov.uk . 69 | All complaints will be reviewed and investigated promptly and fairly. 70 | 71 | All community leaders are obligated to respect the privacy and security of the 72 | reporter of any incident. 73 | 74 | ## Enforcement Guidelines 75 | 76 | Community leaders will follow these Community Impact Guidelines in determining 77 | the consequences for any action they deem in violation of this Code of Conduct: 78 | 79 | ### 1. Correction 80 | 81 | **Community Impact**: Use of inappropriate language or other behavior deemed 82 | unprofessional or unwelcome in the community. 
83 | 84 | **Consequence**: A private, written warning from community leaders, providing 85 | clarity around the nature of the violation and an explanation of why the 86 | behavior was inappropriate. A public apology may be requested. 87 | 88 | ### 2. Warning 89 | 90 | **Community Impact**: A violation through a single incident or series of 91 | actions. 92 | 93 | **Consequence**: A warning with consequences for continued behavior. No 94 | interaction with the people involved, including unsolicited interaction with 95 | those enforcing the Code of Conduct, for a specified period of time. This 96 | includes avoiding interactions in community spaces as well as external channels 97 | like social media. Violating these terms may lead to a temporary or permanent 98 | ban. 99 | 100 | ### 3. Temporary Ban 101 | 102 | **Community Impact**: A serious violation of community standards, including 103 | sustained inappropriate behavior. 104 | 105 | **Consequence**: A temporary ban from any sort of interaction or public 106 | communication with the community for a specified period of time. No public or 107 | private interaction with the people involved, including unsolicited interaction 108 | with those enforcing the Code of Conduct, is allowed during this period. 109 | Violating these terms may lead to a permanent ban. 110 | 111 | ### 4. Permanent Ban 112 | 113 | **Community Impact**: Demonstrating a pattern of violation of community 114 | standards, including sustained inappropriate behavior, harassment of an 115 | individual, or aggression toward or disparagement of classes of individuals. 116 | 117 | **Consequence**: A permanent ban from any sort of public interaction within the 118 | community. 119 | 120 | ## Attribution 121 | 122 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 123 | version 2.1, available at 124 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 125 | 126 | Community Impact Guidelines were inspired by 127 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 128 | 129 | For answers to common questions about this code of conduct, see the FAQ at 130 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 131 | [https://www.contributor-covenant.org/translations][translations]. 132 | 133 | [homepage]: https://www.contributor-covenant.org 134 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 135 | [Mozilla CoC]: https://github.com/mozilla/diversity 136 | [FAQ]: https://www.contributor-covenant.org/faq 137 | [translations]: https://www.contributor-covenant.org/translations -------------------------------------------------------------------------------- /LICENCE: -------------------------------------------------------------------------------- 1 | Copyright 2020, Crown copyright. 2 | 3 | This publication is licensed under the terms of the Open Government Licence v3.0 except where otherwise stated. 4 | You are encouraged to use and re-use the Information that is available under this licence freely and flexibly, with only a few conditions. 5 | To view this licence, visit nationalarchives.gov.uk/doc/open-government-licence/version/3 or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: psi@nationalarchives.gsi.gov.uk. 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Quality Assurance of Code for Analysis and Research
2 | 
3 | This document forms part of the Quality Guidance, published by the Quality and Improvement Division of the Methods and Quality Directorate at the UK
4 | [Office for National Statistics](https://www.ons.gov.uk).
5 | 
6 | 
7 | ## Contributing to this guidance
8 | 
9 | We welcome all constructive feedback and contributions.
10 | 
11 | To provide feedback or request new content, you can [create an issue](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues) on this book's repository.
12 | Alternatively, you can always drop us an [email](mailto:ASAP@ons.gov.uk).
13 | 
14 | If you'd like to contribute, please also
15 | [create or comment on an issue](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues)
16 | to describe the changes that you'd like to make.
17 | This will allow discussion around whether content is suitable for the book, before you put the hard work into implementing it.
18 | 
19 | 
20 | ### Getting started
21 | 
22 | Minor text edits can be submitted as a Pull Request using the "Suggest edit" button under the GitHub logo at the top of the page you would like to change.
23 | 
24 | For changes to anything other than lines of text, you should follow these steps to make the changes locally:
25 | 
26 | To start contributing, you'll need Python installed.
27 | If you sit outside of the Quality and Improvement Division, you'll need to [create a fork of this repository to make changes](https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/working-with-forks).
28 | Once forked, you should clone the forked repository to get a copy of the book. Then install its Python dependencies like so:
29 | 
30 | ```{none}
31 | git clone https://github.com/<your-username>/qa-of-code-guidance.git
32 | cd qa-of-code-guidance
33 | pip install -r requirements.txt
34 | ```
35 | 
36 | Great, now you should have the dependencies, including [jupyter-book](https://jupyterbook.org/intro.html), installed and be able to build the book locally using:
37 | 
38 | ```{none}
39 | jb build book
40 | ```
41 | 
42 | Jupyter book will write the book's `HTML` content to `book/_build/html/`, so you can open `index.html` from there to view the local build.
43 | Run the build command after making a change to the text to update the HTML that you view here.
44 | 
45 | All content for the book is currently written in
46 | [Markedly Structured Text](https://myst-parser.readthedocs.io/en/latest/),
47 | which is based on standard Markdown (`.md`) but allows use of "directives" for generating content.
48 | 
49 | We also require developers to conform to a specific Markdown style.
50 | You can do this by installing our pre-commit hook, `pymarkdownlnt`:
51 | 
52 | ```{none}
53 | pip install -r dev-requirements.txt
54 | pre-commit install
55 | ```
56 | 
57 | ### Guidelines
58 | 
59 | When contributing to the book:
60 | 
61 | * Use plain English and active tense throughout
62 | * Include only what the user needs to know
63 | * Explain informative image content in text where possible
64 | * Provide examples of good and/or bad practices to support your content
65 | * Avoid duplicating content elsewhere in the book, link to the relevant section instead
66 | * Take on feedback from users and other developers
67 | 
68 | Keep any content that is in early development under the `early_development/` directory.
69 | Content that is ready for publication belongs under `book/`.
70 | All pages in `book/` must be referenced in `_toc.yml` or a warning will be raised and the changes will not be published.
71 | 
72 | ### Reviewing contributions
73 | 
74 | We recommend getting all contributions peer reviewed. Before submitting, check the following:
75 | 
76 | * Spelling and grammar. Refer to the [Government Digital Service (GDS) style guide](https://www.gov.uk/guidance/style-guide) for commonly used conventions.
77 | * Text is as simple as possible. The [Hemingway app](https://hemingwayapp.com/) can be used to identify readability issues.
78 | * Sentence case is used for titles and subheadings
79 | * All images have informative alt text
80 | * Technical terms are explained the first time they are used
81 | * Hyperlinks work and have informative anchor text (e.g. 'blog post on reproducibility' instead of 'this link')
82 | * Where possible, test new content with a screen reader
83 | 
84 | 
85 | ### Submitting contributions
86 | 
87 | You should create a new branch to collect related changes that you make.
88 | Once you're happy with any changes you've made to the book, you should raise a
89 | [Pull Request (PR)](https://github.com/best-practice-and-impact/qa-of-code-guidance/pulls)
90 | to the `main` branch of the main repository.
91 | The source branch of this PR should be the fork and/or branch that you have committed changes to.
92 | 
93 | ## Publishing changes
94 | 
95 | Internal contributors can trigger a new release of the book.
96 | 
97 | ### Preparation
98 | 
99 | To create a new release and publish the `main` branch, you will need to install the development dependencies:
100 | 
101 | ```{none}
102 | pip install -r dev-requirements.txt
103 | ```
104 | 
105 | ### Releasing
106 | 
107 | To create a new release, use the command line tool `bump-my-version`, which will be installed with the dev dependencies.
108 | The version number references the current `year` and an incremental `build` count.
109 | 
110 | For the first release of a year, provide `year` as the command argument; otherwise, provide `build`.
111 | 
112 | ```{none}
113 | bump-my-version bump year
114 | ```
115 | 
116 | `bump-my-version` will create a new Git `tag` and `commit`.
117 | If you're happy with the version increase, `push` these to the remote to trigger the publication, by running both: 118 | 119 | ```{none} 120 | git push 121 | git push --tags 122 | ``` 123 | -------------------------------------------------------------------------------- /book/_config.yml: -------------------------------------------------------------------------------- 1 | # Book settings 2 | title: "Quality Assurance of Code for Analysis and Research" 3 | author: "" 4 | email: ASAP@ons.gov.uk 5 | logo: "./_static/duck_book_logo.svg" 6 | 7 | repository: 8 | url: https://github.com/best-practice-and-impact/qa-of-code-guidance 9 | path_to_book: book 10 | branch: main 11 | 12 | html: 13 | home_page_in_navbar: false 14 | favicon: "./_static/favicon.ico" 15 | use_repository_button: true 16 | use_issues_button: true 17 | use_edit_page_button: true 18 | extra_footer: "

Analysis Function - Analytical Standards and Pipelines hub

This is a living document, please help us to make it grow by providing any feedback by email or via the GitHub repository." 19 | extra_navbar: "

Book version 2025.1

The Government Analysis Function logo" 20 | 21 | sphinx: 22 | extra_extensions: 23 | - sphinx_tabs.tabs 24 | - sphinx.ext.todo 25 | config: 26 | todo_include_todos: true 27 | language: en 28 | html_show_copyright: false 29 | myst_heading_anchors: 3 30 | 31 | latex: 32 | latex_documents: 33 | targetname: book.tex 34 | -------------------------------------------------------------------------------- /book/_static/BPI_logo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/BPI_logo.jpg -------------------------------------------------------------------------------- /book/_static/accessibility.css: -------------------------------------------------------------------------------- 1 | /* 2 | Inline code 3 | Contrast ratio: 4.97:1 4 | Meets AA standard for normal sized text 5 | */ 6 | code { 7 | color: #D51A72; 8 | } 9 | 10 | /* 11 | Code block comments 12 | Contrast ratio: 4.76:1 13 | Meets AA standard for normal sized text 14 | */ 15 | .highlight .c1 { 16 | color: #3B7887; 17 | } 18 | -------------------------------------------------------------------------------- /book/_static/admonitions.css: -------------------------------------------------------------------------------- 1 | /* 2 | Orange AF #ff6a00 3 | Blue AF #193f54 4 | */ 5 | 6 | /* Custom admonition, usage like: 7 | 8 | ```{admonition} Key Learning 9 | :class: admonition-learning 10 | 11 | ``` 12 | */ 13 | .admonition-learning p.admonition-title::after { 14 | content: "\f5fc"; 15 | left: .35rem; 16 | top: .5rem; 17 | } 18 | 19 | /* Custom admonition, usage like: 20 | 21 | ```{admonition} Key Strategies 22 | :class: admonition-strategies 23 | 24 | ``` 25 | */ 26 | .admonition-strategies p.admonition-title::after { 27 | content: "\f439"; 28 | left: .55rem; 29 | } -------------------------------------------------------------------------------- /book/_static/af_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/af_logo.png -------------------------------------------------------------------------------- /book/_static/code_quality.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/code_quality.png -------------------------------------------------------------------------------- /book/_static/duck_book_logo.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /book/_static/favicon.ico: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/favicon.ico -------------------------------------------------------------------------------- /book/_static/git_deeper_branching.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/git_deeper_branching.png 
-------------------------------------------------------------------------------- /book/_static/git_develop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/git_develop.png -------------------------------------------------------------------------------- /book/_static/git_feature.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/git_feature.png -------------------------------------------------------------------------------- /book/_static/git_main.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/git_main.png -------------------------------------------------------------------------------- /book/_static/git_multiple_features.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/git_multiple_features.png -------------------------------------------------------------------------------- /book/_static/github_pr_changes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/github_pr_changes.png -------------------------------------------------------------------------------- /book/_static/logo.css: -------------------------------------------------------------------------------- 1 | div.site-navigation div.navbar-brand-box a.navbar-brand img.logo { 2 | max-width: 75%; 3 | } -------------------------------------------------------------------------------- /book/_static/markdown_within_tabs.css: -------------------------------------------------------------------------------- 1 | /* Fixes sphinx tabs that contain HTML h1 elements. 2 | Otherwise h1::before covers the tabs and prevents switching. 
*/ 3 | .sphinx-tabs-panel h1::before { 4 | display: none; 5 | } -------------------------------------------------------------------------------- /book/_static/no_integration_tests.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/no_integration_tests.png -------------------------------------------------------------------------------- /book/_static/qa_of_code_favicon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/qa_of_code_favicon.png -------------------------------------------------------------------------------- /book/_static/repro_stack.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/repro_stack.png -------------------------------------------------------------------------------- /book/_static/semantic_versioning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/semantic_versioning.png -------------------------------------------------------------------------------- /book/_static/separation_of_concerns.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/best-practice-and-impact/qa-of-code-guidance/e1868108cdafa3798732504ab9022e0b4fb215e7/book/_static/separation_of_concerns.png -------------------------------------------------------------------------------- /book/_toc.yml: -------------------------------------------------------------------------------- 1 | format: jb-book 2 | root: intro 3 | parts: 4 | - caption: Overview 5 | chapters: 6 | - file: glossary 7 | - file: managers_guide 8 | - file: checklists 9 | sections: 10 | - file: checklist_lower 11 | - file: checklist_moderate 12 | - file: checklist_higher 13 | 14 | 15 | - caption: Guidance 16 | chapters: 17 | - file: principles 18 | - file: modular_code 19 | - file: readable_code 20 | - file: project_structure 21 | - file: code_documentation 22 | - file: project_documentation 23 | - file: version_control 24 | - file: configuration 25 | - file: data 26 | - file: peer_review 27 | - file: testing_code 28 | - file: continuous_integration 29 | 30 | 31 | - caption: Resources 32 | chapters: 33 | - file: learning 34 | - file: tools 35 | -------------------------------------------------------------------------------- /book/checklist_higher.md: -------------------------------------------------------------------------------- 1 | # Higher quality assurance 2 | 3 | ## Quality assurance checklist 4 | 5 | Quality assurance checklist from [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html). 6 | 7 | ### Modular code 8 | 9 | - Individual pieces of logic are written as functions. Classes are used if more appropriate. 10 | - Code is grouped in themed files (modules) and is packaged for easier use. 11 | - Main analysis scripts import and run high level functions from the package. 
12 | - Low level functions and classes carry out one specific task. As such, there is only one reason to change each function.
13 | - Repetition in the code is minimised. For example, by moving reusable code into functions or classes.
14 | - Objects and functions are open for extension but closed for modification; functionality can be extended without modifying the source code.
15 | - Subclasses retain the functionality of their parent class while adding new functionality. Parent class objects can be replaced with instances of the subclass
16 | and still work as expected.
17 | 
18 | ### Good coding practices
19 | 
20 | - Names used in the code are informative and concise.
21 | - Names used in the code are explicit, rather than implicit.
22 | - Code logic is clear and avoids unnecessary complexity.
23 | - Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/)
24 | and [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R.
25 | 
26 | ### Project structure
27 | 
28 | - A clear, standard directory structure is used to separate input data, outputs, code and documentation.
29 | - Packages follow a standard structure.
30 | 
31 | ### Code documentation
32 | 
33 | - Comments are used to describe why code is written in a particular way, rather than describing what the code is doing.
34 | - Comments are kept up to date, so they do not confuse the reader.
35 | - Code is not commented out to adjust which lines of code run.
36 | - All functions and classes are documented to describe what they do, what inputs they take and what they return.
37 | - Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/). R code is
38 | [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html).
39 | - Human-readable (preferably HTML) documentation is generated automatically from code documentation.
40 | - Documentation is hosted for easy access. [GitHub Pages](https://pages.github.com/) and
41 | [Read the Docs](https://readthedocs.org/) provide a free service for hosting documentation publicly.
42 | 
43 | ### Project documentation
44 | 
45 | - A README file details the purpose of the project, basic installation instructions, and examples of usage.
46 | - Where appropriate, guidance for prospective contributors is available including a code of conduct.
47 | - If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases.
48 | - The extent of analytical quality assurance conducted on the project is clearly documented.
49 | - Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users.
50 | - Copyright and licenses are specified for both documentation and code.
51 | - Instructions for how to cite the project are given.
52 | - Releases of the project used for reports, publications, or other outputs are versioned using a standard pattern such as [semantic versioning](https://semver.org/).
53 | - A summary of [changes to functionality are documented in a changelog](https://keepachangelog.com/en/1.0.0/) following releases. The changelog is available to users.
54 | - Example usage of packages and underlying functionality is documented for developers and users.
55 | - Design certificates confirm that the design is compliant with requirements.
56 | - If appropriate, the software is fully specified.
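
As an illustration of the documentation items above, here is a minimal sketch of a documented function (all names hypothetical) that records an analysis assumption next to the code that implements it:

```{code-block} python
def deflate(value: float, price_index: float) -> float:
    """Convert a nominal value into real terms.

    Assumption (also recorded in the assumptions log): the price
    index is rebased so that the reference period equals 100.

    Parameters
    ----------
    value : float
        Nominal value, in pounds.
    price_index : float
        Price index for the period, where the reference period = 100.

    Returns
    -------
    float
        Value in real terms.
    """
    return value * 100 / price_index
```

Documentation generators such as Sphinx can render docstrings written in this way into the human-readable HTML documentation described above.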
57 | 58 | ### Version control 59 | 60 | - Code is [version controlled using Git](https://git-scm.com/). 61 | - Code is committed regularly, preferably when a discrete unit of work has been completed. 62 | - An appropriate branching strategy is defined and used throughout development. 63 | - Code is open-sourced. Any sensitive data are omitted or replaced with dummy data. 64 | - Committing standards are followed such as appropriate commit summary and message supplied. 65 | - Commits are tagged at significant stages. This is used to indicate the state of code for specific releases or model versions. 66 | - Continuous integration is applied through tools such as [GitHub Actions](https://github.com/features/actions), 67 | to ensure that each change is integrated into the workflow smoothly. 68 | 69 | ### Configuration 70 | 71 | - Credentials and other secrets are not written in code but are configured as environment variables. 72 | - Configuration is stored in a dedicated configuration file, separate to the code. 73 | - If appropriate, multiple configuration files are used depending on system/local/user. 74 | - Configuration files are version controlled separately to the analysis code, so that they can be updated independently. 75 | - The configuration used to generate particular outputs, releases and publications is recorded. 76 | - Example configuration file templates are provided alongside the code, but do not include real data. 77 | 78 | ### Data management 79 | 80 | - Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/). 81 | - All data for analysis are stored in an open format, so that specific software is not required to access them. 82 | - Input data are stored safely and are treated as read-only. 83 | - Input data are versioned. All changes to the data result in new versions being created, 84 | or [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension). 85 | - All input data is documented in a data register, including where they come from and their importance to the analysis. 86 | - Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops. 87 | Your analysis code is able to reproduce them at any time. 88 | - Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others. 89 | - Data quality is monitored, as per 90 | [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework). 91 | - Fields within input and output datasets are documented in a data dictionary. 92 | - Large or complex data are stored in a database. 93 | - Data are documented in an information asset register. 94 | 95 | ### Peer review 96 | 97 | - Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant. 98 | - Pair programming is used to review code and share knowledge. 99 | - Users are encouraged to participate in peer review. 100 | 101 | ### Testing 102 | 103 | - Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/). 104 | - Code based tests are run regularly and after every significant change to the code. 105 | - Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur. 
106 | - Informal tests are recorded near to the code.
107 | - Stakeholder or user acceptance sign-offs are recorded near to the code.
108 | - Tests are automatically run and recorded using continuous integration or git hooks.
109 | - The whole process is tested from start to finish using one or more realistic end-to-end tests.
110 | - Test code is clean and readable. Tests make use of fixtures and parameterisation to reduce repetition.
111 | - Formal user acceptance testing is conducted and recorded.
112 | - Integration tests ensure that multiple units of code work together as expected.
113 | 
114 | ### Dependency management
115 | 
116 | - Required passwords, secrets and tokens are documented, but are stored outside of version control.
117 | - Required libraries and packages are documented, including their versions.
118 | - Working operating system environments are documented.
119 | - Example configuration files are provided.
120 | - Where appropriate, code runs independently of the operating system (for example there is suitable management of file paths for different operating systems).
121 | - Dependencies are managed separately for users, developers, and testers.
122 | - There are as few dependencies as possible.
123 | - Package dependencies are managed using an environment manager such as
124 | [virtualenv for Python](https://virtualenv.pypa.io/en/latest/) or [renv for R](https://rstudio.github.io/renv/articles/renv.html).
125 | - Docker containers or virtual machine builds are available for the code execution environment and these are version controlled.
126 | 
127 | ### Logging
128 | 
129 | - Misuse or failure in the code produces informative error messages.
130 | - Code configuration is recorded when the code is run.
131 | - Pipeline route is recorded if decisions are made in code.
132 | 
133 | ### Project management
134 | 
135 | - The roles and responsibilities of team members are clearly defined.
136 | - An issue tracker (e.g. GitHub Project, Trello or Jira) is used to record development tasks.
137 | - New issues or tasks are guided by users’ needs and stories.
138 | - Issue templates are used to ensure proper logging of the title, description, labels and comments.
139 | - Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
140 | - Quality assurance standards and processes for the project are defined. These are based around
141 | [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
142 | 
143 | ## Template checklist
144 | 
145 | You can either refer to the checklist above, or use the Markdown template below to include the checklist in your project.
146 | 
147 | ```{code-block} md
148 | ## Quality assurance checklist
149 | 
150 | Quality assurance checklist from
151 | [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
152 | 
153 | ### Modular code
154 | 
155 | - [ ] Individual pieces of logic are written as functions. Classes are used if more appropriate.
156 | - [ ] Code is grouped in themed files (modules) and is packaged for easier use.
157 | - [ ] Main analysis scripts import and run high level functions from the package.
158 | - [ ] Low level functions and classes carry out one specific task. As such, there is only one reason to change each function.
159 | - [ ] Repetition in the code is minimised.
For example, by moving reusable code into functions or classes. 160 | - [ ] Objects and functions are open for extension but closed for modification; functionality can be extended without modifying the source code. 161 | - [ ] Subclasses retain the functionality of their parent class while adding new functionality. Parent class objects can be replaced with instances of the 162 | subclass and still work as expected. 163 | 164 | ### Good coding practices 165 | 166 | - [ ] Names used in the code are informative and concise. 167 | - [ ] Names used in the code are explicit, rather than implicit. 168 | - [ ] Code logic is clear and avoids unnecessary complexity. 169 | - [ ] Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/) and 170 | [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R. 171 | 172 | ### Project structure 173 | 174 | - [ ] A clear, standard directory structure is used to separate input data, outputs, code and documentation. 175 | - [ ] Packages follow a standard structure. 176 | 177 | ### Code documentation 178 | 179 | - [ ] Comments are used to describe why code is written in a particular way, rather than describing what the code is doing. 180 | - [ ] Comments are kept up to date, so they do not confuse the reader. 181 | - [ ] Code is not commented out to adjust which lines of code run. 182 | - [ ] All functions and classes are documented to describe what they do, what inputs they take and what they return. 183 | - [ ] Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/). 184 | R code is [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html). 185 | - [ ] Human-readable (preferably HTML) documentation is generated automatically from code documentation. 186 | - [ ] Documentation is hosted for easy access. [GitHub Pages](https://pages.github.com/) and 187 | [Read the Docs](https://readthedocs.org/) provide a free service for hosting documentation publicly. 188 | 189 | ### Project documentation 190 | 191 | - [ ] A README file details the purpose of the project, basic installation instructions, and examples of usage. 192 | - [ ] Where appropriate, guidance for prospective contributors is available including a code of conduct. 193 | - [ ] If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases. 194 | - [ ] The extent of analytical quality assurance conducted on the project is clearly documented. 195 | - [ ] Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users. 196 | - [ ] Copyright and licenses are specified for both documentation and code. 197 | - [ ] Instructions for how to cite the project are given. 198 | - [ ] Releases of the project used for reports, publications, or other outputs are versioned using a standard pattern such as 199 | [semantic versioning](https://semver.org/). 200 | - [ ] A summary of [changes to functionality are documented in a changelog](https://keepachangelog.com/en/1.0.0/) following releases. 201 | The changelog is available to users. 202 | - [ ] Example usage of packages and underlying functionality is documented for developers and users. 203 | - [ ] Design certificates confirm that the design is compliant with requirements. 204 | - [ ] If appropriate, the software is fully specified. 
205 | 206 | ### Version control 207 | 208 | - [ ] Code is [version controlled using Git](https://git-scm.com/). 209 | - [ ] Code is committed regularly, preferably when a discrete unit of work has been completed. 210 | - [ ] An appropriate branching strategy is defined and used throughout development. 211 | - [ ] Code is open-sourced. Any sensitive data are omitted or replaced with dummy data. 212 | - [ ] Committing standards are followed such as appropriate commit summary and message supplied. 213 | - [ ] Commits are tagged at significant stages. This is used to indicate the state of code for specific releases or model versions. 214 | - [ ] Continuous integration is applied through tools such as [GitHub Actions](https://github.com/features/actions), 215 | to ensure that each change is integrated into the workflow smoothly. 216 | 217 | ### Configuration 218 | 219 | - [ ] Credentials and other secrets are not written in code but are configured as environment variables. 220 | - [ ] Configuration is stored in a dedicated configuration file, separate to the code. 221 | - [ ] If appropriate, multiple configuration files are used depending on system/local/user. 222 | - [ ] Configuration files are version controlled separately to the analysis code, so that they can be updated independently. 223 | - [ ] The configuration used to generate particular outputs, releases and publications is recorded. 224 | - [ ] Example configuration file templates are provided alongside the code, but do not include real data. 225 | 226 | ### Data management 227 | 228 | - [ ] Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/). 229 | - [ ] All data for analysis are stored in an open format, so that specific software is not required to access them. 230 | - [ ] Input data are stored safely and are treated as read-only. 231 | - [ ] Input data are versioned. All changes to the data result in new versions being created, 232 | or [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension). 233 | - [ ] All input data is documented in a data register, including where they come from and their importance to the analysis. 234 | - [ ] Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops. 235 | Your analysis code is able to reproduce them at any time. 236 | - [ ] Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others. 237 | - [ ] Data quality is monitored, as per 238 | [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework). 239 | - [ ] Fields within input and output datasets are documented in a data dictionary. 240 | - [ ] Large or complex data are stored in a database. 241 | - [ ] Data are documented in an information asset register. 242 | 243 | ### Peer review 244 | 245 | - [ ] Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant. 246 | - [ ] Pair programming is used to review code and share knowledge. 247 | - [ ] Users are encouraged to participate in peer review. 248 | 249 | ### Testing 250 | 251 | - [ ] Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and 252 | [`testthat` for R](https://testthat.r-lib.org/). 
253 | - [ ] Code based tests are run regularly and after every significant change to the code base.
254 | - [ ] Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
255 | - [ ] Informal tests are recorded near to the code.
256 | - [ ] Stakeholder or user acceptance sign-offs are recorded near to the code.
257 | - [ ] Tests are automatically run and recorded using continuous integration or git hooks.
258 | - [ ] The whole process is tested from start to finish using one or more realistic end-to-end tests.
259 | - [ ] Test code is clean and readable. Tests make use of fixtures and parameterisation to reduce repetition.
260 | - [ ] Formal user acceptance testing is conducted and recorded.
261 | - [ ] Integration tests ensure that multiple units of code work together as expected.
262 | 
263 | ### Dependency management
264 | 
265 | - [ ] Required passwords, secrets and tokens are documented, but are stored outside of version control.
266 | - [ ] Required libraries and packages are documented, including their versions.
267 | - [ ] Working operating system environments are documented.
268 | - [ ] Example configuration files are provided.
269 | - [ ] Where appropriate, code runs independent of operating system (e.g. suitable management of file paths).
270 | - [ ] Dependencies are managed separately for users, developers, and testers.
271 | - [ ] There are as few dependencies as possible.
272 | - [ ] Package dependencies are managed using an environment manager such as
273 | [virtualenv for Python](https://virtualenv.pypa.io/en/latest/) or [renv for R](https://rstudio.github.io/renv/articles/renv.html).
274 | - [ ] Docker containers or virtual machine builds are available for the code execution environment and these are version controlled.
275 | 
276 | ### Logging
277 | 
278 | - [ ] Misuse or failure in the code produces informative error messages.
279 | - [ ] Code configuration is recorded when the code is run.
280 | - [ ] Pipeline route is recorded if decisions are made in code.
281 | 
282 | ### Project management
283 | 
284 | - [ ] The roles and responsibilities of team members are clearly defined.
285 | - [ ] An issue tracker (e.g. GitHub Project, Trello or Jira) is used to record development tasks.
286 | - [ ] New issues or tasks are guided by users’ needs and stories.
287 | - [ ] Issue templates are used to ensure proper logging of the title, description, labels and comments.
288 | - [ ] Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
289 | - [ ] Quality assurance standards and processes for the project are defined. These are based around
290 | [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
291 | ```
--------------------------------------------------------------------------------
/book/checklist_lower.md:
--------------------------------------------------------------------------------
1 | # Lower quality assurance
2 | 
3 | ## Quality assurance checklist
4 | 
5 | Quality assurance checklist from [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
6 | 
7 | ### Modular code
8 | 
9 | - Individual pieces of logic are written as functions. Classes are used if more appropriate.
10 | - Repetition in the code is minimised. For example, by moving reusable code into functions or classes.
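
To make that last item concrete, a minimal sketch (with hypothetical data and column names) of repeated logic moved into a single reusable function:

```{code-block} python
import pandas as pd


def standardise(series: pd.Series) -> pd.Series:
    """Scale a numeric series to mean zero and unit variance."""
    return (series - series.mean()) / series.std()


df = pd.DataFrame({"height": [150.0, 160.0, 170.0],
                   "weight": [55.0, 65.0, 75.0]})

# One tested function replaces several copies of the same formula.
for column in ["height", "weight"]:
    df[column] = standardise(df[column])
```

Writing the logic once means there is a single place to test, document and fix it.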
11 | 12 | ### Good coding practices 13 | 14 | - Names used in the code are informative and concise. 15 | - Code logic is clear and avoids unnecessary complexity. 16 | - Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/) and [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R. 17 | 18 | ### Project structure 19 | 20 | - A clear, standard directory structure is used to separate input data, outputs, code and documentation. 21 | 22 | ### Code documentation 23 | 24 | - Comments are used to describe why code is written in a particular way, rather than describing what the code is doing. 25 | - Comments are kept up to date, so they do not confuse the reader. 26 | - Code is not commented out to adjust which lines of code run. 27 | - All functions and classes are documented to describe what they do, what inputs they take and what they return. 28 | - Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/). R code is [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html). 29 | 30 | ### Project documentation 31 | 32 | - A README file details the purpose of the project, basic installation instructions, and examples of usage. 33 | - Where appropriate, guidance for prospective contributors is available including a code of conduct. 34 | - If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases. 35 | - The extent of analytical quality assurance conducted on the project is clearly documented. 36 | - Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users. 37 | - Copyright and licenses are specified for both documentation and code. 38 | - Instructions for how to cite the project are given. 39 | 40 | ### Version control 41 | 42 | - Code is [version controlled using Git](https://git-scm.com/). 43 | - Code is committed regularly, preferably when a discrete unit of work has been completed. 44 | - An appropriate branching strategy is defined and used throughout development. 45 | - Code is open-sourced. Any sensitive data are omitted or replaced with dummy data. 46 | 47 | ### Configuration 48 | 49 | - Credentials and other secrets are not written in code but are configured as environment variables. 50 | - Configuration is clearly separated from code used for analysis, so that it is simple to identify and update. 51 | - The configuration used to generate particular outputs, releases and publications is recorded. 52 | 53 | ### Data management 54 | 55 | - Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/). 56 | - All data for analysis are stored in an open format, so that specific software is not required to access them. 57 | - Input data are stored safely and are treated as read-only. 58 | - Input data are versioned. All changes to the data result in new versions being created, or [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension). 59 | - All input data is documented in a data register, including where they come from and their importance to the analysis. 60 | - Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops. Your analysis code is able to reproduce them at any time. 
61 | - Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others.
62 | - Data quality is monitored, as per [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework).
63 | 
64 | ### Peer review
65 | 
66 | - Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant.
67 | 
68 | ### Testing
69 | 
70 | - Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/).
71 | - Code based tests are run regularly, ideally being automated using continuous integration.
72 | - Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
73 | - Informal tests are recorded near to the code.
74 | - Stakeholder or user acceptance sign-offs are recorded near to the code.
75 | 
76 | ### Dependency management
77 | 
78 | - Required passwords, secrets and tokens are documented, but are stored outside of version control.
79 | - Required libraries and packages are documented, including their versions.
80 | - Working operating system environments are documented.
81 | - Example configuration files are provided.
82 | 
83 | ### Logging
84 | 
85 | - Misuse or failure in the code produces informative error messages.
86 | - Code configuration is recorded when the code is run.
87 | 
88 | ### Project management
89 | 
90 | - The roles and responsibilities of team members are clearly defined.
91 | - An issue tracker (e.g. GitHub Projects, Trello or Jira) is used to record development tasks.
92 | - New issues or tasks are guided by users’ needs and stories.
93 | - Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
94 | - Quality assurance standards and processes for the project are defined. These are based around [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
95 | 
96 | ## Template checklist
97 | 
98 | You can either refer to the checklist above, or use the Markdown template below to include the checklist in your project.
99 | 
100 | ```{code-block} md
101 | ## Quality assurance checklist
102 | 
103 | Quality assurance checklist from [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
104 | 
105 | ### Modular code
106 | 
107 | - [ ] Individual pieces of logic are written as functions. Classes are used if more appropriate.
108 | - [ ] Repetition in the code is minimised. For example, by moving reusable code into functions or classes.
109 | 
110 | ### Good coding practices
111 | 
112 | - [ ] Names used in the code are informative and concise.
113 | - [ ] Code logic is clear and avoids unnecessary complexity.
114 | - [ ] Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/) and [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R.
115 | 
116 | ### Project structure
117 | 
118 | - [ ] A clear, standard directory structure is used to separate input data, outputs, code and documentation.
119 | 
120 | ### Code documentation
121 | 
122 | - [ ] Comments are used to describe why code is written in a particular way, rather than describing what the code is doing.
123 | - [ ] Comments are kept up to date, so they do not confuse the reader.
124 | - [ ] Code is not commented out to adjust which lines of code run.
125 | - [ ] All functions and classes are documented to describe what they do, what inputs they take and what they return.
126 | - [ ] Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/). R code is [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html).
127 | 
128 | ### Project documentation
129 | 
130 | - [ ] A README file details the purpose of the project, basic installation instructions, and examples of usage.
131 | - [ ] Where appropriate, guidance for prospective contributors is available including a code of conduct.
132 | - [ ] If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases.
133 | - [ ] The extent of analytical quality assurance conducted on the project is clearly documented.
134 | - [ ] Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users.
135 | - [ ] Copyright and licenses are specified for both documentation and code.
136 | - [ ] Instructions for how to cite the project are given.
137 | 
138 | ### Version control
139 | 
140 | - [ ] Code is [version controlled using Git](https://git-scm.com/).
141 | - [ ] Code is committed regularly, preferably when a discrete unit of work has been completed.
142 | - [ ] An appropriate branching strategy is defined and used throughout development.
143 | - [ ] Code is open-sourced. Any sensitive data are omitted or replaced with dummy data.
144 | 
145 | ### Configuration
146 | 
147 | - [ ] Credentials and other secrets are not written in code but are configured as environment variables.
148 | - [ ] Configuration is clearly separated from code used for analysis, so that it is simple to identify and update.
149 | - [ ] The configuration used to generate particular outputs, releases and publications is recorded.
150 | 
151 | ### Data management
152 | 
153 | - [ ] Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/).
154 | - [ ] All data for analysis are stored in an open format, so that specific software is not required to access them.
155 | - [ ] Input data are stored safely and are treated as read-only.
156 | - [ ] Input data are versioned. All changes to the data result in new versions being created, or [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension).
157 | - [ ] All input data are documented in a data register, including where they come from and their importance to the analysis.
158 | - [ ] Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops. Your analysis code is able to reproduce them at any time.
159 | - [ ] Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others.
160 | - [ ] Data quality is monitored, as per [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework).
161 | 
162 | ### Peer review
163 | 
164 | - [ ] Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant.
165 | 
166 | ### Testing
167 | 
168 | - [ ] Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/).
169 | - [ ] Code based tests are run regularly, ideally being automated using continuous integration.
170 | - [ ] Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
171 | - [ ] Informal tests are recorded near to the code.
172 | - [ ] Stakeholder or user acceptance sign-offs are recorded near to the code.
173 | 
174 | ### Dependency management
175 | 
176 | - [ ] Required passwords, secrets and tokens are documented, but are stored outside of version control.
177 | - [ ] Required libraries and packages are documented, including their versions.
178 | - [ ] Working operating system environments are documented.
179 | - [ ] Example configuration files are provided.
180 | 
181 | ### Logging
182 | 
183 | - [ ] Misuse or failure in the code produces informative error messages.
184 | - [ ] Code configuration is recorded when the code is run.
185 | 
186 | ### Project management
187 | 
188 | - [ ] The roles and responsibilities of team members are clearly defined.
189 | - [ ] An issue tracker (e.g. GitHub Projects, Trello or Jira) is used to record development tasks.
190 | - [ ] New issues or tasks are guided by users’ needs and stories.
191 | - [ ] Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
192 | - [ ] Quality assurance standards and processes for the project are defined. These are based around [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
193 | ```
194 | 
--------------------------------------------------------------------------------
/book/checklist_moderate.md:
--------------------------------------------------------------------------------
1 | # Moderate quality assurance
2 | 
3 | ## Quality assurance checklist
4 | 
5 | Quality assurance checklist from [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
6 | 
7 | ### Modular code
8 | 
9 | - Individual pieces of logic are written as functions. Classes are used if more appropriate.
10 | - Code is grouped in themed files (modules) and is packaged for easier use.
11 | - Main analysis scripts import and run high level functions from the package.
12 | - Low level functions and classes carry out one specific task. As such, there is only one reason to change each function.
13 | - Repetition in the code is minimised. For example, by moving reusable code into functions or classes.
14 | 
15 | ### Good coding practices
16 | 
17 | - Names used in the code are informative and concise.
18 | - Names used in the code are explicit, rather than implicit.
19 | - Code logic is clear and avoids unnecessary complexity.
20 | - Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/) and
21 | [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R.
22 | 
23 | ### Project structure
24 | 
25 | - A clear, standard directory structure is used to separate input data, outputs, code and documentation.
26 | - Packages follow a standard structure.
27 | 
28 | ### Code documentation
29 | 
30 | - Comments are used to describe why code is written in a particular way, rather than describing what the code is doing.
31 | - Comments are kept up to date, so they do not confuse the reader.
32 | - Code is not commented out to adjust which lines of code run.
33 | - All functions and classes are documented to describe what they do, what inputs they take and what they return.
34 | - Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/). R code is [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html).
35 | - Human-readable (preferably HTML) documentation is generated automatically from code documentation.
36 | 
37 | ### Project documentation
38 | 
39 | - A README file details the purpose of the project, basic installation instructions, and examples of usage.
40 | - Where appropriate, guidance for prospective contributors is available including a code of conduct.
41 | - If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases.
42 | - The extent of analytical quality assurance conducted on the project is clearly documented.
43 | - Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users.
44 | - Copyright and licenses are specified for both documentation and code.
45 | - Instructions for how to cite the project are given.
46 | - Releases of the project used for reports, publications, or other outputs are versioned using a standard pattern such as [semantic versioning](https://semver.org/).
47 | - A summary of [changes to functionality is documented in a changelog](https://keepachangelog.com/en/1.0.0/) following releases. The changelog is available to users.
48 | - Example usage of packages and underlying functionality is documented for developers and users.
49 | 
50 | ### Version control
51 | 
52 | - Code is [version controlled using Git](https://git-scm.com/).
53 | - Code is committed regularly, preferably when a discrete unit of work has been completed.
54 | - An appropriate branching strategy is defined and used throughout development.
55 | - Code is open-sourced. Any sensitive data are omitted or replaced with dummy data.
56 | - Commit standards are followed, such as supplying an appropriate commit summary and message.
57 | - Commits are tagged at significant stages. This is used to indicate the state of code for specific releases or model versions.
58 | - Continuous integration is applied through tools such as [GitHub Actions](https://github.com/features/actions),
59 | to ensure that each change is integrated into the workflow smoothly.
60 | 
61 | ### Configuration
62 | 
63 | - Credentials and other secrets are not written in code but are configured as environment variables.
64 | - Configuration is stored in a dedicated configuration file, separate to the code.
65 | - If appropriate, multiple configuration files are used depending on system/local/user.
66 | - Configuration files are version controlled separately to the analysis code, so that they can be updated independently.
67 | - The configuration used to generate particular outputs, releases and publications is recorded.
68 | 
69 | ### Data management
70 | 
71 | - Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/).
72 | - All data for analysis are stored in an open format, so that specific software is not required to access them.
73 | - Input data are stored safely and are treated as read-only.
74 | - Input data are versioned. All changes to the data result in new versions being created, or [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension).
75 | - All input data are documented in a data register, including where they come from and their importance to the analysis.
76 | - Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops.
77 | Your analysis code is able to reproduce them at any time.
78 | - Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others.
79 | - Data quality is monitored, as per
80 | [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework).
81 | - Fields within input and output datasets are documented in a data dictionary.
82 | - Large or complex data are stored in a database.
83 | 
84 | ### Peer review
85 | 
86 | - Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant.
87 | - Pair programming is used to review code and share knowledge.
88 | - Users are encouraged to participate in peer review.
89 | 
90 | ### Testing
91 | 
92 | - Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and [`testthat` for R](https://testthat.r-lib.org/).
93 | - Code based tests are run regularly.
94 | - Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
95 | - Informal tests are recorded near to the code.
96 | - Stakeholder or user acceptance sign-offs are recorded near to the code.
97 | - Tests are automatically run and recorded using continuous integration or git hooks.
98 | - The whole process is tested from start to finish using one or more realistic end-to-end tests.
99 | - Test code is clean and readable. Tests make use of fixtures and parameterisation to reduce repetition.
100 | 
101 | ### Dependency management
102 | 
103 | - Required passwords, secrets and tokens are documented, but are stored outside of version control.
104 | - Required libraries and packages are documented, including their versions.
105 | - Working operating system environments are documented.
106 | - Example configuration files are provided.
107 | - Where appropriate, code runs independently of the operating system (e.g. suitable management of file paths).
108 | - Dependencies are managed separately for users, developers, and testers.
109 | - There are as few dependencies as possible.
110 | - Package dependencies are managed using an environment manager such as [virtualenv for Python](https://virtualenv.pypa.io/en/latest/)
111 | or [renv for R](https://rstudio.github.io/renv/articles/renv.html).
112 | 
113 | ### Logging
114 | 
115 | - Misuse or failure in the code produces informative error messages.
116 | - Code configuration is recorded when the code is run.
117 | - Pipeline route is recorded if decisions are made in code.
118 | 
119 | ### Project management
120 | 
121 | - The roles and responsibilities of team members are clearly defined.
122 | - An issue tracker (e.g. GitHub Projects, Trello or Jira) is used to record development tasks.
123 | - New issues or tasks are guided by users’ needs and stories.
124 | - Issue templates are used to ensure proper logging of the title, description, labels and comments.
125 | - Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
126 | - Quality assurance standards and processes for the project are defined. These are based around
127 | [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
128 | 
129 | ## Template checklist
130 | 
131 | You can either refer to the checklist above, or use the Markdown template below to include the checklist in your project.
132 | 
133 | ```{code-block} md
134 | ## Quality assurance checklist
135 | 
136 | Quality assurance checklist from
137 | [the quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
138 | 
139 | ### Modular code
140 | 
141 | - [ ] Individual pieces of logic are written as functions. Classes are used if more appropriate.
142 | - [ ] Code is grouped in themed files (modules) and is packaged for easier use.
143 | - [ ] Main analysis scripts import and run high level functions from the package.
144 | - [ ] Low level functions and classes carry out one specific task. As such, there is only one reason to change each function.
145 | - [ ] Repetition in the code is minimised. For example, by moving reusable code into functions or classes.
146 | 
147 | ### Good coding practices
148 | 
149 | - [ ] Names used in the code are informative and concise.
150 | - [ ] Names used in the code are explicit, rather than implicit.
151 | - [ ] Code logic is clear and avoids unnecessary complexity.
152 | - [ ] Code follows a standard style, e.g. [PEP8 for Python](https://www.python.org/dev/peps/pep-0008/)
153 | and [Google](https://google.github.io/styleguide/Rguide.html) or [tidyverse](https://style.tidyverse.org/) for R.
154 | 
155 | ### Project structure
156 | 
157 | - [ ] A clear, standard directory structure is used to separate input data, outputs, code and documentation.
158 | - [ ] Packages follow a standard structure.
159 | 
160 | ### Code documentation
161 | 
162 | - [ ] Comments are used to describe why code is written in a particular way, rather than describing what the code is doing.
163 | - [ ] Comments are kept up to date, so they do not confuse the reader.
164 | - [ ] Code is not commented out to adjust which lines of code run.
165 | - [ ] All functions and classes are documented to describe what they do, what inputs they take and what they return.
166 | - [ ] Python code is [documented using docstrings](https://www.python.org/dev/peps/pep-0257/).
167 | R code is [documented using `roxygen2` comments](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html).
168 | - [ ] Human-readable (preferably HTML) documentation is generated automatically from code documentation.
169 | 
170 | ### Project documentation
171 | 
172 | - [ ] A README file details the purpose of the project, basic installation instructions, and examples of usage.
173 | - [ ] Where appropriate, guidance for prospective contributors is available including a code of conduct.
174 | - [ ] If the code's users are not familiar with the code, desk instructions are provided to guide lead users through example use cases.
175 | - [ ] The extent of analytical quality assurance conducted on the project is clearly documented.
176 | - [ ] Assumptions in the analysis and their quality are documented next to the code that implements them. These are also made available to users.
177 | - [ ] Copyright and licenses are specified for both documentation and code.
178 | - [ ] Instructions for how to cite the project are given.
179 | - [ ] Releases of the project used for reports, publications, or other outputs are versioned using a standard pattern such as
180 | [semantic versioning](https://semver.org/).
181 | - [ ] A summary of [changes to functionality is documented in a changelog](https://keepachangelog.com/en/1.0.0/) following releases.
182 | The changelog is available to users.
183 | - [ ] Example usage of packages and underlying functionality is documented for developers and users.
184 | 
185 | ### Version control
186 | 
187 | - [ ] Code is [version controlled using Git](https://git-scm.com/).
188 | - [ ] Code is committed regularly, preferably when a discrete unit of work has been completed.
189 | - [ ] An appropriate branching strategy is defined and used throughout development.
190 | - [ ] Code is open-sourced. Any sensitive data are omitted or replaced with dummy data.
191 | - [ ] Commit standards are followed, such as supplying an appropriate commit summary and message.
192 | - [ ] Commits are tagged at significant stages. This is used to indicate the state of code for specific releases or model versions.
193 | - [ ] Continuous integration is applied through tools such as [GitHub Actions](https://github.com/features/actions),
194 | to ensure that each change is integrated into the workflow smoothly.
195 | 
196 | ### Configuration
197 | 
198 | - [ ] Credentials and other secrets are not written in code but are configured as environment variables.
199 | - [ ] Configuration is stored in a dedicated configuration file, separate to the code.
200 | - [ ] If appropriate, multiple configuration files are used depending on system/local/user.
201 | - [ ] Configuration files are version controlled separately to the analysis code, so that they can be updated independently.
202 | - [ ] The configuration used to generate particular outputs, releases and publications is recorded.
203 | 
204 | ### Data management
205 | 
206 | - [ ] Published outputs meet [accessibility regulations](https://analysisfunction.civilservice.gov.uk/area_of_work/accessibility/).
207 | - [ ] All data for analysis are stored in an open format, so that specific software is not required to access them.
208 | - [ ] Input data are stored safely and are treated as read-only.
209 | - [ ] Input data are versioned. All changes to the data result in new versions being created, or
210 | [changes are recorded as new records](https://en.wikipedia.org/wiki/Slowly_changing_dimension).
211 | - [ ] All input data are documented in a data register, including where they come from and their importance to the analysis.
212 | - [ ] Outputs from your analysis are disposable and are regularly deleted and regenerated while analysis develops.
213 | Your analysis code is able to reproduce them at any time.
214 | - [ ] Non-sensitive data are made available to users. If data are sensitive, dummy data is made available so that the code can be run by others.
215 | - [ ] Data quality is monitored, as per [the government data
216 | quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework).
217 | - [ ] Fields within input and output datasets are documented in a data dictionary.
218 | - [ ] Large or complex data are stored in a database.
219 | 
220 | ### Peer review
221 | 
222 | - [ ] Peer review is conducted and recorded near to the code. Merge or pull requests are used to document review, when relevant.
223 | - [ ] Pair programming is used to review code and share knowledge.
224 | - [ ] Users are encouraged to participate in peer review.
225 | 
226 | ### Testing
227 | 
228 | - [ ] Core functionality is unit tested as code. See [`pytest` for Python](https://docs.pytest.org/en/stable/) and
229 | [`testthat` for R](https://testthat.r-lib.org/).
230 | - [ ] Code based tests are run regularly.
231 | - [ ] Bug fixes include implementing new unit tests to ensure that the same bug does not reoccur.
232 | - [ ] Informal tests are recorded near to the code.
233 | - [ ] Stakeholder or user acceptance sign-offs are recorded near to the code.
234 | - [ ] Tests are automatically run and recorded using continuous integration or git hooks.
235 | - [ ] The whole process is tested from start to finish using one or more realistic end-to-end tests.
236 | - [ ] Test code is clean and readable. Tests make use of fixtures and parameterisation to reduce repetition.
237 | 
238 | ### Dependency management
239 | 
240 | - [ ] Required passwords, secrets and tokens are documented, but are stored outside of version control.
241 | - [ ] Required libraries and packages are documented, including their versions.
242 | - [ ] Working operating system environments are documented.
243 | - [ ] Example configuration files are provided.
244 | - [ ] Where appropriate, code runs independently of the operating system (e.g. suitable management of file paths).
245 | - [ ] Dependencies are managed separately for users, developers, and testers.
246 | - [ ] There are as few dependencies as possible.
247 | - [ ] Package dependencies are managed using an environment manager such as [virtualenv for Python](https://virtualenv.pypa.io/en/latest/)
248 | or [renv for R](https://rstudio.github.io/renv/articles/renv.html).
249 | 
250 | ### Logging
251 | 
252 | - [ ] Misuse or failure in the code produces informative error messages.
253 | - [ ] Code configuration is recorded when the code is run.
254 | - [ ] Pipeline route is recorded if decisions are made in code.
255 | 
256 | ### Project management
257 | 
258 | - [ ] The roles and responsibilities of team members are clearly defined.
259 | - [ ] An issue tracker (e.g. GitHub Projects, Trello or Jira) is used to record development tasks.
260 | - [ ] New issues or tasks are guided by users’ needs and stories.
261 | - [ ] Issue templates are used to ensure proper logging of the title, description, labels and comments.
262 | - [ ] Acceptance criteria are noted for issues and tasks. Fulfilment of acceptance criteria is recorded.
263 | - [ ] Quality assurance standards and processes for the project are defined. These are based around
264 | [the quality assurance of code for analysis and research guidance document](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
265 | ```
266 | 
--------------------------------------------------------------------------------
/book/checklists.md:
--------------------------------------------------------------------------------
1 | # Code quality assurance checklists
2 | 
3 | This section aims to provide checklists for the quality assurance of analytical projects in government.
4 | 5 | As per the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government), 6 | quality assurance should be proportionate to the complexity and risk of your analysis. 7 | With this in mind, we have provided checklists for three levels of quality assurance. 8 | 9 | We recommend that you consider the risk and complexity associated with your project. 10 | Given this assessment, you should select and tailor the checklists that we have provided. 11 | Risk tolerance varies between government departments, so it is important that you consider the operational context for the code when deciding what quality assurance is adequate. 12 | You may choose to select elements from each level of quality assurance to address the specific risks associated with your work. 13 | 14 | 15 | ## Checklists 16 | 17 | ```{tableofcontents} 18 | ``` 19 | -------------------------------------------------------------------------------- /book/configuration.md: -------------------------------------------------------------------------------- 1 | # Configuration 2 | 3 | Configuration describes how your code runs when you execute it. 4 | 5 | In analysis, we often want to run our analysis code using different inputs or parameters. 6 | And we likely want other analysts to be able to run our code on different machines, for example, to reproduce our results. 7 | This section describes how we can define analysis configuration that is easy to update and can remain separate from the logic in our analysis. 8 | 9 | 10 | ## Basic configuration 11 | 12 | Configuration for your analysis code should include high level parameters (settings) that can be used to easily adjust how your analysis runs. 13 | This might include paths to input and output files, database connection settings, and model parameters that are likely to be adjusted between runs. 
14 | 
15 | In early development of our analysis, let's imagine that we have a script that looks something like this:
16 | 
17 | ````{tabs}
18 | 
19 | ```{code-tab} python
20 | # Note: this is not an example of good practice
21 | # This is intended as an example of what early pipeline code might look like
22 | data = read_csv("C:/a/very/specific/path/to/input_data.csv")
23 | 
24 | variables_test, variables_train, outcome_test, outcome_train = train_test_split(data[["a", "b", "c"]], data["outcome"], test_size=0.3, random_seed=42)
25 | 
26 | model = Model()
27 | model.fit(variables_train, outcome_train)
28 | 
29 | # prediction = model.predict(variables_test, constant_a=4.5, max_v=100)
30 | prediction = model.predict(variables_test, constant_a=7, max_v=1000)
31 | 
32 | prediction.to_csv("outputs/predictions.csv")
33 | 
34 | ```
35 | 
36 | ```{code-tab} r R
37 | # Note: this is not an example of good practice
38 | # This is intended as an example of what early pipeline code might look like
39 | data <- utils::read.csv("C:/a/very/specific/path/to/input_data.csv")
40 | 
41 | set.seed(42)
42 | split <- caTools::sample.split(data, SplitRatio = .3)
43 | 
44 | train_data <- data[split, ]
45 | test_data <- data[!split, ]
46 | 
47 | model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "model.frame")
48 | # model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = "glm.fit")
49 | 
50 | prediction <- predict(model, test_data, type = "response")
51 | 
52 | utils::write.csv(prediction, "outputs/predictions.csv")
53 | 
54 | ```
55 | 
56 | ````
57 | 
58 | Here we're reading in some data and splitting it into subsets for training and testing a model.
59 | We use one subset of variables and outcomes to train our model and then use the other subset to test the model.
60 | Finally, we write the model's predictions to a `.csv` file.
61 | 
62 | The file paths we use to read and write data in our script are particular to our working environment.
63 | These files and paths may not exist on another analyst's machine.
64 | As such, to run our code, other analysts need to read through the script and replace these paths.
65 | As we'll demonstrate below, collecting flexible parts of our code together makes it easier for others to update them.
66 | 
67 | When splitting our data and using our model to make predictions, we've provided some parameters to the functions that we have used to perform these tasks.
68 | Eventually, we might reuse some of these parameters elsewhere in our script (e.g., the random seed)
69 | and we are likely to adjust these parameters between runs of our analysis.
70 | We should store them in variables to make it easier to adjust these consistently throughout our script.
71 | We should also store these variables with any other parameters and options, so that it's easy to identify where they should be adjusted.
72 | 
73 | Note that in this example we've tried our model prediction twice, with different parameters.
74 | We've used comments to switch between which of these lines of code runs.
75 | This practice is common, especially when we want to make a number of changes when developing how our analysis should run.
76 | However, commenting sections of code in this way makes it difficult for others to understand our code and reproduce our results.
77 | We should avoid this form of ambiguity because another analyst would not be sure which set of parameters was used to produce a given set of predictions.
78 | Below, we'll look at some better alternatives for storing and switching analysis parameters.
79 | 
80 | ````{tabs}
81 | 
82 | ```{code-tab} python
83 | # Note: this is not an example of good practice
84 | # This is intended as an example of basic in-code configuration.
85 | 
86 | # Configuration
87 | input_path = "C:/a/very/specific/path/to/input_data.csv"
88 | output_path = "outputs/predictions.csv"
89 | 
90 | test_split_proportion = 0.3
91 | random_seed = 42
92 | 
93 | prediction_parameters = {
94 |     "constant_a": 7,
95 |     "max_v": 1000
96 | }
97 | 
98 | # Analysis
99 | data = read_csv(input_path)
100 | 
101 | variables_test, variables_train, outcome_test, outcome_train = train_test_split(data[["a", "b", "c"]], data["outcome"], test_size=test_split_proportion, random_seed=random_seed)
102 | 
103 | model = Model()
104 | model.fit(variables_train, outcome_train)
105 | 
106 | prediction = model.predict(variables_test, constant_a=prediction_parameters["constant_a"], max_v=prediction_parameters["max_v"])
107 | 
108 | prediction.to_csv(output_path)
109 | ```
110 | 
111 | ```{code-tab} r R
112 | # Note: this is not an example of good practice
113 | # This is intended as an example of basic in-code configuration.
114 | 
115 | # Configuration
116 | input_path <- "C:/a/very/specific/path/to/input_data.csv"
117 | output_path <- "outputs/predictions.csv"
118 | 
119 | random_seed <- 42
120 | test_split_proportion <- .3
121 | model_method <- "glm.fit"
122 | 
123 | # Analysis
124 | data <- utils::read.csv(input_path)
125 | 
126 | set.seed(random_seed)
127 | split <- caTools::sample.split(data, SplitRatio = test_split_proportion)
128 | 
129 | train_data <- data[split, ]
130 | test_data <- data[!split, ]
131 | 
132 | model <- glm(formula = outcome ~ a + b + c, family = binomial(link = "logit"), data = train_data, method = model_method)
133 | 
134 | prediction <- predict(model, test_data, type = "response")
135 | 
136 | utils::write.csv(prediction, output_path)
137 | ```
138 | 
139 | ````
140 | 
141 | Separating configuration from the rest of our code makes it easy to adjust these parameters and apply them consistently throughout the analysis script.
142 | We're able to use basic objects (like lists and dictionaries) to group related parameters.
143 | We then reference these objects in the analysis section of our script.
144 | 
145 | Our configuration could be extended to include other parameters, including which variables we're selecting to train our model.
146 | However, we must keep the configuration simple and easy to maintain.
147 | Before moving aspects of code to the configuration, consider whether it improves your workflow.
148 | Your configuration should include things that depend on the computer that you are using (e.g., file paths) or that are likely to change between runs of your analysis.
149 | 
150 | 
151 | ## Use separate configuration files
152 | 
153 | We can take our previous example one step further using independent configuration files.
154 | We simply take our collection of variables, containing parameters and options for our analysis, and move them to a separate file.
155 | These files can be written in the same language as your code or other simple languages, as we'll describe in the following subsections.
156 | 
157 | Storing our analysis configuration in a separate file to the analysis code is a useful separation.
158 | It means that we can version control our code based solely on changes to the overall logic - when we fix bugs or add new features.
159 | We can then keep a separate record of which configuration files were used with our code to generate specific results.
160 | We can easily switch between multiple configurations by providing our analysis code with different configuration files.
161 | 
162 | You may not want to version control your configuration file if it includes file paths that are specific to your machine or references to sensitive data.
163 | In this case, include a sample or example configuration file, so others can use this as a template to configure the analysis for their own environment.
164 | It is key to keep this template up to date, so that it is compatible with your code.
165 | 
166 | 
167 | ### Use code files for configuration
168 | 
169 | We can move our parameter variables directly from our analysis script into another code script, which then acts as our configuration file.
170 | Because these variables are defined in the programming language that our analysis uses, it's easy to access them in our analysis script.
171 | In Python, variables from these config files can be imported into your analysis script.
172 | In R, your script might `source()` the config file to read the variables into the R environment.
173 | 
174 | 
175 | ### Use dedicated configuration files
176 | 
177 | Many other file formats can be used to store configuration parameters.
178 | You may have come across data-serialisation languages (including YAML, TOML, JSON and XML), which can be used independently of your programming language.
179 | 
180 | If we represent our example configuration from above in YAML, it would look like this:
181 | 
182 | ```yaml
183 | input_path: "C:/a/very/specific/path/to/input_data.csv"
184 | output_path: "outputs/predictions.csv"
185 | 
186 | test_split_proportion: 0.3
187 | random_seed: 42
188 | 
189 | prediction_parameters:
190 |   constant_a: 7
191 |   max_v: 1000
192 | ```
193 | 
194 | You can use relevant libraries to read configuration files that are written in other languages.
195 | For example, we could read the YAML example into our analysis like this:
196 | 
197 | ````{tabs}
198 | 
199 | ```{code-tab} python
200 | import yaml
201 | 
202 | with open("./my_config.yaml") as file:
203 |     config = yaml.safe_load(file)
204 | 
205 | data = read_csv(config["input_path"])
206 | ...
207 | ```
208 | 
209 | ```{code-tab} r R
210 | config <- yaml::yaml.load_file("./my_config.yaml")
211 | 
212 | data <- read.csv(config$input_path)
213 | ...
214 | ```
215 | 
216 | ````
217 | 
218 | Configuration file formats like YAML and TOML are compact and human-readable.
219 | This makes them easy to interpret and update, even without knowledge of the underlying code used in the analysis.
220 | Reading these files in produces a single object containing all of the `key:value` pairs defined in our configuration file.
221 | We can then select our configuration parameters using their keys in our analysis.
222 | 
223 | 
224 | ## Use configuration files as arguments
225 | 
226 | In the previous example, we have stored our configuration options in a separate file and referenced this in our analysis script.
227 | Although this allows us to separate our configuration from the main codebase, we have used a hard-coded path to the configuration file.
228 | This is not ideal, as the configuration file must be saved at the same path on any machine that runs the code.
229 | Furthermore, if we want to switch the configuration file that the analysis uses we must change this path or replace the configuration file at the specified path.
230 | 
231 | To overcome this, we can adjust our analysis script to take the configuration file path as an argument when the script is run.
232 | We can achieve this in a number of ways, but we'll discuss a minimal example here:
233 | 
234 | ````{tabs}
235 | 
236 | ```{code-tab} python
237 | import sys
238 | import yaml
239 | 
240 | if len(sys.argv) < 2:
241 |     # The Python script name is counted as the first argument
242 |     raise ValueError("Configuration file must be passed as an argument.")
243 | 
244 | config_path = sys.argv[1]
245 | with open(config_path) as file:
246 |     config = yaml.safe_load(file)
247 | ...
248 | ```
249 | 
250 | ```{code-tab} r R
251 | args <- commandArgs(trailingOnly = TRUE)  # exclude the R executable and script name
252 | if (length(args) < 1) {
253 |   stop("Configuration file must be passed as an argument.")
254 | }
255 | 
256 | config_path <- args[1]
257 | config <- yaml::yaml.load_file(config_path)
258 | ...
259 | ```
260 | 
261 | ````
262 | 
263 | When executing the analysis file above, we pass the path to our configuration file after calling the script.
264 | If our script was named 'analysis_script', it would be called from the command line as:
265 | 
266 | ````{tabs}
267 | 
268 | ```{code-tab} sh Python
269 | python analysis_script.py /path/to/my_configuration.yaml
270 | ```
271 | 
272 | ```{code-tab} sh R
273 | Rscript analysis_script.R /path/to/my_configuration.yaml
274 | ```
275 | 
276 | ````
277 | 
278 | If we now want to run our analysis with a different configuration we can simply pass another configuration file to the script.
279 | This means that we don't need to change our code to account for changes to the configuration.
280 | 
281 | ```{note}
282 | It is possible to pass configuration options directly as arguments in this way, instead of referencing a configuration file.
283 | However, you should use configuration files as they allow you to document which configuration
284 | has been used to produce your analysis outputs, for reproducibility.
285 | ```
286 | 
287 | 
288 | (environment-variables)=
289 | ## Configure secrets as environment variables
290 | 
291 | Environment variables are variables that are available in a particular environment.
292 | In most analysis contexts, our environment is the user environment that we are running our code from.
293 | This might be your local machine or an analysis platform.
294 | 
295 | If your code depends on credentials of some kind, do not write these in your code.
296 | You can store passwords and keys in configuration files, but there is a risk that these files may be included in [version control](version_control.md).
297 | To avoid this risk, store this information in local environment variables.
298 | 
299 | Environment variables can also be useful for storing other environment-dependent variables.
300 | For example, the location of a database or a software dependency.
301 | We might prefer this over a configuration file if the code requires very few other options.
302 | 
303 | In Unix systems (e.g., Linux and Mac), you can set environment variables in the terminal using `export` and delete them using `unset`:
304 | 
305 | ```none
306 | export SECRET_KEY="mysupersecretpassword"
307 | unset SECRET_KEY
308 | ```
309 | 
310 | In Windows, the equivalent commands to these are:
311 | 
312 | ```none
313 | setx SECRET_KEY "mysupersecretpassword"
314 | reg delete HKCU\Environment /F /V SECRET_KEY
315 | ```
316 | 
317 | You can alternatively define them using a graphical interface under `Edit environment variables for your account` in your Windows settings.
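During development, some analysts also keep environment variables in a local `.env` file that is excluded from version control, loading it at the start of a session. Here is a minimal sketch, assuming the third-party `python-dotenv` package (which is not otherwise used in this book) is installed:

```python
# A sketch of loading secrets from a local .env file with python-dotenv.
# Assumes: `pip install python-dotenv` and a .env file in the working
# directory containing a line like SECRET_KEY=mysupersecretpassword.
# Remember to add .env to your .gitignore so it stays out of version control.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env and sets its variables for this process only

my_key = os.environ.get("SECRET_KEY")
```

Note that variables loaded this way exist only for the running process, unlike those set persistently with `setx` or in your shell profile.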
318 | 319 | Once stored in environment variables, these variables will remain available in your environment until you delete them. 320 | 321 | You can access this variable in your code like so: 322 | 323 | ````{tabs} 324 | 325 | ```{code-tab} python 326 | import os 327 | 328 | my_key = os.environ.get("SECRET_KEY") 329 | ``` 330 | 331 | ```{code-tab} r R 332 | my_key <- Sys.getenv("SECRET_KEY") 333 | ``` 334 | 335 | ```` 336 | 337 | It is then safer for this code to be shared with others, as they can't acquire your credentials without access to your environment. 338 | -------------------------------------------------------------------------------- /book/continuous_integration.md: -------------------------------------------------------------------------------- 1 | # Automating code quality assurance 2 | 3 | ## Motivation 4 | 5 | You can automate various tasks to increase the quality of code and make development easier and less tedious. 6 | Automating the running of unit tests is especially important to establish trust in your pipeline or package, by ensuring that all unit tests pass before every merge. 7 | 8 | 9 | (automate-tests)= 10 | ### Automate tests to reduce risk of errors 11 | 12 | You should run tests whenever you make changes to your project. 13 | This ensures that changes do not break the existing, intended functionality of your code. 14 | However, it is easy to forget to run your tests at regular intervals. 15 | 16 | "Surely I can automate this too?" 17 | 18 | Absolutely! Automatic testing, amongst other quality assurance measures, can be triggered when you make changes to your remote version control repository. 19 | You can use these tools to ensure that all changes to a project are tested. 20 | Additionally, it allows others, who are reviewing your code, to see the results of your tests. 21 | 22 | 23 | (continuous-integration)= 24 | ## Commit small changes often 25 | 26 | Committing small changes regularly is often referred to as Continuous Integration (CI). 27 | You can achieve this easily through the use of [version control](version_control.md), such as Git. 28 | 29 | You should commit every time you make a working change. 30 | Fixed a typo? Commit. Fixed a bug? Commit. Added a function? Commit. Added a test? Commit. 31 | As a very rough guide, you should expect to commit a few times each hour and push your changes to your shared software repository at least once a day. 32 | If the task is unfinished at the end of the day, you should consider if the task has been sufficiently broken down. 33 | 34 | CI should be underpinned by automating routine code quality assurance tasks. 35 | This quality assurance includes verifying that your code successfully builds or installs. It also ensures your [code tests](testing_code.md) run successfully. 36 | You can achieve this in a number of ways such as use of Git hooks and workflows. 37 | 38 | 39 | ## Use Git hooks to encourage good practice 40 | 41 | [Git hooks](https://git-scm.com/docs/githooks) are scripts that can be set to run locally at specific points in your Git workflow, 42 | such as pre-commit, pre-push, etc. 43 | You can use them to automate code quality assurance tasks, e.g., run tests, follow style guides, or enforce commit standards. 44 | 45 | For example, you could set up a `pre-commit` or `pre-push` hook that runs your tests before you make each commit or push to the remote repository. 46 | This might stop your commit/push if the tests fail, so that you won't push breaking changes to your remote repository. 
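As a concrete illustration, here is a minimal sketch of such a hook for a Python project, saved as `.git/hooks/pre-push` and made executable. The use of `pytest` here is an assumption about how your tests are run:

```sh
#!/bin/sh
# A sketch of a Git pre-push hook that runs the test suite before each push.
# Git aborts the push if this script exits with a non-zero status.
# Make it executable first: chmod +x .git/hooks/pre-push
exec pytest
```

Frameworks such as [pre-commit](https://pre-commit.com/) make hooks like this easier to share across a team, because the hook configuration can be version controlled alongside the code.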
47 | 
48 | ```{note}
49 | If your code is likely to be run on a range of software versions or operating systems, you can test on a variety of these.
50 | Tools exist to support local testing of combinations of software versions and package dependency versions:
51 | 
52 | * [tox](https://tox.readthedocs.io/en/latest/) or [nox](https://nox.thea.codes/en/stable/) for Python
53 | * [rhub](https://r-hub.github.io/rhub/) for R
54 | ```
55 | 
56 | 
57 | (linters-formatters)=
58 | ### Linters and formatters
59 | 
60 | Style guides are important for making sure your code is clear and readable and should form part of your quality assurance process.
61 | However, as discussed in [](automate-style-checks), the process of checking and fixing code for style and formatting is tedious.
62 | Automation can speed up this work, either by providing suggestions as you write the code or by reformatting your code to follow your chosen style.
63 | 
64 | Two main types of tool exist for these tasks:
65 | 
66 | * Linters - these analyse your code to flag stylistic errors (and sometimes bugs or security issues too).
67 | * Formatters - these not only detect when you have diverged from a style, but will automatically correct the formatting of your code to conform to a particular style.
68 | 
69 | ```{list-table} Packages that can be used for linting or formatting in Python and R
70 | :header-rows: 1
71 | :name: linters
72 | 
73 | * - Language
74 |   - Linters
75 |   - Formatters
76 | * - Python
77 |   - `flake8`, `pylint`, `Bandit`
78 |   - `Black`, `Isort`
79 | * - R
80 |   - `lintr`
81 |   - `formatR`, `styler`
82 | * - Markdown
83 |   - `pymarkdownlnt`
84 |   -
85 | ```
86 | 
87 | You can use these tools locally (in the command line) or as git pre-commit hooks.
88 | As described above, using pre-commit hooks allows you to run these automatically every time there are changes;
89 | this can reduce the burden on developers and reviewers checking that code conforms to style guides.
90 | 
91 | Make sure you read the documentation for these tools before you use them to understand what they are checking or changing in your code.
92 | You can configure some of them to ignore or detect specific types of formatting error.
93 | You can also run several of these tools together to catch a broader range of stylistic or programmatic errors.
94 | 
95 | ## Workflows
96 | 
97 | GitHub Actions and GitLab Pipelines are both able to define custom workflows using YAML.
98 | A workflow is a defined sequence of steps and actions that you perform
99 | to complete a specific task or process.
100 | Workflows are commonly used in software development to automate repetitive or complex tasks,
101 | such as building and deploying software, testing code, and managing code reviews.
102 | GitHub Actions and GitLab Pipelines both allow automated workflows that trigger
103 | builds and tests whenever code changes are pushed to the repository.
104 | 
105 | ### Example use cases for GitHub Actions
106 | 
107 | Here are some examples to support understanding of these ideas.
108 | 
109 | #### Configure GitHub Actions to automate tests
110 | 
111 | Here is an example configuration file, for use with GitHub Actions.
112 | The `YAML` file format, used below, is common to a number of other CI tools.
113 | 
114 | ```yaml
115 | name: Test python package
116 | 
117 | on:
118 |   push:
119 |     branches:
120 |       - master
121 |   pull_request:
122 | 
123 | jobs:
124 |   build:
125 | 
126 |     runs-on: ubuntu-latest
127 |     strategy:
128 |       matrix:
129 |         python-version: ["3.8", "3.9", "3.10", "3.11"]
130 | 
131 |     steps:
132 |       - uses: actions/checkout@v3
133 | 
134 |       - name: Set up Python ${{ matrix.python-version }}
135 |         uses: actions/setup-python@v4
136 |         with:
137 |           python-version: ${{ matrix.python-version }}
138 | 
139 |       - name: Install dependencies
140 |         run: |
141 |           python -m pip install --upgrade pip
142 |           pip install pytest
143 |           pip install -r requirements.txt
144 | 
145 |       - name: Test with pytest
146 |         run: |
147 |           pytest
148 | ```
149 | 
150 | The first section of this example describes when we should run our workflow.
151 | In this case, we're running the CI workflow whenever code is `push`ed to the `master` branch or when any pull request is created.
152 | In the case of pull requests, the results of the CI workflow will be reported on the request's page.
153 | If any of the workflow stages fail, this can block the merge of these changes onto a more stable branch.
154 | Subsequent commits to the source branch will trigger the CI workflow to run again.
155 | 
156 | Below `jobs`, we're defining what tasks we would like to run when we trigger our workflow.
157 | We define what operating system we would like to run our workflow on - the Linux operating system `ubuntu` here.
158 | The `matrix` section under `strategy` defines parameters for the workflow.
159 | We will repeat the workflow for each combination of parameters supplied here - in this case 4 recent Python versions.
160 | 
161 | The individual stages of the workflow are defined under `steps`.
162 | `steps` typically have an informative name and run code to perform an action.
163 | Here `uses: actions/checkout@v3` references [existing code](https://github.com/actions/checkout) that will retrieve the code from our repo.
164 | The subsequent `steps` will use this code.
165 | The next step provides us with a specific Python version, as specified in the `matrix`.
166 | Then we install dependencies/requirements for our code and the `pytest` module.
167 | Finally, we run `pytest` to check that our code is working as expected.
168 | 
169 | This workflow will report whether our test code ran successfully for each of the specified Python versions.
170 | 
171 | #### Configure GitHub Actions to build and deploy documentation
172 | 
173 | It is important to maintain the documentation relating to your project to ensure contributors and users can understand, maintain, and use your product correctly.
174 | One basic way of doing this is maintaining markdown files within a GitHub repository.
175 | However, multiple tools exist that can transform these markdown files into HTML content.
176 | A popular tool for building and deploying HTML documentation is [Sphinx](https://www.sphinx-doc.org/en/master/).
177 | Here are two examples of repositories that use Sphinx to build their documentation:
178 | 
179 | * [Quality assurance of code for analysis and research (this book)](https://github.com/best-practice-and-impact/qa-of-code-guidance/blob/main/.github/workflows/book.yaml)
180 | * [govcookiecutter](https://github.com/best-practice-and-impact/govcookiecutter/blob/main/.github/workflows/govcookiecutter-deploy-documentation.yml)
181 | 
182 | ### Example GitLab Pipeline
183 | 
184 | GitLab has an equivalent to GitHub Actions called GitLab Pipelines.
185 | The use cases for these are practically the same, with a change in syntax and file structure.
186 | [Patrick's Software Blog](https://web.archive.org/web/20230321180431/https://www.patricksoftwareblog.com/setting-up-gitlab-ci-for-a-python-application/) provides a simple GitLab
187 | pipeline example and a detailed description of how to use it.
188 | For further details, [GitLab provides documentation on how to create and use GitLab Pipelines](https://docs.gitlab.com/ee/ci/).
189 | 
190 | ### Comprehensive example of automating code quality assurance
191 | 
192 | You can see a detailed example of CI in practice in the `jupyter-book` project.
193 | A recent version of the
194 | [`jupyter-book` CI workflow](https://github.com/executablebooks/jupyter-book/blob/6fb0cbe4abb5bc29e9081afbe24f71d864b40475/.github/workflows/tests.yml) includes:
195 | 
196 | * Checking code against style guidelines, using [pre-commit](https://pre-commit.com/).
197 | * Running code tests over:
198 |   * a range of Python versions.
199 |   * multiple versions of specific dependencies (`sphinx` here).
200 |   * multiple operating systems.
201 | * Reporting test coverage.
202 | * Checking that documentation builds successfully.
203 | * Deploying a new version of the `jupyter-book` package to [the python package index (PyPI)](https://pypi.org/).
204 | 
--------------------------------------------------------------------------------
/book/data.md:
--------------------------------------------------------------------------------
1 | # Data management
2 | 
3 | Data management covers a broad range of disciplines, including organising, storing and maintaining data.
4 | Dedicated data architects and engineers typically handle data management. However, analysts are often expected to manage their own data or will work alongside other data professionals.
5 | This section aims to highlight good data management practices, so that you can either appreciate how your organisation handles its data
6 | or implement your own data management solutions.
7 | 
8 | To reproduce a piece of analysis, we need to be able to identify and access the same data that the analysis used.
9 | This requires suitable storage of data, with documentation and versioning of the data where it may change over time.
10 | 
11 | ```{admonition} Key strategies
12 | :class: admonition-strategies
13 | 
14 | [The Government Data Quality Framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework)
15 | focuses primarily on assessing and improving the quality of input data.
16 | It should be a primary resource for all analysts working with data in the public sector.
17 | 
18 | The Office for Statistics Regulation provides a standard for
19 | [quality assurance of administrative data](https://osr.statisticsauthority.gov.uk/guidance/administrative-data-and-official-statistics/).
20 | ```
21 | 
22 | 
23 | ## Data storage
24 | 
25 | We assume that most data are now stored digitally.
26 | 
27 | Digital data risk becoming inaccessible as technology develops and commonly used software changes.
28 | Use open or standard file formats, not proprietary ones, for long term data storage.
29 | There are [recommended formats](https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats.aspx) for storing different data types,
30 | though we suggest avoiding formats that depend on proprietary software like SPSS, Stata, and SAS.
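As a small illustration, here is a sketch of writing the same table to two open formats using `pandas`. The package choice and file names are assumptions for this example, and writing parquet also requires `pyarrow` or `fastparquet`:

```python
# A sketch of saving a table in open, software-agnostic formats with pandas.
import pandas as pd

data = pd.DataFrame({"region": ["A", "B"], "count": [10, 20]})

# CSV: plain text, human-readable, and readable by almost any tool
data.to_csv("data.csv", index=False)

# Parquet: an open columnar format that preserves column types
# and compresses well for larger datasets
data.to_parquet("data.parquet")
```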
31 | 
32 | Short term storage, for use in analysis, might use any format that is suitable for the analysis task.
33 | However, most analysis tools should support reading data directly from safe long term storage, including databases.
34 | 
35 | 
36 | ### Spreadsheets
37 | 
38 | Spreadsheets (e.g., Microsoft Excel formats and open equivalents) are a very general data analysis tool.
39 | The cost of their easy to use interface and flexibility is increased difficulty of quality assurance.
40 | 
41 | ```{figure} https://imgs.xkcd.com/comics/norm_normal_file_format.png
42 | ---
43 | width: 50%
44 | name: data-file-comic
45 | alt: A comic strip describing someone sending "data" in the form of a screenshot of a spreadsheet document.
46 | ---
47 | .NORM Normal File Format, from [xkcd](https://xkcd.com/2116/)
48 | ```
49 | 
50 | Do not use spreadsheets for storage of data (or statistics production and modelling processes).
51 | Using spreadsheets to store data introduces problems like these:
52 | 
53 | * Lack of auditability - changes to data are not recorded.
54 | * Multiple users can't work with a single spreadsheet file at once, or risk complex versioning clashes which can be hard to resolve.
55 | * They are error prone and have no built in quality assurance.
56 | * Large files become cumbersome.
57 | * There is a risk of automatic "correction" of grammar and data type, which silently corrupts your data.
58 |   * Converting dates to a different datetime format.
59 |   * Converting numbers or text that resemble dates to dates.
60 | 
61 | See the European Spreadsheet Risks Interest Group document
62 | [spreadsheet related errors and their consequences](https://eusprig.org/research-info/horror-stories/) for more information.
63 | 
64 | 
65 | ### Databases
66 | 
67 | Databases are collections of related data, which can be easily accessed and managed.
68 | Each database contains one or more tables, which hold data.
69 | Database creation and management is carried out using a database management system (DBMS).
70 | A DBMS usually manages authorisation of access and allows multiple users to access the database concurrently.
71 | Popular open source DBMS include:
72 | 
73 | * SQLite
74 | * MySQL
75 | * PostgreSQL
76 | * Redis
77 | 
78 | Relational databases are the most common form of database.
79 | Common keys (e.g., unique identifiers) link data in the tables of a relational database.
80 | This allows you to store data with minimal duplication within a table, but quickly collect related data when required.
81 | Relational DBMS are called RDBMS.
82 | 
83 | Most DBMS communicate with databases using structured query language (SQL).
84 | 
85 | ```{admonition} Key Learning
86 | :class: admonition-learning
87 | 
88 | You might find this [foundations of SQL (government analysts only course)](https://learninghub.ons.gov.uk/enrol/index.php?id=1162)
89 | or [w3schools SQL tutorials](https://www.w3schools.com/sql/default.asp) useful for learning the basics of SQL.
90 | ```
91 | 
92 | Common analysis tools can interface with databases using SQL packages, or those which provide an object-relational mapping (ORM).
93 | An ORM is a non-SQL-based interface to connect to a database.
94 | ORMs are often user-friendly, but may not support all of the functionality that SQL offers.
95 | 
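To make the distinction concrete, the following minimal Python sketch queries a database with SQL directly, using the built-in `sqlite3` module; the database file and table names are illustrative:

```python
import sqlite3

import pandas as pd

# Connect to a local SQLite database file
connection = sqlite3.connect("analysis.db")

# Read the results of an SQL query directly into a dataframe
query = """
    SELECT region, SUM(population) AS total_population
    FROM population
    GROUP BY region
"""
regional_totals = pd.read_sql(query, connection)

connection.close()
```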
96 | ```{admonition} Key Learning
97 | :class: admonition-learning
98 | 
99 | This guide covers [Python SQL libraries](https://realpython.com/python-sql-libraries/) in detail,
100 | while this Software Carpentry course covers [SQL databases and R](http://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
101 | ```
102 | 
103 | Other database concepts:
104 | 
105 | * Schema - a blueprint that describes the field names and types for a table, including any other rules (constraints).
106 | * Query - a SQL command that creates, reads, updates or deletes data from a database.
107 | * View - a virtual table that provides a quick way to look at part of your database, defined by a stored query.
108 | * Indexes - data structures that can increase the speed of particular queries.
109 | 
110 | Good practices when working with databases include:
111 | 
112 | * Use auto-generated primary keys, rather than composites of multiple fields.
113 | * Break your data into logical chunks (tables), to reduce redundancy in each table.
114 | * Lock tables that should not be modified.
115 | 
116 | Other resources:
117 | 
118 | * This [SQL lecture from Harvard's computer science course](https://www.youtube.com/watch?v=u5pDdEKnbKA)
119 |   may be a useful introduction to working with databases from Python.
120 | * A guide to [using the `sqldf` R package](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html).
121 | 
122 | ## Documenting data
123 | 
124 | It is difficult to understand and work with a new dataset without documentation.
125 | 
126 | For our analysis, we should be able to quickly grasp:
127 | 
128 | * What data are available to us?
129 | * How were these data collected or generated?
130 | * How are these data represented?
131 | * Have these data been validated or manipulated?
132 | * How am I ethically and legally permitted to use the data?
133 | 
134 | Data providers and analysts should create this information in the form of documentation.
135 | 
136 | 
137 | ### Data dictionary
138 | 
139 | A data dictionary describes the contents and format of a dataset.
140 | 
141 | For variables in tabular datasets, you might document:
142 | 
143 | * A short description of what each variable represents.
144 | * The frame of reference of the data.
145 | * Variable labels, if categorical.
146 | * Valid values or ranges, if numerical.
147 | * Representation of missing data.
148 | * Reference to the question, if survey data.
149 | * Reference to any related variables in the dataset.
150 | * If derived, detail how variables were obtained or calculated.
151 | * Any rules for use or processing of the data, set by the data owner.
152 | 
153 | See this detailed example -
154 | the [National Workforce Data Set](https://www.datadictionary.nhs.uk/data_sets/administrative_data_sets/national_workforce_data_set.html#dataset_national_workforce_data_set),
155 | from the NHS Data Model and Dictionary.
156 | 
157 | Please see [UK Data Service guidance on documenting other data](https://www.ukdataservice.ac.uk/manage-data/document/data-level/tabular.aspx),
158 | including qualitative data.
159 | 
160 | 
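As an illustration, one lightweight approach is to keep a machine-readable data dictionary alongside the data. A minimal sketch for a single variable might look like the following (the field names and values are invented for the example, not a formal standard):

```yaml
variable: employment_status
description: Respondent's employment status in the reference week
type: categorical
labels:
  1: Employed
  2: Unemployed
  3: Economically inactive
missing_values: [-9]  # -9 represents "no response"
source_question: Q12
related_variables: [hours_worked]
```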
161 | ### Information Asset Register (IAR)
162 | 
163 | An information asset register (IAR) documents the information assets within your organisation.
164 | Your department should have one in place.
165 | As an analyst, you might use the register to identify contacts for data required for your analyses.
166 | 
167 | Although this form of documentation may not contain detailed information on how to use each data source (provided by data dictionaries), an IAR does increase visibility of data flows.
168 | An IAR may include:
169 | 
170 | * The owner of each dataset.
171 | * A high level description of the dataset.
172 | * The reason that your organisation holds the dataset.
173 | * How the information is stored and secured.
174 | * The risk of information being lost or compromised.
175 | 
176 | GOV.UK provides [IAR templates](https://www.gov.uk/government/publications/information-asset-register) that your department might use to structure its IAR.
177 | 
178 | 
179 | ## Version control data
180 | 
181 | To reproduce your analysis, you need to be able to identify the data that you used.
182 | Data change over time:
183 | open data and other secondary data may be revised over time, or cease to be available with no notice.
184 | You can't rely on owners of data to provide historical versions of their data.
185 | 
186 | As an analyst, it is your responsibility to ensure you identify the exact data that you have used.
187 | 
188 | You should version and document all changes to the data that you use, whether from a primary or secondary data source.
189 | You should include the reason why the version has changed in documentation for data versions.
190 | For example, an open data source may have been recollected, revisions may have been made to existing data, or part of the data may have been removed.
191 | 
192 | You should be able to generate your analytical outputs reproducibly and, as such, treat them as disposable.
193 | If this is not the case, version outputs so that you can easily link them to the versioned input data and analysis code.
194 | 
195 | To automate the versioning of data, you might use the Python package [DVC, which provides Git-like version control of data](https://dvc.org/) - a minimal usage sketch appears at the end of this section.
196 | This tool can also relate the data version to the version of analysis code, further facilitating reproducibility.
197 | You can use Git to version data, but you should be mindful of where your remote repository stores this data.
198 | The [`daff` package summarises changes in tabular data files](https://github.com/paulfitz/daff), which can be integrated with Git to investigate changes to data.
199 | 
200 | You might alternatively version your data manually e.g., by creating new database tables or files for each new version of the data.
201 | It must be possible to recreate previous versions of the data, for reproducibility.
202 | As such, it is important that you name data file versions uniquely, for example, using incrementing numbers and/or date of collection.
203 | Additionally, do not modify file versions after they have been used for analysis - they should be treated as read-only.
204 | All modifications to the data should result in new versions.
205 | 
206 | ```{todo}
207 | Diagram of good manual data versioning workflow.
208 | 
209 | [#22](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues/22)
210 | ```
211 | 
212 | Finally, for this to be effective, your analysis should record the version of data you used to generate a specified set of outputs.
213 | You might document this in analysis reports, or record it automatically in your code's logs.
214 | 
215 | 
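As a minimal sketch of the automated approach, assuming DVC is installed and the project is already a Git repository, versioning a new input file might look like:

```sh
# Initialise DVC alongside Git (run once per repository)
dvc init

# Place the data file under DVC control; this creates a small .dvc
# metadata file, which is versioned by Git in place of the data itself
dvc add data/survey_2023.csv

# Commit the metadata, linking this data version to the code version
git add data/survey_2023.csv.dvc data/.gitignore
git commit -m "Add 2023 survey extract (v1)"
```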
216 | ## Use these standards and guidance when publishing data
217 | 
218 | You should use the [5-star open data standards](https://5stardata.info/en/) to understand and improve the current utility of your published data.
219 | We recommend the [CSV on the Web (CSVW) standard](https://csvw.org/) for achieving the highest ratings of open data.
220 | 
221 | When publishing statistics, you should follow government guidance for [releasing statistics in spreadsheets](https://analysisfunction.civilservice.gov.uk/policy-store/releasing-statistics-in-spreadsheets/).
222 | 
223 | When publishing or sharing tabular data, you should follow the [GOV.UK Tabular data standard](https://www.gov.uk/government/publications/recommended-open-standards-for-government/tabular-data-standard).
224 | 
225 | Analysts producing published statistics may also be interested in [Connected Open Government Statistics (COGS)](https://analysisfunction.civilservice.gov.uk/the-gss-data-project/)
226 | and [the review of government data linking methods](https://www.gov.uk/government/publications/joined-up-data-in-government-the-future-of-data-linking-methods).
227 | 
228 | The UK Data Service provides guidance on [data security considerations](https://www.ukdataservice.ac.uk/manage-data/store/security).
229 | 
--------------------------------------------------------------------------------
/book/glossary.md:
--------------------------------------------------------------------------------
1 | # Glossary of terms
2 | 
3 | ```{note}
4 | This section is a draft, while we make sure that it meets user needs.
5 | 
6 | Please get in touch with feedback [by creating a GitHub Issue](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues) or [emailing us](mailto:ASAP@ons.gov.uk).
7 | ```
8 | 
9 | ## Terms
10 | ### Abstraction
11 | 
12 | Treating a problem as an idea or concept, rather than a detailed individual example.
13 | 
14 | Abstraction is used to manage the complexity of software, by describing our logic in a more generic way or by hiding complexity.
15 | When we use similar logic in multiple parts of our code, we can abstract this logic into a generic function or method to be reused.
16 | When part of our process is very complex, but is self-contained, we can hide this complexity by putting it inside a function and referring to the function.
17 | 
18 | 
19 | ### Application Programming Interface (API)
20 | 
21 | An interface that defines how you can interact with software through code.
22 | For example, the functions or methods from a package that a typical user will interact with.
23 | 
24 | 
25 | ### Attribute
26 | 
27 | A variable associated with a class object.
28 | 
29 | 
30 | ### Automated testing
31 | 
32 | Tests that are written in code, to check that other code works as expected.
33 | Tests are like a controlled experiment, checking that our code produces the expected outcome.
34 | Tests can check code at multiple levels - for example, checking that an individual function works or checking that a pipeline runs from end to end.
35 | 
36 | 
37 | ### Code interpreter
38 | 
39 | A computer program that runs code in a particular programming language.
40 | For example, the program that reads your Python or R analysis code and runs it.
41 | A non-interactive interpreter runs code in order, which is important for reproducibility.
42 | 
43 | Interactive interpreters allow you to run individual lines of code, which means that you can run code out of order.
44 | Notebooks use interactive interpreters.
45 | These are not suitable for running analysis pipelines, because they do not ensure that the code is run reproducibly.
46 | 
47 | 
48 | ### Class
49 | 
50 | A class contains code to create instances of objects, along with methods that act on those instances and their attributes. Methods may also produce output other than the class instance.
51 | 
52 | Examples of built-in Python classes are float, str, and bool.
53 | 
54 | Examples of built-in R classes are double, character, and logical.
55 | 
56 | 
57 | ### Cloud computing
58 | 
59 | Cloud computing allows us to access scalable computing power and data storage on-demand.
60 | This means that we can access high-specification hardware or store large amounts of data, usually only paying for what we have used.
61 | 
62 | The most common cloud platforms are Google Cloud Platform (GCP), Amazon Web Services (AWS) and Microsoft Azure.
63 | 
64 | 
65 | ### Code
66 | 
67 | The part of software that contains program instructions.
68 | In analysis, code is a set of human-readable instructions that tell a computer how to carry out our analysis.
69 | 
70 | 
71 | ### Code repository
72 | 
73 | A folder for storing the code and associated documents for a piece of software.
74 | These are often associated with Git, as the folder that is being version controlled.
75 | Also known as a 'repo'.
76 | 
77 | 
78 | ### Configuration file
79 | 
80 | Files that are used to configure the parameters or settings of a program.
81 | In analysis, these often include paths to input data files and other information that may change between each run of the pipeline.
82 | 
83 | 
84 | ### Continuous integration and continuous delivery (CI/CD)
85 | 
86 | CI/CD tools help to automate parts of your code development workflow.
87 | This can include running automated tests against your code and deploying the latest version of your code into production.
88 | GitHub Actions is an example tool for running CI/CD workflows.
89 | 
90 | Continuous integration describes regularly combining code changes from multiple contributors on a single software project.
91 | Integrating these changes regularly helps to check that independent changes work correctly together.
92 | Continuous delivery describes automation of the software release process.
93 | Automating both of these using a CI/CD tool increases the efficiency of building software and getting it into production.
94 | 
95 | 
96 | ### Debugger
97 | 
98 | A computer program that is used to test software and to identify the root cause of errors or "bugs".
99 | Debuggers allow you to pause the code at specific points and to walk through code step by step, in order to understand how it is working.
100 | 
101 | 
102 | ### Dependencies
103 | 
104 | Something that is required for your code to run correctly - your code depends upon it.
105 | This includes your operating system, environment, software and packages (and their versions), if these are needed to reproducibly run your analysis code.
106 | Dependencies are usually documented alongside code, so that others can prepare an environment to run the code.
107 | 
108 | 
109 | ### Documentation
110 | 
111 | Human readable text that describes your analysis and code.
112 | There are many ways to document code.
113 | Low level documentation might describe a single function, a code comment might describe a decision you made when writing the code,
114 | and high level documentation might describe your overall approach to a piece of analysis.
115 | 
116 | 
117 | ### Dual Running
118 | 
119 | In a RAP context, running a pipeline or process in different ways, on different systems, or by different people to verify reproducibility.
120 | 
121 | 
122 | ### Functions and methods
123 | 
124 | Functions and methods are named units of logic that can be used multiple times in our code.
125 | Methods are similar to functions, but are associated with a class.
126 | Functions and methods are crucial for creating modular code.
127 | 
128 | Functions are written to generalise a piece of logic, so that it can be used consistently in multiple places in our code.
129 | You might define a function for a particular statistical method or validation check, or to manipulate data in a certain way.
130 | 
131 | 
132 | ### Git
133 | 
134 | By far the most popular version control tool.
135 | Used to version changes to code for any programming language.
136 | 
137 | 
138 | ### Instance
139 | 
140 | When an object is created by a constructor of a class, the resulting object is called an instance of the class. The variables specific to an object are called instance variables.
141 | 
142 | 
143 | ### Integrated development environment (IDE)
144 | 
145 | An IDE is a tool that facilitates software development for a given programming language.
146 | An IDE will usually consist of a code editor, debugger and development automation tools like text auto-completion.
147 | 
148 | 
149 | ### Interactive Running
150 | 
151 | Where a script is run as expressions with immediate feedback given, such as in a terminal or command line.
152 | 
153 | 
154 | ### Logging
155 | 
156 | Automatically recording information when your analysis runs.
157 | This might include what options were used when configuring the pipeline and any decisions that the code made when running.
158 | Recording this information can help you to reproduce previous runs of the analysis and to identify the source of errors.
159 | 
160 | 
161 | ### Maintainability
162 | 
163 | Ability to easily understand, modify and repair the code.
164 | 
165 | 
166 | ### Modularity
167 | 
168 | Modular code is written in discrete, re-usable chunks.
169 | Similar code is kept close together, while pieces of code with different uses are stored separately.
170 | Breaking code down in this way makes code easier to work with, understand and review.
171 | 
172 | 
173 | ### Module
174 | 
175 | A module is a script grouping related functions and/or classes that are reused across larger programs. It may also include statements used to initialise the module when it is imported in another script.
176 | 
177 | 
178 | ### Notebooks
179 | 
180 | Notebooks are documents that contain text, executable code, and the output of that code.
181 | Notebooks are used to share the results of exploratory analysis.
182 | However, because parts of a notebook can be run out of order, they are not suitable for production of published outputs.
183 | 
184 | 
185 | ### Object-oriented programming (OOP)
186 | 
187 | Writing code that defines and uses classes.
188 | Classes are objects that contain both data (attributes) and logic (methods).
189 | Classes are often used to represent a real life entity - for example a bank account, which stores a balance and has a method for withdrawing money.
190 | 
191 | 
192 | ### Open source
193 | 
194 | A type of software license that allows users to view the software's code.
195 | Open-source code is openly available to anyone.
196 | Using open-source software and publishing our code with open-source licenses ensures that our users can easily understand and reproduce our analysis.
197 | 
198 | Open-source programming languages are free to use.
199 | We recommend using Python and R, which support the good practices outlined in this guidance.
200 | 
201 | Proprietary software, including SPSS, SAS, and Stata, is closed source.
202 | These tools are expensive to use and do not support many of the good analysis practices outlined in this guidance.
203 | Not using open source analysis tools means that your users need to purchase software to reproduce your results.
204 | 
205 | 
206 | ### Packages
207 | 
208 | Packages are file structures that contain code and documentation.
209 | These collections of code are designed to be easy to install and re-use.
210 | Packages in Python and R act as extensions, to allow us to reuse code that has already been written.
211 | 
212 | "Library" is similarly used to describe a software collection; however, packages are more specifically for distribution of code.
213 | 
214 | 
215 | ### Parameters and arguments
216 | 
217 | Parameters are the variables that a piece of code expects to be provided when it is used.
218 | They act as placeholders, to allow us to make code more generic.
219 | A function might have parameters to allow it to be run on different inputs.
220 | For example, there may be a parameter to allow the user to select which column of data a piece of logic should use.
221 | 
222 | An argument is the value that is being passed to a given parameter.
223 | 
224 | 
225 | ### Peer review
226 | 
227 | Having another developer review changes that you have made to code.
228 | This helps to assure that your code is readable and follows a sensible approach.
229 | It also helps to transfer understanding of the code between members of the team.
230 | 
231 | 
232 | ### Pipeline
233 | 
234 | An orchestrated chain of programs that will execute the next program(s) in the chain when a program completes successfully. Outputs will be stored on completion of a program and will feed in as inputs to the next program(s) in the chain.
235 | 
236 | 
237 | ### Procedural Running
238 | 
239 | Where a script is executed as statements that run one after the other.
240 | 
241 | 
242 | ### Program
243 | 
244 | A main script that can contain statements, functions, and methods. The main script may import functions and methods from other scripts. This is the script that would be run by the user/orchestration tool.
245 | 
246 | 
247 | ### Python and R
248 | 
249 | Python and R are high-level programming languages that are commonly used in government analysis.
250 | They are both free to use and open source.
251 | 
252 | 
253 | ### Readability
254 | 
255 | Ability to gain an understanding of code within a reasonable amount of time.
256 | 
257 | 
258 | ### Reproducible Analytical Pipelines (RAP)
259 | 
260 | Reproducible Analytical Pipelines (RAP) are analyses that are carried out following the good software engineering practices described by this guidance.
261 | They focus on the use of open-source analytical tools and a variety of techniques to deliver reproducible, auditable, and assured data analyses.
262 | RAP is more generally a culture of wanting to improve the quality of our analysis, by improving the quality assurance of our analysis code.
263 | 
264 | 
265 | ### Scripts
266 | 
267 | A script is a single file containing code. Code in scripts can be run interactively or procedurally.
268 | 
269 | 
270 | ### Variables
271 | 
272 | A variable is a named code object for storing data within a programming language.
273 | 
274 | 
275 | ### Version control
276 | 
277 | Saving versions of documents to keep an audit trail of changes.
278 | Version control tools typically allow you to backtrack to previous versions and to merge multiple concurrent changes from different users.
279 | 
280 | 
281 | ### Version control platforms
282 | 
283 | For example, GitHub, GitLab and BitBucket.
284 | 
285 | These services use Git to manage remote repositories.
286 | Having a remote version of your repository allows multiple users to collaborate using Git.
287 | Each platform has free and paid options and allows users to open-source their work.
288 | 
289 | Some platforms (e.g. GitLab) are open source and can be hosted internally by your organisation.
290 | 
291 | 
292 | ### Virtual environment
293 | 
294 | Virtual environments are tools for managing dependencies and isolating projects.
295 | A virtual environment is effectively a folder tree containing the programming language version and a list of dependencies.
296 | This helps keep track of dependencies and avoid conflicts between different projects.
297 | 
--------------------------------------------------------------------------------
/book/intro.md:
--------------------------------------------------------------------------------
1 | # Quality assurance of code for analysis and research
2 | 
3 | This Government Analysis Function (AF) guidance, also known as the Duck Book,
4 | is produced by the Analysis Standards and Pipelines hub of the [Office for National Statistics](https://www.ons.gov.uk).
5 | 
6 | The guidance is a living document, which is continually updated.
7 | We are extremely grateful for any feedback that you can provide on existing and future content.
8 | 
9 | 
10 | ## How to get the most out of the book
11 | 
12 | This guidance describes good software engineering practices, tailored to those working with data using code.
13 | It is designed to support you to quality assure your code and increase the reproducibility of your analyses.
14 | Software that applies these practices is referred to as a reproducible analytical pipeline (RAP).
15 | 
16 | This guidance is relevant if you are:
17 | 
18 | - writing code to automate part of your work and would like to assure that it is working as expected
19 | - developing statistical or data engineering pipelines and would like to assure that they are sustainable and reproducible
20 | - developing models and would like to assure that they are transparent and reproducible
21 | - developing data science techniques and would like your code to be useful to others
22 | - looking for a high level introduction to software engineering practices in the context of analysis and research
23 | 
24 | The good practices outlined in the book are general to many applications of programming, so may also be relevant for those outside of government.
25 | 
26 | The [RAP learning pathway](https://learninghub.ons.gov.uk/course/view.php?id=1273) provides training for many of these good practices.
27 | This book can be used to guide your learning and as a reference when applying these practices in your work.
28 | Each chapter describes the risks that each practice may help to address.
29 | As recommended by [the Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government),
30 | you should strive to apply the most appropriate and proportionate quality assurance practices given the risks associated with your work.
31 | 
32 | The principles in this book are computer language agnostic.
33 | The book does not aim to form a comprehensive learning resource, and you may often need to study further resources to implement these practices.
34 | That said, we have provided examples and useful references for **Python** and **R**, as these open source languages are commonly applied across government.
35 | 
36 | 
37 | ## About us
38 | 
39 | The Analysis Standards and Pipelines hub supports government analysis by providing guidance, consultancy and training.
40 | 
41 | You can find us on:
42 | 
43 | - Our websites: [Government Analysis Function](https://www.gov.uk/government/organisations/government-analysis-function)
44 | - Email: [Analysis Standards and Pipelines hub](mailto:ASAP@ons.gov.uk)
45 | - Slack: [Government Data Science](https://govdatascience.slack.com)
46 | - Twitter: [Government Analysis Function](https://twitter.com/gov_analysis) and [UK GSS](https://twitter.com/ukgss)
47 | 
48 | 
49 | ## Citing the book
50 | 
51 | The book uses an Open Government Licence, which can be read on the National Archives website:
52 | [National Archives](https://nationalarchives.gov.uk/doc/open-government-licence/version/3/)
53 | 
54 | In summary, you are free to use, adapt and distribute the information in this book with citation, but you must not imply endorsement,
55 | and the information here is provided 'as is' without warranty.
56 | 
57 | The following structure can be used to reference the current version of the book:
58 | 
59 | > UK Government Analytical Community. (2020). Quality assurance of code for analysis and research (version 2025.1).
60 | > Office for National Statistics, Analytical Standards and Pipelines hub:
61 | > [https://best-practice-and-impact.github.io/qa-of-code-guidance/](https://best-practice-and-impact.github.io/qa-of-code-guidance/)
62 | 
63 | 
64 | ## Accessibility statement
65 | 
66 | This accessibility statement applies to the Quality assurance of code for analysis and research guidance.
67 | Please note that this does not include third-party content that is referenced from this guidance.
68 | 
69 | The website is managed by the Quality and Improvement division of the Office for National Statistics.
70 | We would like this guidance to be accessible for as many people as possible.
71 | This means that you should be able to:
72 | 
73 | - change colours, contrast levels and fonts
74 | - zoom in up to 300% without the text spilling off the screen
75 | - navigate most of the website using just a keyboard
76 | - navigate most of the website using speech recognition software
77 | - listen to most of the website using a screen reader (including the most recent versions of JAWS, NVDA and VoiceOver)
78 | 
79 | For keyboard navigation, {kbd}`Up Arrow` and {kbd}`Down Arrow` keys can be used to scroll up and down on the current page.
80 | {kbd}`Left Arrow` and {kbd}`Right Arrow` keys can be used to move forwards and backwards through the pages of the book.
81 | Tabbed content (including code examples) can be focused using the {kbd}`Tab` key.
82 | {kbd}`Left Arrow` and {kbd}`Right Arrow` keys are then used to focus the required tab option,
83 | where {kbd}`Enter` can be used to select that option and display the associated content.
84 | 
85 | 
86 | ### Help us improve this book
87 | 
88 | We’re always looking to improve the accessibility of our guidance.
89 | If you find any problems not listed on this page or think that we’re not meeting accessibility requirements, please contact us by emailing [ASAP@ons.gov.uk](mailto:ASAP@ons.gov.uk).
90 | Please also get in touch if you are unable to access any part of this guidance, or require the content in a different format.
91 | 
92 | We will consider your request and aim to get back to you within five working days.
93 | 94 | 95 | ### Enforcement procedure 96 | 97 | The Equality and Human Rights Commission (EHRC) is responsible for enforcing the 98 | [Public Sector Bodies (Websites and Mobile Applications) (No. 2) Accessibility Regulations 2018 (the ‘accessibility regulations’)](https://www.legislation.gov.uk/uksi/2018/952/made). 99 | If you’re not happy with how we respond to your complaint, you should [contact the Equality Advisory and Support Service (EASS)](https://www.equalityadvisoryservice.com/). 100 | -------------------------------------------------------------------------------- /book/learning.md: -------------------------------------------------------------------------------- 1 | # Learning resources 2 | 3 | This section links to training and self-led learning resources, which relate to sections of the guidance. 4 | 5 | Many of these learning resources point to the UK Statistics Authority Learning Hub. 6 | All government analysts can request an account for the hub via a [form](https://forms.office.com/pages/responsepage.aspx?id=vweIB4LOiEa84A2BFoTcRlPdx8djn4tCqrCyavGcBe9UNTY3RUpEQkkwVDhITDAxUk8wTUhBQ01MNS4u&route=shorturl&wdLOR=c0BA596B0-B641-4BF9-BCDB-8C188FCB872C). 7 | 8 | Please note that learning resources from non-government training providers (i.e. not accessed through the Hub) may not always follow best practice. 9 | However, exposure to a range of applied examples will still benefit your learning. 10 | You should compare and contrast your learning to the good practices outlined in the guidance. 11 | 12 | 13 | ## Core programming 14 | 15 | The [RAP learning pathway](https://learninghub.ons.gov.uk/course/view.php?id=1273) covers training for most of the good practices outlined in this book, 16 | with a focus on Python and R. 17 | Other courses below can be used to supplement this learning. 
18 | 
19 | 
20 | ### Python
21 | 
22 | * [Introduction to object-oriented programming (OOP) in Python](https://learninghub.ons.gov.uk/enrol/index.php?id=1199)
23 | * [The Official Python Getting Started Guide](https://www.python.org/about/gettingstarted/)
24 | * [Learn Python](https://www.learnpython.org/), supported by DataCamp
25 | 
26 | 
27 | ### R
28 | 
29 | * [R Basics on Udemy](https://www.udemy.com/course/r-basics/)
30 | 
31 | 
32 | ### General
33 | 
34 | * [Software Carpentry](https://software-carpentry.org/lessons/)
35 | * [Data Carpentry](https://datacarpentry.org/lessons/)
36 | 
37 | 
38 | ## Data manipulation
39 | 
40 | ### Python
41 | 
42 | * [Dataframes, manipulation and cleaning in Python](https://learninghub.ons.gov.uk/enrol/index.php?id=1156)
43 | * [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)
44 | * [Python for Data Analysis](https://github.com/wesm/pydata-book)
45 | 
46 | 
47 | ### R
48 | 
49 | * [Dataframes, manipulation and cleaning in R](https://learninghub.ons.gov.uk/enrol/index.php?id=1158)
50 | * [R for Data Science](https://r4ds.had.co.nz/)
51 | * [Advanced R](https://adv-r.hadley.nz/index.html)
52 | 
53 | 
54 | ### SQL
55 | 
56 | * [Foundations of SQL](https://learninghub.ons.gov.uk/enrol/index.php?id=1162)
57 | * [w3schools.com SQL Tutorial](https://www.w3schools.com/sql/default.asp)
58 | * [Python SQL libraries](https://realpython.com/python-sql-libraries/)
59 | * [SQL databases and R](http://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html)
60 | 
61 | 
62 | ## Command line tools
63 | 
64 | ### General
65 | 
66 | * [Windows and UNIX Command Line Basics](https://learninghub.ons.gov.uk/enrol/index.php?id=1166)
67 | * [Learn enough (UNIX) command line to be dangerous](https://www.learnenough.com/command-line-tutorial/basics)
68 | * [The UNIX workbench](https://seankross.com/the-unix-workbench/)
69 | 
70 | 
71 | (git-learning)=
72 | ### Git
73 | 
74 | * [Intro to Git](https://learninghub.ons.gov.uk/enrol/index.php?id=1165)
75 | * [Introduction to continuous integration](https://learninghub.ons.gov.uk/course/view.php?id=1200)
76 | * [The Pro Git book](https://git-scm.com/book/en/v2)
77 | * [Software Carpentry: Version Control with Git](https://swcarpentry.github.io/git-novice/) - an applied project
78 | * Interactive online training with [Katacoda](https://www.katacoda.com/courses/git) or [Learn Git Branching](https://learngitbranching.js.org/)
79 | * [GitHub's Git Cheatsheet](https://education.github.com/git-cheat-sheet-education.pdf)
80 | * [GitHub Git Handbook](https://guides.github.com/introduction/git-handbook/)
81 | * [Atlassian's Learn Git](https://www.atlassian.com/git)
82 | 
83 | 
84 | ## Testing
85 | 
86 | * [Introduction to Unit Testing in Python and R](https://learninghub.ons.gov.uk/course/view.php?id=1171)
87 | 
88 | 
89 | ## Making software accessible
90 | 
91 | * [A11Y Project software, books and tools](https://www.a11yproject.com/resources/)
92 | 
93 | 
94 | ## Reproducible Analytical Pipelines
95 | 
96 | * [Analysis Function RAP overview](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/)
97 | * [The cross government RAP Champion Network](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/reproducible-analytical-pipeline-rap-champions/)
98 | * [ONS Data Science Campus RAP learning materials](https://github.com/datasciencecampus/gov-uk-rap-materials)
99 | 
100 | 
101 | ## Learning from mistakes
102 | 
103 | Finally, it is often good to see what not to do and learn from the mistakes of others.
104 | The following is a compilation of anti-patterns in Python, but it is worth a read whatever your language, as many of the same mistakes apply to R and other languages.
105 | 
106 | * [The Little Book of Python Anti-Patterns](https://docs.quantifiedcode.com/python-anti-patterns/index.html)
107 | 
--------------------------------------------------------------------------------
/book/managers_guide.md:
--------------------------------------------------------------------------------
1 | # Managing analytical code development
2 | 
3 | ```{note}
4 | This section is a draft, while we make sure that it meets user needs.
5 | 
6 | It would benefit from case studies that demonstrate how to decide what level of quality assurance a piece of analysis requires.
7 | 
8 | Please get in touch with feedback or case studies to support the guidance
9 | [by creating a GitHub Issue](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues)
10 | or by [emailing us](mailto:ASAP@ons.gov.uk).
11 | ```
12 | 
13 | This section of the guidance targets those who manage data analysis, science and engineering work in government
14 | or those acting as product owners for analytical products.
15 | 
16 | It aims to help you support your team to apply the good quality assurance practices described in the wider
17 | [Quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html).
18 | We refer to processes that apply these good practices as reproducible analytical pipelines (RAP).
19 | 
20 | Before applying this guidance, you should have a basic awareness of the tools and
21 | techniques used to do quality analysis as code - the [introduction to RAP course](https://learninghub.ons.gov.uk/course/view.php?id=1236) outlines these.
22 | You should also be aware of any specific tools and platforms that are used in your department.
23 | 
24 | [The Government Service Standard](https://www.gov.uk/service-manual/service-standard) outlines best practices for creating public services, which include analysis.
25 | You should use this when designing and managing the development of analysis as code.
26 | 
27 | 
28 | ## Apply quality assurance proportional to risk
29 | 
30 | As [the Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government) notes,
31 | the quality assurance we apply to our analysis should be proportionate to the complexity and risk that the analysis carries.
32 | 
33 | When managing analytical work, you should not need an in-depth understanding of the analysis code to trust that it is working correctly.
34 | However, you should be confident that the team who built the analysis process has taken an appropriate approach given the user need,
35 | and that proportionate quality assurance is being applied to the development and running of the analysis.
36 | 
37 | You should work with your team to decide which quality assurance practices are necessary for each piece of analysis.
38 | You might find our [](checklists.md) useful as templates for defining the target level of assurance.
39 | When possible, you should define the target assurance level before starting the analysis.
40 | 
41 | 
42 | ```{important}
43 | 
44 | While quality assurance must be applied relative to the risk and complexity of the analysis, you must consider the skills of your team.
45 | It will take time to learn to apply the necessary good practices, so you should support their gradual development of these skills.
46 | 
47 | [The RAP learning pathway](https://learninghub.ons.gov.uk/mod/page/view.php?id=8699) provides training in good practices.
48 | You can use the wider [Quality assurance of code for analysis and research guidance](https://best-practice-and-impact.github.io/qa-of-code-guidance/intro.html)
49 | as a reference to apply these practices to analysis.
50 | You should identify where each analyst is along the pathway - they should look to develop the next skill in the pathway and apply this,
51 | rather than attempting to adopt them all at once.
52 | 
53 | Note that it is important to maintain technical skills in the analysis team for sustainability, to ensure that the analysis can be understood, updated, and maintained.
54 | ```
55 | 
56 | Despite the initial cost of developing technical skills,
57 | [evidence shows that applying good practices increases the efficiency of code development and maintainability of the code](https://www.devops-research.com/research.html).
58 | There are a number of case studies that describe how
59 | [good quality assurance practices have improved government analysis](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/rap-case-studies/).
60 | 
61 | Not following good practices creates [technical debt](https://en.wikipedia.org/wiki/Technical_debt), which slows down further development of the analysis.
62 | This can be necessary for delivering to short deadlines, but you should set aside time to address this for continued development of the analysis.
63 | 
64 | Where quality assurance of the code doesn't meet your target level of assurance,
65 | for example where there is limited time or skill,
66 | you should supplement this with more in-depth assurance of outputs.
67 | This might include dual running the analysis with an independent system and consistency checks across the output data.
68 | 
69 | The remaining parts of this section provide questions that aim to help you assess the quality assurance practices that your team are applying in their analysis.
70 | 
71 | 
72 | ## Design quality analysis
73 | 
74 | These questions aim to help you assess the design decisions at the beginning of the analysis.
75 | 
76 | 
77 | ### Why have you chosen to do the analysis this way?
78 | 
79 | Understanding user needs ensures that the analysis is valuable.
80 | 
81 | * You should have a plan to consult users at the beginning and throughout the development process, to ensure that you meet their needs.
82 | * The methodology and data should be suitable for the question being asked.
83 | * More than one person should develop the analysis to allow pair programming, peer review, and mentoring. This increases the sustainability of analysis.
84 | * You should use open-source analysis tools to carry out the analysis, wherever possible.
85 | Your team should be able to explain why they have chosen the analysis tools and why they are confident that they are fit for purpose.
86 | 
87 | 
88 | ### How are you storing input data?
89 | 
90 | Versioning input data ensures that we can reproduce our analysis.
91 | 
92 | * You should version input data so that you can reproduce analysis outputs.
93 | * You should store data in an open format (e.g. CSV or ODS), not formats that are restricted to proprietary software like SAS and Stata.
94 | * You should store large or complex datasets in a database.
95 | * You should monitor the quality of data, following [the government data quality framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework).
96 | 
97 | 
98 | ### How will you keep track of changes to the code and why they were made?
99 | 
100 | [Version control](version_control.md) of changes provides an audit trail.
101 | 
102 | * You should version control the code, documentation, and peer reviews. Git software is most commonly used for this.
103 | * You should develop the code on an open source code platform, like GitHub. This transparency increases your users' trust in the analysis and promotes collaboration and re-use.
104 | * You should have a clear record of every change and who made it.
105 | * Each change should be linked to a reason, for example, a new requirement or an issue in the existing code.
106 | * You should store reviews of changes with the version of the analysis that was reviewed.
107 | 
108 | 
109 | ## Quality assure throughout development
110 | 
111 | These questions can be used throughout the development of the analysis,
112 | to assess how the team are progressing towards the target quality assurance practices that you have agreed.
113 | 
114 | 
115 | ### How is your code structured?
116 | 
117 | [Modular code](modular_code.md) makes it easier to understand, update and reuse the code.
118 | 
119 | * You should write logic as functions, so that it can be reused consistently and tested.
120 | * You should group related functions together in the same file, so that it is easy to find them.
121 | * You should clearly separate logic with different responsibilities (e.g., reading data versus transforming data).
122 | * When code can be reused for other analyses, you should store and share this as a package.
123 | 
124 | 
125 | ### How easy is it to adjust the way that the analysis runs?
126 | 
127 | [Configuration files](configuration.md) allow you to change the way the code runs without editing the code itself.
128 | 
129 | * You should store parameters for the code that often change between runs in separate configuration files.
130 | * Things that often change in analysis code include input and output file paths, reference dates and model parameters.
131 | 
132 | 
133 | ### How is the analysis documented?
134 | 
135 | [Code documentation](code_documentation.md) is essential for business continuity and sustainability.
136 | 
137 | * You should document every function in the code, so it is clear what each function is supposed to do.
138 | * Function documentation should include what goes in and what comes out of each function.
139 | * Where others will run or re-use the code, documentation should include usage examples and test data.
140 | 
141 | 
142 | ### What are the dependencies?
143 | 
144 | [Project documentation](project_documentation.md) ensures that others can reproduce our analysis.
145 | 
146 | * You should provide user instructions for running the analysis.
147 | * You should document software and package versions with the code.
148 | Typically, you record package versions using `setup.py` and `requirements.txt` files (Python) or a `DESCRIPTION` file (R) - a minimal example is sketched after this list.
149 | * Code should not be dependent on a specific computer to run. Running it on a colleague's system can help to check this.
150 | * When you run the same analysis multiple times or on different systems, it should give reproducible outcomes.
151 | * Container systems, like Docker, help to create reproducible environments to run code.
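As a minimal illustration, a pinned Python `requirements.txt` might contain entries like the following (the packages and versions are examples only):

```text
pandas==2.1.4
numpy==1.26.3
pytest==7.4.4
```

Recording exact versions like this means that `pip install -r requirements.txt` should recreate a closely matching environment on another machine.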
152 | 
153 | 
154 | ### What assumptions does the analysis make?
155 | 
156 | Transparency of our analysis increases trust.
157 | 
158 | * You should record assumptions and caveats of the analysis close to the code.
159 | * You must communicate these to users when releasing results from the analysis.
160 | 
161 | 
162 | ### How has peer review been done?
163 | 
164 | [Peer review](peer_review.md) increases the quality of our analysis and transfers skills and knowledge.
165 | 
166 | * Technical colleagues should conduct internal peer reviews of each change to the code.
167 | This identifies issues early on and makes the review process more manageable than reviewing only the final analysis.
168 | * Peer review should follow a standard procedure, so that reviews are consistent. Reviewers should check that each change follows the agreed good practices.
169 | * You should evidence that peer reviews are acted on and how. When issues or concerns are raised, they should be addressed before the analysis is used.
170 | * If the product is high risk, you should arrange and conduct external peer review.
171 | 
172 | 
173 | ### How have you tested the code?
174 | 
175 | [Testing](testing_code.md) assures us that the code is working correctly.
176 | 
177 | * Testing means making sure that the code produces the right outputs for realistic example input data.
178 | * You should apply automated 'unit' testing to each function to ensure that code continues to work after future changes.
179 | * You should test each function and the end-to-end analysis using minimal, realistic data.
180 | * Your testing should be more extensive on the most important or complex parts of the code. However, ideally you would test every single function.
181 | * Tests should account for realistic cases, which might include missing data, zeroes, infinities, negative numbers, wrong data types, and invalid inputs.
182 | * Reviewers should sign off that there is enough testing to assure that the code is working as expected.
183 | * Each time you identify an error in the code, you should add a test to assure that the error does not reoccur.
184 | 
185 | 
186 | ### What happens when the analysis fails to run?
187 | 
188 | Recording and reporting errors provides an audit trail and makes it easier for users to correctly use the code.
189 | 
190 | * When code fails to run, it should provide informative error messages to describe the problem.
191 | * If another team will operate the code, error messages should help users to correct the problem.
192 | * Errors or warnings might also be raised when data validation checks fail.
193 | * When code is run in production, errors should be recorded or logged.
194 | 
195 | 
196 | ### How does the analysis result differ from the previous analysis?
197 | 
198 | Comparing analysis to previous results can help to ensure reproducibility and identify errors.
199 | 
200 | * When repeating analysis over time, you should compare results between analyses.
201 | Large differences in the outcome of the results may indicate an issue with the analysis process.
202 | * When developing code to replace legacy analysis, you may wish to run the new method in parallel with the legacy process. This will allow you to identify differences and improvements.
203 | 
204 | 
205 | ### How can we further improve the quality of the analysis?
206 | 
207 | Code quality improves over time, as your team learn more about good practices.
208 | 
209 | * The team should be aiming to meet the agreed assurance level, but should also consider which practices could be applied next to improve the code beyond this.
210 | * You should review training needs in your team and allow time for continuous personal development of these practices.
211 | 
--------------------------------------------------------------------------------
/book/peer_review.md:
--------------------------------------------------------------------------------
1 | # Peer review
2 | 
3 | Peer review of code is a quality assurance activity where a developer, other than the code's author, views and tests the usage of a piece of code.
4 | 
5 | Peer review allows for a fresh pair of eyes to take a look at your work.
6 | It helps to assure that you have taken an appropriate approach to your analysis and may highlight errors in the analysis process.
7 | Constructive review feedback helps you to improve your code quality and provides confidence in your work.
8 | It acts to assure that your analysis is fit for purpose.
9 | 
10 | ```{epigraph}
11 | For analysis to be used to inform a decision it must be possible to assess its utility, reliability,
12 | and the degree of validation and verification to which it has been subjected.
13 | 
14 | -- [The Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government)
15 | ```
16 | 
17 | [The Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government)
18 | tells us that we should proportionately quality assure our analysis, depending on the complexity and business risk of the analysis.
19 | This means that you may require both internal and external peer review to adequately assure your analysis.
20 | We recommend external review if your analysis uses novel or complex techniques,
21 | if comparison with other analyses cannot be used to challenge your results, or if the analysis is business critical.
22 | 
23 | ```{epigraph}
24 | Continuous challenge and improvement is essential to ensure that the people we serve – ministers and, of course, the public – have trust in our analysis.
25 | 
26 | -- Nick Macpherson, former Permanent Secretary to the Treasury
27 | ```
28 | 
29 | 
30 | ## Focus reviews on code quality
31 | 
32 | Our [](checklists.md) provide an extensive list of good practices that reviewers can look for.
33 | Many of these criteria consider the whole project.
34 | You should tailor your review to the quality assurance criteria that the project is trying to meet and the scale of the review.
35 | 
36 | Reviewing is centred around conversations - asking yourself and the reviewer questions.
37 | While reviewing changes to analytical code, example questions might be:
38 | 
39 | 
40 | ### Can I easily understand what the code does?
41 | 
42 | In more depth:
43 | 
44 | * Is the code sufficiently documented for me to understand it?
45 | * Is there duplication in the code that could be simplified by refactoring into functions and classes?
46 | * Are functions and class methods simple, using few parameters?
47 | 
48 | As we discussed in [](readable_code.md), good quality, modular code is easier to read, understand, and maintain.
49 | Peer review improves the quality of our code through the reviewer's constructive challenges.
50 | You might do this as a reviewer by suggesting alternative ways to represent the analysis or
51 | by asking about decisions that have been made in the approach to the analysis.
52 | 
53 | Another benefit, particularly of internal review, is knowledge transfer.
54 | Both the reviewer and reviewee are exposed to new ideas.
55 | The reviewer must understand what the code is doing to validate that it works as expected.
56 | This may provide your team members with the understanding required to use and maintain your code in the future.
57 | 
58 | 
59 | ### Is the required functionality tested sufficiently?
60 | 
61 | If we do not test each part of the code, then we can't be sure that it works as expected.
62 | As a reviewer, you should ask whether the amount of testing is proportionate given the risk to the analysis if the code does not work.
63 | 
64 | 
65 | ### How easy will it be to alter this code when requirements change?
66 | 
67 | In more depth:
68 | 
69 | * Are high level parameters kept in dedicated configuration files?
70 | * Or would somebody need to work their way through the code with lots of manual edits to reconfigure for a new run? This is much more risky.
71 | 
72 | Most analysis stems from some form of customer engagement.
73 | Throughout design, implementation and review of analysis, we must continue to assess whether our analysis is fit for purpose:
74 | does it meet the needs of the customer?
75 | Document the scope of your analysis and any requirements to make this assessment as easy as possible.
76 | Support the auditability of your analysis with additional documentation, including assumption logs,
77 | technical reports describing the analysis and documentation on any verification or validation that has already been carried out.
78 | 
79 | 
80 | ### Can I generate the same outputs that the analysis claims to produce?
81 | 
82 | In more depth:
83 | 
84 | * Have dependencies been sufficiently documented?
85 | * Is the code version, input data version and configuration recorded?
86 | 
87 | A key aspect of review is checking that you get the same results when running your code with the same input data.
88 | Assurance that analysis can be reproduced increases the trust in the results.
89 | 
90 | Note that each of these example questions focuses on important quality assurance practices in the code, rather than minor issues like code layout and style.
91 | 
92 | 
93 | ## Give practical feedback
94 | 
95 | Provide practical and constructive feedback.
96 | For example, you should suggest an improvement or alternative that the developer may consider and learn from.
97 | You should avoid making feedback personal, although it may be necessary to highlight specific examples.
98 | 
99 | The CEDAR feedback model can be a useful framework for structuring review comments.
100 | This model breaks review down into five sections:
101 | 
102 | 1. Context - describe the issue and the potential impact.
103 | 2. Examples - give specific examples of when and where the issue has been present.
104 | 3. Diagnosis - use the example to discuss why this approach was taken, what could have been done differently and why the alternatives could be an improvement.
105 | 4. Actions - ask the person receiving feedback to suggest actions that they could follow to avoid this issue in future.
106 | 5. Review - revisit the discussion to look for progress following on from the feedback.
107 | 
108 | This approach has been designed from a coaching or mentoring perspective, and can work well in verbal discussion or when giving written feedback.
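As an illustration, a written review comment following this structure might look like the sketch below; the function and issue are invented for the example:

```{code-block} md
**Context**: `calculate_rates()` reads the reference date from a global
variable, which makes the function hard to test and reuse.

**Examples**: The changes in this request call `calculate_rates()` three
times, each relying on `REF_DATE` being set elsewhere in the script.

**Diagnosis**: Why was a global used here? Passing the date in as a parameter
would make the dependency explicit and allow unit testing with fixed dates.

**Actions**: Could you suggest how we might remove the global? One option is
adding a `reference_date` parameter with no default value.

**Review**: Let's check at the next review that new functions take their
inputs as parameters.
```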
109 | 110 | 111 | ## Document review feedback and outcomes 112 | 113 | When you identify issues with code functionality or quality, you should suggest improvements to solve these issues. 114 | The code developer may respond to your comments to justify why they have taken a certain approach; otherwise, they should act on your feedback before using the analysis. 115 | 116 | You may need to re-review changes that have resulted from your initial review, to confirm that your feedback has been actioned appropriately. 117 | 118 | ```{important} 119 | Document what you have considered in your review and how the developer has responded to your feedback. 120 | You should keep this documentation close to the relevant version of the code, so that others can see what has been reviewed. 121 | The easiest way to manage this is by using the Merge Request or Pull Request feature on your version control platform to conduct the review. 122 | See [](version_control.md) for more information. 123 | ``` 124 | 125 | 126 | ## Give consistent feedback 127 | 128 | You might formalise your review using a template, to ensure that review is consistent within a project. 129 | Templates are useful for setting criteria to review against. 130 | You should tailor any template to reflect the scope of your review. 131 | For example, small regular reviews may focus on smaller aspects of the analysis compared to a large project-wide review. 132 | The general example below is written in Markdown, so that it can be used in Merge or Pull Requests on version control platforms: 133 | 134 | 135 | ```{code-block} md 136 | ## Code review 137 | 138 | #### Documentation 139 | 140 | Any new code includes all the following forms of documentation: 141 | 142 | - [ ] **Function Documentation** as docstrings within the function definition. 143 | - [ ] **Examples** demonstrating major functionality, which run successfully locally. 144 | 145 | #### Functionality 146 | 147 | - [ ] **Installation**: Installation or build of the code succeeds. 148 | - [ ] **Functionality**: Any functional claims of new code have been confirmed. 149 | - [ ] **Automated tests**: Unit tests cover essential functions for a reasonable range 150 | of inputs and conditions. All tests pass on your local machine. 151 | - [ ] **Packaging guidelines**: New code conforms to the project contribution 152 | guidelines. 153 | --- 154 | 155 | ### Review comments 156 | 157 | *Insert detailed comments here!* 158 | 159 | These might include, but are not limited to: 160 | 161 | - bugs that need fixing (does it work as expected? and does it work with other code that it is likely to interact with?) 162 | - alternative methods (could it be written more efficiently or with more clarity?) 163 | - documentation improvements (does the documentation reflect how the code actually works?) 164 | - additional tests that should be implemented (do the tests effectively assure that it works correctly?) 165 | - code style improvements (could the code be written more clearly?) 166 | 167 | Tailor your suggestions to the code that you are reviewing. 168 | Be critical and clear, but not mean. Ask questions and set actions. 169 | 170 | ``` 171 | 172 | Carry out internal review regularly within the development team. 173 | Reviewing code written by those with more and less experience than you is beneficial for both reviewer and developer. 174 | Asking similar questions from both perspectives helps the reviewer to gain a good understanding of the approach and the decisions behind the analysis.
175 | 176 | 177 | ## Give timely feedback 178 | 179 | Provide feedback in good time, so that the review process does not hold up development of the code 180 | and so that issues can be addressed before more code is written using the same practices. 181 | 182 | We strongly recommend pair programming for code review, as it is often the most timely and practical method. 183 | However, you may find a separate review process necessary when multiple developers are not available to work at the same time. 184 | 185 | 186 | ### Review through pair programming 187 | 188 | > Two heads are better than one. 189 | 190 | Review doesn't have to be an arduous standalone task. 191 | Pair programming combines the code writing and the review process into a single step. 192 | Here, two or three developers work together to write a single piece of code. 193 | Each developer takes turns to actively author parts of the code, while the others provide real-time feedback on the code being written. 194 | 195 | This practice encourages developers to consider why they are writing code in a particular way and to vocalise this ("programming out loud"). 196 | Additionally, it gives reviewers a chance to suggest improvements and question the author's approach as they write the code. 197 | Working in this way can be more efficient than reviewing code separately - you identify issues sooner, so they are easier to fix. 198 | Despite the upfront cost of two individuals writing the code, the resulting code is often higher quality and contains fewer bugs. 199 | 200 | The rotational aspect of pair programming ensures that all team members gain experience from both the author and reviewer perspectives. 201 | From both angles, you'll learn new programming and communication techniques. 202 | Additionally, sharing knowledge of how the code works across the team prevents putting too much risk on individuals. 203 | 204 | Developers working in pairs can approve changes to code as it is written. However, you should still document 205 | key discussions from pair programming sessions to demonstrate which aspects of the code have been reviewed and discussed. 206 | 207 | This blog post from the Government Digital Service provides 208 | [more detailed steps to apply pair programming](https://gds.blog.gov.uk/2018/02/06/how-to-pair-program-effectively-in-6-steps/). 209 | 210 | 211 | ### Review separately when necessary 212 | 213 | A separate review involves sharing your code with a reviewer and receiving written feedback. 214 | We describe this as separate because development and review happen apart, and the code author does not need to be present for the review. 215 | This type of review is an iterative process, where the reviewer may make additional suggestions until they are satisfied with the code changes. 216 | 217 | This form of review works best when changes to the code are small and frequent. 218 | Requesting review of small but regular changes reduces the burden on reviewers, relative to a large review of a complete project. 219 | Similarly to pair programming, reviewing small changes to code allows you to catch issues sooner. 220 | 221 | ```{important} 222 | If a project is only reviewed when all of the code has been written, this significantly reduces the benefit of review. 223 | 224 | This creates a much larger burden on the reviewer. 225 | Additionally, any issues that are identified may take a lot of time to fix.
226 | A reviewer might highlight that certain quality assurance practices have not been used - 227 | for example, there has not been enough documentation or automated testing in the project. 228 | Adding documentation and testing for the whole project would take substantial effort. 229 | If this had been identified earlier, the improved practices could have been applied as the remaining code was developed. 230 | ``` 231 | 232 | When you must carry out a review of larger or complete pieces of work, consider reviewing different aspects of the code in separate sessions. 233 | For example, reviewing documentation in one session and functionality in the next. 234 | 235 | The thought of someone else reviewing your code in this way encourages good practices from the outset: 236 | 237 | * Clear code and documentation - so others with no experience can use and test your code. 238 | * Usable dependency management - so others can run your code in their own environment. 239 | 240 | Separate review is aided by version control platforms' features. See [](version_control.md) for more information. 241 | 242 | 243 | #### Case study - rOpenSci review 244 | 245 | Here we discuss a [review example from rOpenSci](https://ropensci.org/), 246 | a community-led initiative that curates open source statistical R packages. 247 | rOpenSci apply a rigorous peer review process to assure the quality of packages before including them in their collection. 248 | This peer review process is entirely remote and is performed in the open, via GitHub pull requests. 249 | 250 | In this example, from colleagues at Public Health England, [the `fingertipsR` package is reviewed](https://github.com/ropensci/software-review/issues/168). 251 | The initial comment describes the package that is being submitted and includes a check against a list of minimum requirements. 252 | The [`goodpractice` R package](http://mangothecat.github.io/goodpractice/) is used to check that good R packaging practices have been followed. 253 | [Continuous integration](https://www.atlassian.com/continuous-delivery/continuous-integration#:~:text=Continuous%20integration%20(CI)%20is%20the,builds%20and%20tests%20then%20run.) 254 | is commonly used to carry out automated checks on code repositories. 255 | The reports from these checks can save reviewers time, by providing indicators of things like code complexity and test coverage. 256 | 257 | Two external reviewers conduct reviews before the package is accepted - these reviews include checking common aspects of code packages, 258 | like documentation, examples, and automated testing. 259 | Perhaps the most informative part of these reviews is the detailed bespoke comments. 260 | Here the reviewers highlight problems, ask questions to clarify aspects of the package design, and suggest improvements to the implementation of the code 261 | (with examples). 262 | 263 | Following the reviews, the authors wrote comments describing how the changes requested by the reviewers were addressed. 264 | Finally, a sign-off confirmed that the reviewers were satisfied with the package. 265 | 266 | Although this review looked at an entire, mature package, you can apply parts of this review process to smaller pieces of code as required. 267 | -------------------------------------------------------------------------------- /book/principles.md: 1 | # Principles 2 | 3 | When we do analysis, it must be fit for purpose. 4 | If it isn't, we risk misinforming decisions.
5 | Bad analysis can result in harm or misallocation of public funds. 6 | So we must take the right steps to ensure high-quality analysis. 7 | 8 | This book recognises three founding principles of good analysis, each supported by the one before it. 9 | Programming in analysis makes each of these principles easier to fulfil in most cases. 10 | 11 | ```{figure} ./_static/repro_stack.png 12 | --- 13 | width: 50% 14 | name: repro_stack 15 | alt: Founding principles of good analysis. 16 | --- 17 | Founding principles of good analysis 18 | ``` 19 | 20 | Reproducibility guarantees that we have done what we are claiming to have done, and that others can easily replicate our work. 21 | Auditability means that we know why we chose our analysis, and who is responsible for each part of it - including assurance. 22 | Assurance improves the average quality and includes the communication of that quality to users. 23 | 24 | ```{admonition} Key strategies 25 | :class: admonition-strategies 26 | 27 | Government guidance is available to help you when developing analysis. 28 | We recommend: 29 | 30 | 1. The [Analysis Function reproducible analytical pipelines (RAP) strategy](https://analysisfunction.civilservice.gov.uk/policy-store/reproducible-analytical-pipelines-strategy/), 31 | when considering how your department should adopt good practices. 32 | 2. The [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government) 33 | to understand guidance around assurance. 34 | 3. The [Analysis Function Quality Strategy](https://analysisfunction.civilservice.gov.uk/policy-store/government-statistical-service-gss-quality-strategy/) 35 | when thinking about the quality of statistics. 36 | 4. The [DCMS Data Ethics Framework](https://www.gov.uk/government/publications/data-ethics-framework/data-ethics-framework) 37 | when approaching any analysis. 38 | 5. The [Communicating quality, uncertainty and change guidance](https://analysisfunction.civilservice.gov.uk/policy-store/communicating-quality-uncertainty-and-change/) 39 | whenever you must develop user-facing products. 40 | 41 | Each of these pieces of guidance advocates reproducibility as a core tenet of quality analysis. 42 | ``` 43 | 44 | 45 | ## Reproducible 46 | 47 | Reproducibility is the only thing that you can guarantee in your analysis. 48 | It is the first pillar of good analysis. 49 | If you can't prove that you can run the same analysis, with the same data, and obtain the same results, then you are not producing valuable analysis. 50 | The additional assurances of peer review, rigorous testing, 51 | and validity are secondary to being able to reproduce any analysis that you carry out in a proportionate amount of time. 52 | 53 | A reproducible and transparent production process ensures that anyone can follow your steps and understand your results. 54 | This transparency eases reuse of your methods and results. 55 | Easy reproducibility helps your colleagues test and validate what you have done. 56 | Guaranteeing reproducibility means that users and colleagues can focus on verifying that the implementation is correct and that the research is useful for its intended purpose. 57 | 58 | Reproducibility relies on effective documentation. 59 | Good documentation should show how your methodology and your implementation map to each other. 60 | It should also allow users and other researchers to reuse and adapt your analysis.
61 | 62 | Reproducible analysis supports the requirements of the [Code of Practice for Statistics](https://www.statisticsauthority.gov.uk/code-of-practice/) 63 | around quality assurance and transparency (auditability). 64 | We share the code we used to produce our outputs wherever possible, along with enough data to allow for proper testing. 65 | 66 | 67 | ## Auditable 68 | 69 | If decisions are made based on your analysis, then you must make sure that your analysis and the evidence that you provide are available for scrutiny and audit. 70 | Auditable analysis is about being able to, at any point, answer: 71 | 72 | * Who made each decision? 73 | * When was this decision made? 74 | * What evidence was this decision based on? 75 | 76 | Answering these questions gives decision makers and users greater trust in your work. 77 | They know the story of your analysis and why you made certain analytical choices. 78 | They know who is responsible for each part of the analysis, including the assurance. 79 | They know exactly what changes have been made at any point. 80 | 81 | In a reproducible workflow, you must bring together the code and the data that you used to generate your results. 82 | Ideally, you would publish these alongside your reports, with a record of analytical choices made and the responsible owners of those choices. 83 | This increases the trustworthiness of your work. 84 | More eyes examining your work can point out challenges or flaws that can help you to improve. 85 | You can be fully open about the decisions you made when you generated your outputs, so that other analysts can follow what you did and re-create them. 86 | By making your analysis reproducible, you make it easier for others to quality assure, assess, and critique. 87 | 88 | 89 | ## Assured 90 | 91 | Quality Assurance (QA) is vital for good quality analysis. 92 | If decisions are being made based on analysis, then this analysis must be held to high standards. 93 | This is true for analysis carried out using any medium, including code. 94 | 95 | However, some of the analysis that we do in government doesn't bear on decisions at that level. 96 | We don't want to overburden analysts with QA procedures when they are not required. 97 | In government we advocate **proportionality** - the right quality assurance procedures for the right analysis. 98 | You can proportionately assure analysis through peer review and defensive design. 99 | 100 | We suggest following your department's guidance on what proportionate quality assurance looks like. 101 | Most departments derive their guidance from the [Aqua book](https://www.gov.uk/government/publications/the-aqua-book-guidance-on-producing-quality-analysis-for-government). 102 | Assurance is best demonstrated through peer review. 103 | Peer reviewers must be able to understand your analytical choices and be able to reproduce your conclusions. 104 | Consider dual running, particularly for high-risk analysis. 105 | 106 | Publish guarantees of quality assurance alongside any report, to be taken into consideration by decision makers. 107 | 108 | 109 | ## Reproducible analytical pipelines 110 | 111 | Producing analysis, such as official statistics, can be time-consuming and painstaking. 112 | We need to make sure that our outputs are both accurate and timely. 113 | We aim to develop effective and efficient analytical workflows that are repeatable and sustainable over time. 114 | These workflows should follow the principles of reproducible analysis.
115 | We call these [Reproducible Analytical Pipelines (RAP)](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/). 116 | 117 | Reproducible analysis is still not widely practised across government. 118 | Many analysts use proprietary (paid-for) analytical tools like SAS or SPSS in combination with programs like Excel, Word, or Acrobat to create statistical products. 119 | The processes for creating statistics in this way are usually manual or semi-manual. 120 | Colleagues then typically repeat parts of the process manually to quality assure the outputs. 121 | 122 | This way of working is time-consuming and can be frustrating, especially where the manual steps are difficult to replicate quickly. 123 | Processes like this are also prone to error, because the input data and the outputs are not connected directly, only through the analyst’s manual intervention. 124 | 125 | More recently, the tools and techniques available to analysts have evolved. 126 | Open-source tools like [Python](https://www.python.org/) and [R](https://www.r-project.org/) have become available. 127 | Coupled with version control and software management platforms like [Git](https://git-scm.com/) and Git-services, 128 | these tools have made it possible to develop automatic, streamlined processes, accompanied by a full audit trail. 129 | 130 | RAP was [first piloted in the Government Statistical Service in 2017](https://dataingovernment.blog.gov.uk/2017/03/27/reproducible-analytical-pipeline/) 131 | by analysts in the Department for Digital, Culture, Media and Sport (DCMS) and the Department for Education (DfE). 132 | They collaborated with data scientists from the Government Digital Service (GDS) to automate the production of statistical bulletins. 133 | 134 | To support the adoption of RAP across government, there is a network of [RAP champions](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/reproducible-analytical-pipeline-rap-champions/). 135 | RAP champions are responsible for promoting reproducible analysis through the use of reproducible analytical pipelines, 136 | and supporting others who want to develop RAP in their own departments. 137 | -------------------------------------------------------------------------------- /book/project_documentation.md: 1 | # Project documentation 2 | 3 | Documenting your project will make it much easier for others to understand your goal and ways of working, whether you're developing a package or collaborating on a piece of analysis. 4 | 5 | 6 | ## README 7 | 8 | When working on a collaborative or open coding project, it's good practice to provide an overview of your project in a README file. 9 | This allows users or developers to grasp the overall goal of your project. 10 | As well as a description of the project, it might include examples using your code or references to other associated projects. 11 | This file can be any text type, including `.txt`, `.md`, and `.rst`, and can be associated with your automated documentation. 12 | 13 | We suggest the following for a good README: 14 | 15 | - Short statement of intent. 16 | - A longer description of the problem that your project solves and how it solves it. 17 | - Basic installation instructions or link to installation guide. 18 | - Example usage. 19 | - Screenshot if your project has a graphical user interface.
20 | - Links to other project guidance, for example methodological papers or document repositories. 21 | - Links to related projects. 22 | 23 | 24 | ## Contributing guidance 25 | 26 | When collaborating, it is also useful to set out the standards used within your project. 27 | This might include particular packages required for certain tasks and guidance on the [code style](code-style) used in the project. 28 | Consider including a code of conduct if you plan to have contributors from outside your organisation. 29 | Please [see GitHub](https://docs.github.com/en/github/building-a-strong-community/adding-a-code-of-conduct-to-your-project) for advice on creating a code of conduct. 30 | 31 | See the CONTRIBUTING file from our [gptables package](https://github.com/best-practice-and-impact/gptables/blob/master/CONTRIBUTING.md) for an example: 32 | 33 | `````{tabs} 34 | 35 | ````{tab} Markdown 36 | ```{code-block} 37 | # Contributing 38 | 39 | When contributing to this repository, please first discuss the change you wish 40 | to make via issue, email, or any other method with the owners before making a change. 41 | 42 | ## Pull/merge request process 43 | 44 | 1. Branch from the `dev` branch. If you are implementing a feature name it 45 | `feature/name_of_feature`, if you are implementing a bugfix name it 46 | `bug/issue_name`. 47 | 2. Update the README.md and other documentation with details of major changes 48 | to the interface, this includes new environment variables, useful file 49 | locations and container parameters. 50 | 3. Once you are ready for review please open a pull/merge request to the 51 | `dev` branch. 52 | 4. You may merge the Pull/Merge Request in once you have the sign-off of two 53 | maintainers. 54 | 5. If you are merging `dev` to `master`, you must increment the version number 55 | in the VERSION file to the new version that this Pull/Merge Request would 56 | represent. The versioning scheme we use is [SemVer](http://semver.org/). 57 | 58 | ## Code style 59 | 60 | - We name variables using few nouns in lowercase, e.g. `mapping_names` 61 | or `increment`. 62 | - We name functions using verbs in lowercase, e.g. `map_variables_to_names` or 63 | `change_values`. 64 | - We use the [numpydoc](https://numpydoc.readthedocs.io/en/latest/format.html) 65 | format for documenting features using docstrings. 66 | 67 | ## Review process 68 | 69 | 1. When we want to release the package we will request a formal review for any 70 | non-minor changes. 71 | 2. The review process follows a similar process to ROpenSci. 72 | 3. Reviewers will be requested from associated communities. 73 | 4. The `dev` branch will only be released once reviewers are satisfied. 74 | ``` 75 | ```` 76 | 77 | ````{tab} HTML 78 |
>Contributing</h2>
79 | 80 | When contributing to this repository, please first discuss the change you wish 81 | to make via issue, email, or any other method with the owners of this 82 | repository before making a change. 83 | 84 |
>Pull/merge request process</h2>
85 | 86 | 1. Branch from the `dev` branch. If you are implementing a feature name it 87 | `feature/name_of_feature`, if you are implementing a bugfix name it 88 | `bug/issue_name`. 89 | 2. Update the README.md and other documentation with details of major changes 90 | to the interface, this includes new environment variables, useful file 91 | locations and container parameters. 92 | 3. Once you are ready for review please open a pull/merge request to the 93 | `dev` branch. 94 | 4. You may merge the Pull/Merge Request in once you have the sign-off of two 95 | maintainers. 96 | 5. If you are merging `dev` to `master`, you must increment the version number 97 | in the VERSION file to the new version that this Pull/Merge Request would 98 | represent. The versioning scheme we use is [SemVer](http://semver.org/). 99 | 100 |
>Code style</h2>
101 | 102 | - We name variables using few nouns in lowercase, e.g. `mapping_names` 103 | or `increment`. 104 | - We name functions using verbs in lowercase, e.g. `map_variables_to_names` or 105 | `change_values`. 106 | - We use the [numpydoc](https://numpydoc.readthedocs.io/en/latest/format.html) 107 | format for documenting features using docstrings. 108 | 109 |
>Review process</h2>
110 | 111 | 1. When we want to release the package we will request a formal review for any 112 | non-minor changes. 113 | 2. The review process follows a similar process to ROpenSci. 114 | 3. Reviewers will be requested from associated communities. 115 | 4. Only once reviewers are satisfied, will the `dev` branch be released. 116 | ```` 117 | ````` 118 | 119 | In this case we have outlined our standard practices for using version control on GitHub, 120 | the code style that we are using in the project, and the review process that we follow. 121 | We have used the [Markdown](https://daringfireball.net/projects/markdown/syntax) (`.md`) markup language for this document, 122 | which is formatted into HTML when viewed on our repository. 123 | 124 | 125 | ## User desk instructions 126 | 127 | If your project is very user-focused for one particular task, 128 | for example, developing a statistics production pipeline for other analysts to execute, 129 | it is very important that the code users understand how to appropriately run your code. 130 | 131 | These instructions should include: 132 | 133 | - How to set up an environment to run your code (including how to install dependencies). 134 | - How to configure the project, for example how to set folders and environment variables if you use them. 135 | - How to run your code. 136 | - What outputs (if any) your code or system produces and how these should be interpreted. 137 | - What quality assurance has been carried out and what further quality assurance of outputs is required. 138 | - How to maintain your project (including how to update data sources). 139 | 140 | 141 | ## Dependencies 142 | 143 | The environment that your code runs in includes the machine, the operating system (Windows, Mac, Linux...), the programming language, and any external packages. 144 | It is important to record this information to ensure reproducibility. 145 | 146 | The simplest way to document which packages your code is dependent on is to record them in a text file. 147 | We typically call this text file `requirements.txt`. 148 | 149 | Python packages record their dependencies within their `setup.py` file, via `setup(install_requires=...)`. 150 | You can get a list of your installed Python packages using `pip freeze` in the command line. 151 | 152 | [R packages](https://r-pkgs.org/) and projects record their dependencies in a [DESCRIPTION](https://r-pkgs.org/description.html) file. 153 | Packages are listed under the `Imports` key. 154 | You can get a list of your installed R packages using the `installed.packages()` function. 155 | 156 | You will find environment management tools, such as 157 | [`renv`](https://rstudio.github.io/renv/articles/renv.html) for R or 158 | [`pyenv`](https://github.com/pyenv/pyenv) for Python, useful for keeping track of software and package versions used in a project. 159 | 160 | 161 | ## Citation 162 | 163 | It can be helpful to provide a citation template for research or analytical code that is likely to be referenced by others. 164 | You can include this in your code repository as a `CITATION` file or part of your `README`. 165 | For example, the R package `ggplot2` provides the following: 166 | 167 | ```none 168 | To cite ggplot2 in publications, please use: 169 | 170 | H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 171 | 2009.
172 | 173 | A BibTeX entry for LaTeX users is 174 | 175 | @Book{, 176 | author = {Hadley Wickham}, 177 | title = {ggplot2: elegant graphics for data analysis}, 178 | publisher = {Springer New York}, 179 | year = {2009}, 180 | isbn = {978-0-387-98140-6}, 181 | url = {http://had.co.nz/ggplot2/book}, 182 | } 183 | ``` 184 | 185 | If your project includes multiple datasets, pieces of code, or outputs with their own 186 | [DOIs](https://en.wikipedia.org/wiki/Digital_object_identifier), this might include multiple citations. 187 | 188 | See this [GitHub guide for more information on making your public code citable](https://guides.github.com/activities/citable-code/). 189 | 190 | 191 | ## Vignettes 192 | 193 | Vignettes are a form of supplementary documentation, containing applied examples that demonstrate the intended use of the code in your project or package. 194 | Docstrings may contain examples applying individual functional units, while vignettes may show multiple units being used together. 195 | We use the term vignette with reference to R packages, for example this introduction to the 196 | [`dplyr` package](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) for data manipulation. 197 | However, the same long-form documentation is beneficial for projects in any programming language - for instance the 198 | [`pandas` basics guide](https://pandas.pydata.org/docs/user_guide/basics.html). 199 | 200 | We've seen that [docstrings](docstrings) can be used to describe individual functional code elements. 201 | Vignettes demonstrate the intended use for these classes and functions in a realistic context. 202 | This shows users how different code elements interact and how they might use your code in their own program. 203 | 204 | Another good example is this vignette describing [how to design vignettes](http://r-pkgs.had.co.nz/vignettes.html) in Rmarkdown. 205 | You can produce this type of documentation in any format, though Rmarkdown is particularly effective at combining sections of code, 206 | code outputs, and descriptive text. 207 | 208 | You might also consider providing examples in an interactive notebook that users can run for themselves. 209 | 210 | 211 | ## Versioning 212 | 213 | Documenting the version of your code provides distinct points of reference in the code's development. 214 | Recording the version of code used for analysis is important for reproducing your work. 215 | Combining versioning with [](version_control.md) allows you to recover the exact code used to run your analysis, and thus reproduce the same results. 216 | 217 | [Semantic versioning](https://semver.org/) provides useful rules for versioning releases of your code. 218 | Follow these rules to help other users of your code understand how changes in your code may affect their software. 219 | Each level of version number indicates the extent of changes to the application programming interface (API) of your code, 220 | i.e., the part of the code that a user interacts with directly. 221 | Changes to the major version number indicate changes to the API that are not compatible with use of previous versions of the code. 222 | Changes to the minor and patch numbers indicate changes that are, respectively, backwards compatible or have no effect on the use of the code.
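For example, under these rules a hypothetical package moving on from version 1.4.2 might be numbered as follows:

```none
1.4.2 -> 1.4.3  patch: a bug fix; how the code is used is unaffected
1.4.2 -> 1.5.0  minor: new, backwards compatible functionality
1.4.2 -> 2.0.0  major: a breaking change, such as removing or renaming a function
```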
223 | 224 | ```{figure} ./_static/semantic_versioning.png 225 | --- 226 | width: 70% 227 | name: semantic_versioning 228 | --- 229 | [Semantic versioning](https://semver.org/) 230 | ``` 231 | 232 | You'll see this, or a similar version numbering, on packages that you install for Python and R. 233 | 234 | Changes to this book don't cause backwards-compatibility issues in the same sense as code. 235 | We've chosen to use a form of calendar versioning ([CalVer](https://calver.org/)). 236 | You'll see the current version below the site's table of contents, where the first four digits represent the year in which the latest changes were made. 237 | The incremental number following the full stop indicates how many versions of the guidance have been published in that year. 238 | As this guidance will change over time, this version number provides users with a reference for citing a specific state of the guidance. 239 | 240 | 241 | ## Changelog 242 | 243 | A changelog records the major changes that have occurred to a project or package between versioned releases of the code. 244 | 245 | ```{code-block} 246 | # Changelog 247 | All notable changes to this project will be documented in this file. 248 | 249 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), 250 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 251 | 252 | ## [1.0.0] - 2020-01-21 253 | ### Added 254 | - `add_to_each_in_list()` 255 | - online sphinx-generated documentation 256 | - contribution guide 257 | 258 | ### Removed 259 | - `subtract_to_each_in_list()` 260 | 261 | ### Changed 262 | - Improved function documentation 263 | 264 | ### Fixed 265 | - bug in `multiply_each_in_list()`, where output was not returned 266 | ``` 267 | 268 | As with versioning, users find a changelog useful for determining whether an update to your code is compatible with their work, which may depend on your code. 269 | It can also document which parts of your code will no longer be supported in future versions and which bugs in your code you have addressed. 270 | Your changelog can be in any format and should be associated with your code documentation, so that it is easy for users and other contributors to find. 271 | 272 | [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) provides a simple but effective template for recording changes to your code. 273 | 274 | 275 | ## Copyright and Licenses 276 | 277 | Copyright indicates ownership of work. 278 | All material created by civil servants, ministers, government departments and their agencies is covered by 279 | [Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/). 280 | It is not essential to include a copyright notice on your work, but doing so can help to avoid confusion around ownership. We recommend including one in government projects. 281 | 282 | Licenses outline the conditions under which others may use, modify and/or redistribute your work. 283 | As such, including a license with code is important for users and other developers alike. 284 | This [online tool](https://choosealicense.com/) might help you to choose an appropriate license for your project. 285 | The Government Digital Service generally recommends using the 286 | [MIT license](https://opensource.org/licenses/MIT) for code and the 287 | [Open Government License (OGL)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/) for documentation.
288 | 289 | Both copyright and license are usually placed in a LICENSE file in your project. 290 | For example, an MIT LICENSE file might look like: 291 | 292 | > Copyright 2020, Crown copyright 293 | > 294 | > Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), 295 | > to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 296 | > and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 297 | > 298 | > The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 299 | > 300 | > THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, 301 | > INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 302 | > IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 303 | > WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 304 | 305 | 306 | ## Open source your code 307 | 308 | In government, we [support and promote open source](https://www.gov.uk/service-manual/service-standard/point-12-make-new-source-code-open) whenever possible. 309 | [Open source](https://opensource.com/resources/what-open-source) software is software with source code that anyone can freely inspect, modify and enhance. 310 | As a government analyst, you should aim to make all new source code open, unless you can justify withholding part of your source code. 311 | 312 | Open sourcing code benefits you, other government analysts, and the public. 313 | 314 | Personal benefits from open sourcing include: 315 | 316 | - Attribution - coding in the open creates a public record of your contributions to analysis and software. 317 | - Collaboration - you can gain experience working with analysts in other departments. 318 | - Review - peers and experts in the field can provide advice on improving your analysis and coding. 319 | - Feels good - we regularly benefit from open source software, so it's nice to give something back. 320 | 321 | While the public benefit from: 322 | 323 | - Transparency - stakeholders can understand and reproduce our analysis, which is a core element of the [Statistics Code of Practice](https://code.statisticsauthority.gov.uk/). 324 | - Sharing value - others can benefit from our work, either through reuse or demonstration of good practices. 325 | - Sharing opportunity - others can gain insight and experience from reading and possibly contributing to your code. 326 | 327 | Please see the [Government Digital Service (GDS) guidance](https://www.gov.uk/government/publications/open-source-guidance/when-code-should-be-open-or-closed) 328 | for help deciding when code should be open or closed. 329 | Further 330 | [GDS guidance](https://www.gov.uk/government/publications/open-source-guidance/security-considerations-when-coding-in-the-open) addresses security concerns for coding in the open. Additional guidance on deciding when and how to open source, the benefits, and good practice is available from the [Analysis Function](https://analysisfunction.civilservice.gov.uk/policy-store/open-sourcing-analytical-code/).
-------------------------------------------------------------------------------- /book/project_structure.md: 1 | # Structuring your project 2 | 3 | When you are designing your analysis, it can be difficult to keep your thoughts tidy. 4 | Analysis is often exploratory and subject to change. 5 | This means that scripts and programs can become messy. 6 | The messier the programs, the harder they are to maintain and change in future. 7 | 8 | Good directory structure and file hygiene can go a long way to mitigate this. 9 | It will help others to read your code more easily, and you will write better code into the bargain. 10 | Some structures have been found to be generally quite effective through trial and error. 11 | Others are more specific, and - as with all guidelines - should not be taken as mandatory. 12 | 13 | 14 | ## Run scripts from end to end to ensure your code is executed reproducibly 15 | 16 | As you begin developing your project, it's a good idea to save your working code in a script file. 17 | In R these are saved as `.R` files, and in Python as `.py`. 18 | You can use scripts within an Integrated Development Environment (IDE) like 19 | [Visual Studio Code](https://code.visualstudio.com/), [RStudio](https://rstudio.com/), or [PyCharm](https://www.jetbrains.com/pycharm/). 20 | Inside an IDE you can usually run through your script line-by-line, or run the whole file at once. 21 | This can be an easier workflow than running code in the Python or the R console and then rewriting the same code in a script later. 22 | 23 | Scripts serve as the basic units of saved code. 24 | Often, we like to define functions or reusable bits of code in one file and then use these in another file. 25 | For example, we may write a few functions that help us to calculate the `mean`, `mode`, and `median` of our dataset in the `functions.R` file. 26 | Then we can use those functions in our main script, saved in `main.R`. 27 | 28 | Outside of an IDE you can also run your scripts using the Python or R interpreters from the command line. 29 | This also allows other programs to run your scripts. 30 | For example, you can use the `Rscript <file.R>` command to run your R scripts or the `python <file.py>` command to run your Python scripts. 31 | 32 | Running your analysis files from end to end ensures that your code is executed in the same order each time. 33 | It also runs the code with a clean environment, without variables or other objects from previous runs that can be a common source of errors. 34 | 35 | 36 | ## Keep your project structure clean 37 | 38 | As your analysis project grows, it becomes more important to keep your project structure clean. 39 | Every project is different and the right way to organise your project might differ from another project. 40 | However, there are some principles that are useful to consider. 41 | 42 | 43 | ### Use good filename conventions 44 | 45 | Much like the names of elements in your code, good filenames inform you of the purpose of a file. 46 | Within a project, you should follow a standard file naming convention. 47 | Good naming practices improve your ability to locate and identify the contents of files. 48 | 49 | Good naming conventions include: 50 | 51 | * Consistency, above all else. 52 | * Short but descriptive and human readable names. 53 | * No spaces, for machine readability - underscores (`_`) or dashes (`-`) are preferred. 54 | * Use of consistent date formatting (e.g.
[YYYY-MM-DD](https://en.wikipedia.org/wiki/ISO_8601)). 55 | * Padding the left side of numbers with zeros to maintain order - e.g. `001` instead of `1`. The number of zeros should reflect the expected number of files. 56 | 57 | When using dates or times to name files, start with the largest unit of time and work towards the smallest. 58 | So we would use `2020-10-15_data_input`, and not `15-10-2020_data_input`. 59 | This will ensure that the default ordering of these files is in chronological order. 60 | This makes it much easier to find the earliest or latest files. 61 | 62 | You should start filenames with numbers to order files, if ordering is logical and informative. 63 | For example, where the `001_introduction` should come before `002_methodology` and `003_results`. 64 | 65 | 66 | ### Organise analysis as a Directed Acyclic Graph 67 | 68 | Analysis can best be thought of as a Directed Acyclic Graph (DAG). 69 | Don't let the name scare you off! 70 | All we mean by this is that you start off with the input data, you finish with the output(s), and in between there are no lines that link backwards. 71 | 72 | ```{figure} https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Tred-G.svg/800px-Tred-G.svg.png 73 | --- 74 | width: 50% 75 | name: dag 76 | alt: Analysis is a directed, acyclic graph that links the input data through a series of steps to the outputs. 77 | --- 78 | Defining analysis as a DAG - linking the input data at (a) to the output at (e). 79 | ``` 80 | 81 | When thinking about how to structure your project it is useful to think in terms of what your project's DAG looks like. 82 | Each project should be kept in its own folder, with a descriptive name. 83 | Most analysis will have an ingest or input stage, a processing stage, and a reporting stage. 84 | You should use folders within each project to separate raw data, documentation, source code (or `src`) and results. 85 | 86 | A typical analytical project folder in Python might look like: 87 | 88 | ```` {tabs} 89 | ``` {code-tab} py 90 | |-- README.md 91 | |-- requirements.txt 92 | |-- data/ 93 | | -- incident_counts.csv 94 | |-- docs/ 95 | | -- notebook.md 96 | | -- manuscript.md 97 | | -- changelog.md 98 | |-- results/ 99 | | -- incident_counts_by_age.csv 100 | | -- incidents_over_time.svg 101 | |-- src/ 102 | | -- data_cleaning.py 103 | | -- main_analysis.py 104 | | -- generate_plots.py 105 | ``` 106 | 107 | ``` {code-tab} r R 108 | |-- README.md 109 | |-- project.Rproj 110 | |-- data/ 111 | | -- incident_counts.csv 112 | |-- docs/ 113 | | -- notebook.md 114 | | -- manuscript.md 115 | | -- changelog.md 116 | |-- results/ 117 | | -- incident_counts_by_age.csv 118 | | -- incidents_over_time.svg 119 | |-- R/ 120 | | -- data_cleaning.R 121 | | -- main_analysis.R 122 | | -- generate_plots.R 123 | ``` 124 | ```` 125 | 126 | Where you have written code that is used by multiple projects, this code should reside in its own separate folder. 127 | This will allow you to record changes to your code independent of other dependencies of each project. 128 | 129 | 130 | ### Preserve raw data 131 | 132 | You should not alter raw data - treat it as read-only. 133 | You should conduct data cleaning on a copy of the raw data, so that you can document which cleaning decisions have been made. 134 | 135 | Your project structure should include an immutable store for raw data. This does not have to be inside the code repository and it is usually best to separate operational folders for inputs and outputs from the code base. 
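A minimal sketch of this pattern in Python, assuming the example folder layout above (the file names and cleaning step are illustrative):

```{code-block} python
from pathlib import Path

import pandas as pd

RAW_DATA = Path("data") / "incident_counts.csv"   # read-only raw input
RESULTS_DIR = Path("results")                     # regenerated outputs

# Read the raw data, but never write back to the raw data folder.
raw = pd.read_csv(RAW_DATA)

# Clean a copy of the data, so that every cleaning decision is recorded in code.
cleaned = raw.dropna().drop_duplicates()

RESULTS_DIR.mkdir(exist_ok=True)
cleaned.to_csv(RESULTS_DIR / "incident_counts_cleaned.csv", index=False)
```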
136 | 137 | 138 | ### Check that outputs are disposable 139 | 140 | You should be able to delete your outputs without worry, because you can easily regenerate them by re-running your pipeline. 141 | If you are worried about deleting your outputs (i.e., results), then it is likely you lack confidence in reproducing your results. 142 | 143 | Frequently deleting and regenerating your outputs when developing analysis is good practice. 144 | 145 | 146 | ## Structure code as modules and packages 147 | 148 | Code that is complex, high-risk, or reusable between projects can benefit from being structured into a package. 149 | Modules are single files that contain one or more reusable units of code. 150 | A package typically contains multiple related modules. 151 | 152 | It's likely that you've already used a package written by somebody else as part of your analysis. 153 | For example, installing additional functionality for Python using `pip install <package_name>` on the command line or 154 | running `install.packages("<package_name>")` in an R interpreter. 155 | 156 | ```{admonition} Key Learning 157 | :class: admonition-learning 158 | 159 | [The Python Packaging User Guide](https://python-packaging-user-guide.readthedocs.io/) describes good packaging practices using the most up-to-date Python tools. 160 | [The R Packages book](https://r-pkgs.org/) provides a comprehensive summary of packaging in R. 161 | [The rOpenSci packaging guide](https://devguide.ropensci.org/building.html) also contains useful tips for packaging in R. 162 | ``` 163 | 164 | See [](project_documentation.md) for a summary of common package and project documentation types. 165 | 166 | 167 | ## Use project templates 168 | 169 | Although project structure can be flexible, many analysts choose to use similar structures for multiple projects. 170 | Consistency in structure makes it easier to navigate unfamiliar projects, especially if they use the same coding style. 171 | It means that members of the team can quickly orient themselves when joining an existing project or starting a new one. 172 | 173 | [Cookiecutter](https://github.com/cookiecutter/cookiecutter) is a command-line tool that creates projects from templates (cookiecutters). 174 | Using an existing cookiecutter, or creating one to meet your needs, can be a useful way to increase consistency in structure between your projects. 175 | Creating common folder structures, laying out essential documentation, and starting your code from a basic boilerplate all increase consistency, and can save you time. 176 | Laying out a structure that includes documentation and code testing also encourages good practices. 177 | 178 | Useful cookiecutters include: 179 | 180 | * The government data science [govcookiecutter](https://github.com/ukgovdatascience/govcookiecutter), including data security features. 181 | * The comprehensive Python data science project template [cookiecutter-data-science](http://drivendata.github.io/cookiecutter-data-science/). 182 | * The Python package template [cookiecutter-pypackage](https://cookiecutter-pypackage.readthedocs.io/en/latest/). 183 | 184 | RStudio provides a standard template for R packages via `File > New Project... > New Directory > R Package`. 185 | We have created some basic templates for an [R package](https://github.com/best-practice-and-impact/example-package-r) 186 | and a [Python package](https://github.com/best-practice-and-impact/example-package-python) that may be helpful.
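For example, you could start a new project from one of these templates using the `cookiecutter` command line tool, which prompts you for project details and then generates the folder structure for you (shown here with the govcookiecutter template mentioned above):

```{code-block} shell
pip install cookiecutter
cookiecutter https://github.com/ukgovdatascience/govcookiecutter
```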
187 | R project structures can also be set up or extended, one component at a time, using the [`usethis` workflow package](https://usethis.r-lib.org/). `usethis` sets up packages that follow the structure described in the R Packages book. 188 | For example, `use_test()` will add the directories necessary for testing using `testthat` and generate basic test file templates for a given function name. 189 | 190 | 191 | ## Use version controlled repositories 192 | 193 | Repositories or 'repos' are typically project folders that are version controlled using Git or a similar version control system. 194 | One repository usually contains a single project. 195 | Developing your project using a version controlled repository has significant benefits for reproducibility. 196 | 197 | See [](version_control.md) for more information. 198 | -------------------------------------------------------------------------------- /book/tools.md: 1 | # Tools 2 | 3 | This section highlights tools that support reproducible analysis and research. 4 | This includes tools for general software development and bespoke packages that have been developed for government analysis. 5 | Those developed or contributed to within government are marked with the abbreviation (gov). 6 | 7 | If you have developed a package for use in analysis or recommend any that are not included here, please add them to the list. 8 | You can request a new tool to be added to the list by [creating an issue on GitHub](https://github.com/best-practice-and-impact/qa-of-code-guidance/issues/new/choose) 9 | or [contacting us by email](mailto:ASAP@ons.gov.uk?subject=Duck%20Book%20Tools). 10 | Alternatively, you can add it directly to the project by creating a Pull Request. 11 | You can do this using the "Suggest edit" link under the GitHub logo at the top of this page. 12 | Please include a link and brief description when requesting a new tool to be added. 13 | 14 | The tools included on this page will generally follow the good quality assurance practices described in this guidance. 15 | However, as with any software, there is a chance that they may still contain bugs or limitations. 16 | Please apply your own judgement when using them. 17 | If you feel a tool should no longer be included in this list, please suggest an edit or get in touch. 18 | 19 | ## Data manipulation and analysis 20 | 21 | Manipulating and analysing data. 22 | 23 | ### Python 24 | 25 | * [pandas](https://pandas.pydata.org/) - common data analysis and manipulation 26 | * [Polars](https://www.pola.rs/) - high performance data manipulation 27 | * [PySpark](https://spark.apache.org/docs/latest/api/python/) - data manipulation for distributed (large) data 28 | * [Splink](https://moj-analytical-services.github.io/splink/) (gov) - probabilistic data linkage 29 | 30 | ### R 31 | 32 | * [dplyr](https://dplyr.tidyverse.org/) - common data analysis and manipulation 33 | * [sparklyr](https://spark.rstudio.com/) - for distributed (large) data 34 | 35 | ## Publishing 36 | 37 | * afcolours ([Python](https://pypi.org/project/py-af-colours/) and [R](https://cran.r-project.org/web/packages/afcolours/index.html)) (gov) - ease the use of the Analysis Function colour palettes for visually accessible graphs.
38 | * [Quarto](https://quarto.org/) - reproducible documents for Python and R 39 | * [a11ytables (R only)](https://co-analysis.github.io/a11ytables/index.html) (gov) - creating reproducible, accessible spreadsheets 40 | * [gptables (Python and R)](https://gptables.readthedocs.io/en/latest/index.html) (gov) - creating reproducible, accessible spreadsheets 41 | 42 | ## Testing 43 | 44 | Tools for implementing automated code testing. 45 | 46 | ### Python 47 | 48 | * [pytest](https://docs.pytest.org/en/stable/) - common testing framework 49 | * [unittest](https://docs.python.org/3/library/unittest.html) - common testing framework 50 | * [hypothesis](https://hypothesis.readthedocs.io/en/latest/) - property-based testing 51 | * [chispa (PySpark)](https://pypi.org/project/chispa/) - helper for testing PySpark code 52 | * [coverage](https://coverage.readthedocs.io/en/coverage-5.3/) - measuring test coverage 53 | 54 | 55 | ### R 56 | 57 | * [testthat](https://testthat.r-lib.org/) - common testing framework 58 | * [assertr](https://docs.ropensci.org/assertr/) - assertion-based checks on data within analysis pipelines 59 | * [patrick](https://github.com/google/patrick) - parameterised testing extension for `testthat` 60 | * [covr](https://covr.r-lib.org/) - measuring test coverage 61 | 62 | ## Dependency management 63 | 64 | * [venv (Python)](https://docs.python.org/3/library/venv.html) - manage packages using virtual environments 65 | * [pyenv (Python)](https://github.com/pyenv/pyenv) - manage independent Python versions for different projects 66 | * [renv (R)](https://rstudio.github.io/renv/articles/renv.html) - virtual environments for managing packages 67 | * [conda](https://docs.conda.io/en/latest/) - manage language versions and packages for most languages 68 | 69 | ## Version control 70 | 71 | * [Git](https://git-scm.com/) - common open source version control system 72 | * [pre-commit](https://pre-commit.com/) - trigger checks (e.g. linters and formatters) before Git commits are created 73 | 74 | ## Project templates 75 | 76 | * [govcookiecutter (Python)](https://github.com/best-practice-and-impact/govcookiecutter) (gov) - template project for reproducible analysis 77 | * [Rgovcookiecutter (R)](https://github.com/best-practice-and-impact/Rgovcookiecutter) (gov) - template project for reproducible analysis 78 | 79 | ## Code Linters 80 | 81 | Analysing code for stylistic errors, and sometimes bugs. 82 | 83 | ### Python 84 | 85 | * [pylint](https://www.pylint.org/) - check coding style and identify some logical errors 86 | * [flake8](https://flake8.pycqa.org/en/latest/) - check code style 87 | * [Bandit](https://bandit.readthedocs.io/en/latest/) - check for common security issues 88 | * [mypy](https://mypy.readthedocs.io/en/stable/) - check static types 89 | * [Radon](https://radon.readthedocs.io/en/latest/) - check code complexity 90 | 91 | ### R 92 | 93 | * [lintr](https://github.com/jimhester/lintr) - check code style 94 | 95 | 96 | ## Code Formatters 97 | 98 | Automated code formatters. 99 | These check code style, like linters, but also actively make changes to your code to meet a particular style. 100 | 101 | 102 | ### Python 103 | 104 | * [Black](https://black.readthedocs.io/en/stable/) 105 | * [isort](https://pycqa.github.io/isort/) 106 | 107 | 108 | ### R 109 | 110 | * [formatR](https://yihui.org/formatr/) 111 | * [styler](https://styler.r-lib.org/) 112 | 113 | 114 | ## Packaging Code 115 | 116 | Creating and releasing code as a package.
117 | 118 | ### Python 119 | 120 | * [twine](https://pypi.org/project/twine/) - utility for publishing Python packages to [the Python Package Index PyPI](https://pypi.org/) 121 | 122 | ### R 123 | 124 | * [goodpractice](http://mangothecat.github.io/goodpractice/) - gives advice on the quality of your R packages 125 | * [fusen](https://thinkr-open.github.io/fusen/) - builds R packages from Rmarkdown file specifications 126 | 127 | 128 | ## Pipeline Orchestration 129 | 130 | * [Apache Airflow](https://airflow.apache.org/) - workflow management platform 131 | * [targets (R)](https://wlandau.github.io/targets-manual/) - defining and executing pipelines in R 132 | 133 | 134 | (CI-tools)= 135 | ## Continuous Integration Platforms 136 | 137 | * [GitHub Actions](https://github.com/features/actions) 138 | * [GitLab CI/CD](https://docs.gitlab.com/ee/ci/) 139 | * [Travis](https://travis-ci.org/) 140 | * [Jenkins](https://www.jenkins.io/) 141 | * [Coveralls](https://coveralls.io/) 142 | -------------------------------------------------------------------------------- /dev-requirements.txt: -------------------------------------------------------------------------------- 1 | bump-my-version 2 | pre-commit 3 | pymarkdownlnt -------------------------------------------------------------------------------- /early_development/references.md: -------------------------------------------------------------------------------- 1 | # Other references 2 | 3 | ## Examples in government 4 | 5 | Here we point to examples which apply good practices, from across government. 6 | 7 | Other guidance from across government: 8 | * [Department for Education good code practice](https://dfe-analytical-services.github.io/good-code-practice/index.html) 9 | 10 | 11 | ## Government blog posts 12 | 13 | * [Pair programming](https://gds.blog.gov.uk/2018/02/06/how-to-pair-program-effectively-in-6-steps/) 14 | 15 | ## Troubleshooting code 16 | 17 | When asking for help from your peers or online communities like [stackoverflow](https://stackoverflow.com/), it's best to provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). 18 | Without a simplified example, it can be difficult to diagnose your problem and suggest potential solutions. 19 | You might even find that breaking the problem down to simplify the example leads you to a solution. 20 | 21 | * We're all guilty of using tools without reading the manual - give it a search 22 | * [Google](https://www.google.co.uk) 23 | * [stackoverflow](https://stackoverflow.com/) 24 | * Reach out to communities, like the [Government Data Science Slack](https://govdatascience.slack.com) 25 | * Take a break - come back at it with a clear head 26 | 27 | 28 | ## Machine learning 29 | 30 | Although machine learning is not tackled specifically in this guidance, many of the good practices outlined here are applicable. 31 | These links detail application of good practices to machine learning. 
32 | 33 | * [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/index.html) - a guide to making black box models explainable 34 | * [Effective testing for machine learning systems](https://www.jeremyjordan.me/testing-ml/) 35 | * [Machine learning with limited data](https://www.gov.uk/government/publications/machine-learning-with-limited-data) 36 | 37 | ## Miscellaneous learning 38 | 39 | * [Comparing data wrangling in Python and R](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html) 40 | * [The Unix Workbench](https://seankross.com/the-unix-workbench/) - a guide to getting started with Unix operating systems. 41 | * [The Missing Semester](https://www.youtube.com/channel/UCuXy5tCgEninup9cGplbiFw) - recorded computer science lectures from MIT 42 | * [Teaching materials for computational social science](https://github.com/collections/teaching-computational-social-science) 43 | * [Google Tech Dev Guide](https://techdevguide.withgoogle.com/) 44 | * [Packt free learning](https://www.packtpub.com/free-learning) - daily free e-books 45 | * [The Hitchhiker's Guide to Python](https://docs.python-guide.org/) - everything Python 46 | * [An introduction to R Markdown - writing dynamic and reproducible documents, by Olivier Gimenez](https://oliviergimenez.github.io/intro_rmarkdown/#1) 47 | * [Contributing to open source: A short guide for organizations, by Chris Holdgraf](https://predictablynoisy.com/posts/2020/organizations-help-oss-guide/) 48 | * [Abstract Base Classes in Python](https://pymotw.com/3/abc/) 49 | 50 | ## Challenge yourself 51 | 52 | * [Project Euler](https://projecteuler.net/) 53 | * [Codewars](https://www.codewars.com/) 54 | * [Rosalind](http://rosalind.info/problems/locations/) 55 | -------------------------------------------------------------------------------- /early_development/validation-errors-logs.md: 1 | # Validation, errors and logging 2 | 3 | How will you know if something goes wrong? 4 | How will your pipeline react if there is an unexpected upstream change? 5 | What diagnostics do you need? 6 | 7 | In a perfect world, we would expect our data to come in on time, in the right shape, and with no new errors. 8 | Unfortunately, we live in the real world. 9 | Data come in late or early, in the wrong form, with errors that make you go "how in hell did that happen". 10 | 11 | Your program must be robust to these problems. 12 | -------------------------------------------------------------------------------- /requirements.txt: 1 | sphinx>5 2 | jupyter-book>0.12 3 | sphinx-tabs --------------------------------------------------------------------------------