├── bin ├── lib ├── data ├── test_data ├── docs │ ├── 01_brief_description.md │ ├── 02_introduction.md │ ├── 03_compute_requirements.md │ ├── 04_install_and_run.md │ ├── 05_related_protocols.md │ ├── 06_input_example.md │ ├── 06_input_parameters.md │ ├── 07_outputs.md │ ├── 08_pipeline_overview.md │ ├── 09_troubleshooting.md │ ├── 10_FAQ.md │ └── 11_other.md ├── .gitignore ├── .gitmodules ├── .github │ ├── workflows │ │ └── issue-autoreply.yml │ └── ISSUE_TEMPLATE │ ├── config.yml │ ├── question.yml │ ├── feature_request.yml │ └── bug_report.yml ├── .pre-commit-config.yaml ├── test │ ├── run_fastq_ingress_test.sh │ └── test_fastq_ingress.py ├── .gitlab-ci.yml ├── nextflow.config ├── output_definition.json ├── CHANGELOG.md ├── main.nf ├── LICENSE ├── nextflow_schema.json └── README.md /bin: -------------------------------------------------------------------------------- 1 | wf-metagenomics/bin -------------------------------------------------------------------------------- /lib: -------------------------------------------------------------------------------- 1 | wf-metagenomics/lib -------------------------------------------------------------------------------- /data: -------------------------------------------------------------------------------- 1 | wf-metagenomics/data -------------------------------------------------------------------------------- /test_data: -------------------------------------------------------------------------------- 1 | wf-metagenomics/test_data -------------------------------------------------------------------------------- /docs/01_brief_description.md: -------------------------------------------------------------------------------- 1 | Taxonomic classification of 16S rRNA gene sequencing data. 
-------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | nextflow 2 | .nextflow* 3 | template-workflow 4 | .*.swp 5 | .*.swo 6 | *.pyc 7 | *.pyo 8 | .DS_store 9 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "wf-metagenomics"] 2 | path = wf-metagenomics 3 | url = https://github.com/epi2me-labs/wf-metagenomics 4 | -------------------------------------------------------------------------------- /.github/workflows/issue-autoreply.yml: -------------------------------------------------------------------------------- 1 | name: Issue Auto-Reply 2 | 3 | on: 4 | issues: 5 | types: [opened] 6 | 7 | jobs: 8 | auto-reply: 9 | uses: epi2me-labs/.github/.github/workflows/issue-autoreply.yml@main 10 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: false 2 | contact_links: 3 | - name: Nanopore customer support 4 | url: https://nanoporetech.com/contact 5 | about: For general support, including bioinformatics questions. 6 | -------------------------------------------------------------------------------- /docs/03_compute_requirements.md: -------------------------------------------------------------------------------- 1 | Recommended requirements: 2 | 3 | + CPUs = 12 4 | + Memory = 32GB 5 | 6 | Minimum requirements: 7 | 8 | + CPUs = 6 9 | + Memory = 16GB 10 | 11 | Approximate run time: ~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database. 
12 | 13 | ARM processor support: True 14 | -------------------------------------------------------------------------------- /docs/11_other.md: -------------------------------------------------------------------------------- 1 | + [How to build and use databases to run wf-metagenomics and wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 2 | + [Selecting the correct databases in wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 3 | + [How to evaluate unclassified sequences](https://epi2me.nanoporetech.com/post-meta-analysis/). 4 | 5 | See the [EPI2ME website](https://labs.epi2me.io/) for many other resources and blog posts. 6 | -------------------------------------------------------------------------------- /docs/05_related_protocols.md: -------------------------------------------------------------------------------- 1 | This workflow is designed to take input sequences that have been produced by [Oxford Nanopore Technologies](https://nanoporetech.com/) devices using protocols associated with either of the kits listed below: 2 | 3 | - [SQK-MAB114.24](https://nanoporetech.com/document/microbial-amplicon-barcoding-sequencing-for-16s-and-its-sqk-mab114-24) 4 | - [SQK-16S114.24](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/rapid-sequencing-DNA-16s-barcoding-kit-v14-sqk-16114-24) 5 | 6 | Find related protocols in the [Nanopore community](https://community.nanoporetech.com/docs/). -------------------------------------------------------------------------------- /docs/09_troubleshooting.md: -------------------------------------------------------------------------------- 1 | + If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine whether the issue is related to the environment, the input parameters, or a bug. 2 | + See how to interpret some common Nextflow exit codes [here](https://labs.epi2me.io/trouble-shooting/). 
3 | + When using the Minimap2 pipeline with a custom database, you must make sure that the `ref2taxid` file, the reference and the taxonomy database are consistent with one another. 4 | + If your device doesn't have the resources to use large Kraken2 databases, you can enable `kraken2_memory_mapping` to reduce the amount of memory required. 5 | + To enable the IGV viewer with a custom reference, the reference must be a FASTA file, not a minimap2 MMI format index. 6 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/question.yml: -------------------------------------------------------------------------------- 1 | name: Question 2 | description: Ask a generic question about this project unrelated to features or bugs. 3 | labels: ["question"] 4 | body: 5 | - type: markdown 6 | attributes: 7 | value: | 8 | Please reserve this form for issues not related to bugs or feature requests. If our developers deem your questions to be related to bugs or features, you will be asked to fill in the appropriate form. 9 | - type: textarea 10 | id: question1 11 | attributes: 12 | label: Ask away! 13 | placeholder: | 14 | Bad question: How do I use this workflow in my HPC cluster? 15 | Good question: My HPC cluster uses a GridEngine scheduler. Can you point me to documentation for how to use your workflows to efficiently submit jobs to my cluster? 16 | validations: 17 | required: true 18 | -------------------------------------------------------------------------------- /docs/02_introduction.md: -------------------------------------------------------------------------------- 1 | This workflow can be used for the following: 2 | 3 | + Taxonomic classification of 16S rRNA, 18S rRNA and ITS amplicons using [default or custom databases](#faqs). Default databases: 4 | - NCBI targeted loci: 16S rDNA, 18S rDNA, ITS (ncbi_16s_18s, ncbi_16s_18s_28s_ITS; see [here](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) for details). 
5 | + Generate taxonomic profiles of one or more samples. 6 | 7 | The workflow's default parameters are optimised for the analysis of 16S rRNA gene amplicons. 8 | For ITS amplicons, it is strongly recommended that some parameters are changed from the defaults; please see the [ITS presets](#analysing-its-amplicons) section for more information. 9 | 10 | Additional features: 11 | + Two different approaches are available: `minimap2` (alignment-based, the default option) or `kraken2` (k-mer based). 12 | + Results include: 13 | - An abundance table with counts per taxon across all samples. 14 | - Interactive Sankey and sunburst plots to explore the different identified lineages. 15 | - A bar plot comparing the abundances of the most abundant taxa across all samples. 16 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.yml: -------------------------------------------------------------------------------- 1 | name: Feature request 2 | description: Suggest an idea for this project 3 | labels: ["feature request"] 4 | body: 5 | 6 | - type: textarea 7 | id: question1 8 | attributes: 9 | label: Is your feature related to a problem? 10 | placeholder: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 11 | validations: 12 | required: true 13 | - type: textarea 14 | id: question2 15 | attributes: 16 | label: Describe the solution you'd like 17 | placeholder: A clear and concise description of what you want to happen. 18 | validations: 19 | required: true 20 | - type: textarea 21 | id: question3 22 | attributes: 23 | label: Describe alternatives you've considered 24 | placeholder: A clear and concise description of any alternative solutions or features you've considered. 25 | validations: 26 | required: true 27 | - type: textarea 28 | id: question4 29 | attributes: 30 | label: Additional context 31 | placeholder: Add any other context about the feature request here. 
32 | validations: 33 | required: false 34 | 35 | -------------------------------------------------------------------------------- /docs/06_input_example.md: -------------------------------------------------------------------------------- 1 | This workflow accepts either FASTQ or BAM files as input. 2 | 3 | The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. 4 | 5 | ``` 6 | (i) (ii) (iii) 7 | input_reads.fastq ─── input_directory ─── input_directory 8 | ├── reads0.fastq ├── barcode01 9 | └── reads1.fastq │ ├── reads0.fastq 10 | │ └── reads1.fastq 11 | ├── barcode02 12 | │ ├── reads0.fastq 13 | │ ├── reads1.fastq 14 | │ └── reads2.fastq 15 | └── barcode03 16 | └── reads0.fastq 17 | ``` -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: local 3 | hooks: 4 | - id: docs_readme 5 | name: docs_readme 6 | entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_input_example 06_input_parameters 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json 7 | language: python 8 | always_run: true 9 | pass_filenames: false 10 | additional_dependencies: 11 | - epi2melabs==0.0.58 12 | - repo: https://github.com/pycqa/flake8 13 | rev: 5.0.4 14 | hooks: 15 
| - id: flake8 16 | pass_filenames: false 17 | additional_dependencies: 18 | - flake8-rst-docstrings 19 | - flake8-docstrings 20 | - flake8-import-order 21 | - flake8-forbid-visual-indent 22 | - pep8-naming 23 | - flake8-no-types 24 | - flake8-builtins 25 | - flake8-absolute-import 26 | - flake8-print 27 | # avoid snowballstemmer>=3.0 as it causes flake8-docstrings to stop working [CW-6098] 28 | - snowballstemmer==2.2.0 29 | args: [ 30 | "bin", 31 | "--import-order-style=google", 32 | "--statistics", 33 | "--max-line-length=88", 34 | "--per-file-ignores=bin/workflow_glue/models/*:NT001", 35 | ] 36 | -------------------------------------------------------------------------------- /test/run_fastq_ingress_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -exo pipefail 3 | 4 | get-test_data-from-aws () { 5 | # get aws-cli 6 | curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" 7 | unzip -q awscliv2.zip 8 | 9 | # get test data 10 | aws/dist/aws s3 cp --recursive --quiet \ 11 | "$S3_TEST_DATA" \ 12 | test_data_from_S3 13 | } 14 | 15 | fastq=$1 16 | wf_output_dir=$2 17 | sample_sheet=$3 18 | 19 | # `fastq` and `wf_output_dir` are required 20 | if ! [[ $# -eq 2 || $# -eq 3 ]]; then 21 | echo "Provide 2 or 3 arguments!" 
>&2 22 | exit 1 23 | fi 24 | 25 | # get test data from s3 if required 26 | if [[ $fastq =~ ^s3:// ]]; then 27 | get-test_data-from-aws 28 | fastq="$PWD/test_data_from_S3/${fastq#*test_data/}" 29 | [[ -n $sample_sheet ]] && 30 | sample_sheet="$PWD/test_data_from_S3/${sample_sheet#*test_data/}" 31 | fi 32 | 33 | # add CWD if paths are relative 34 | [[ ( $fastq != /* ) ]] && fastq="$PWD/$fastq" 35 | [[ ( $wf_output_dir != /* ) ]] && wf_output_dir="$PWD/$wf_output_dir" 36 | [[ ( -n $sample_sheet ) && ( $sample_sheet != /* ) ]] && 37 | sample_sheet="$PWD/$sample_sheet" 38 | 39 | # add flags to parameters (need an array for `fastq` here as there might be spaces in 40 | # the filename) 41 | fastq=("--fastq" "$fastq") 42 | wf_output_dir="--wf-output-dir $wf_output_dir" 43 | [[ -n $sample_sheet ]] && sample_sheet="--sample_sheet $sample_sheet" 44 | 45 | # get container hash from config 46 | img_hash=$(grep 'common_sha.\?=' nextflow.config | grep -oE 'sha[0-9,a-f,A-F]+') 47 | 48 | # run test 49 | docker run -v "$PWD":"$PWD" \ 50 | ontresearch/wf-common:"$img_hash" \ 51 | python "$PWD"/test/test_fastq_ingress.py "${fastq[@]}" $wf_output_dir $sample_sheet 52 | -------------------------------------------------------------------------------- /docs/04_install_and_run.md: -------------------------------------------------------------------------------- 1 | 2 | These are instructions for installing and running the workflow on the command line. 3 | You can also access the workflow via the 4 | [EPI2ME Desktop application](https://labs.epi2me.io/downloads/). 5 | 6 | The workflow uses [Nextflow](https://www.nextflow.io/) to manage 7 | compute and software resources; 8 | therefore Nextflow will need to be 9 | installed before attempting to run the workflow. 10 | 11 | The workflow can currently be run using either 12 | [Docker](https://docs.docker.com/get-started/) 13 | or [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) 14 | to provide isolation of the required software. 
15 | Both methods are automated out of the box provided 16 | either Docker or Singularity is installed. 17 | This is controlled by the 18 | [`-profile`](https://www.nextflow.io/docs/latest/config.html#config-profiles) 19 | parameter as exemplified below. 20 | 21 | It is not required to clone or download the git repository 22 | in order to run the workflow. 23 | More information on running EPI2ME workflows can 24 | be found on our [website](https://labs.epi2me.io/wfindex). 25 | 26 | The following command can be used to obtain the workflow. 27 | This will pull the repository into the assets folder of 28 | Nextflow and provide a list of all parameters 29 | available for the workflow as well as an example command: 30 | 31 | ``` 32 | nextflow run epi2me-labs/wf-16s --help 33 | ``` 34 | To update a workflow to the latest version on the command line, use 35 | the following command: 36 | ``` 37 | nextflow pull epi2me-labs/wf-16s 38 | ``` 39 | 40 | A demo dataset is provided for testing of the workflow. 41 | It can be downloaded and unpacked using the following commands: 42 | ``` 43 | wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz 44 | tar -xzvf wf-16s-demo.tar.gz 45 | ``` 46 | The workflow can then be run with the downloaded demo data using: 47 | ``` 48 | nextflow run epi2me-labs/wf-16s \ 49 | --fastq 'wf-16s-demo/test_data' \ 50 | --minimap2_by_reference \ 51 | -profile standard 52 | ``` 53 | 54 | For further information about running a workflow on 55 | the command line, see https://labs.epi2me.io/wfquickstart/ 56 | -------------------------------------------------------------------------------- /docs/10_FAQ.md: -------------------------------------------------------------------------------- 1 | If your question is not answered here, please report any issues or suggestions on the [GitHub issues](https://github.com/epi2me-labs/wf-16s/issues) page or start a discussion on the [community](https://community.nanoporetech.com/). 
2 | 3 | + *Which database is used by default?* - By default, the workflow uses the NCBI 16S + 18S rRNA database. It will be downloaded the first time the workflow is run and re-used in subsequent runs. 4 | 5 | + *Are more databases available?* - Other 16S databases (listed below) can be selected with the `database_set` parameter, but the workflow can also be used with a custom database if required (see [here](https://labs.epi2me.io/how-to-meta-offline/) for details). 6 | * 16S, 18S, ITS 7 | * ncbi_16s_18s and ncbi_16s_18s_28s_ITS: Archaeal, bacterial and fungal 16S/18S and ITS data. There are two databases available, built from [NCBI](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) data. 8 | * SILVA_138_1: The [SILVA](https://www.arb-silva.de/) database (version 138) is also available. Note that SILVA uses its own set of taxids, which do not match the NCBI taxids. We provide the respective taxdump files, but if you prefer using the NCBI ones, you can create them from the SILVA files ([NCBI](https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/ncbi/)). As the SILVA database only resolves taxonomy to genus level, the lowest taxonomic rank at which the analysis is carried out is genus (`taxonomic_rank G`). 9 | 10 | + *How can I use Kraken2 indexes?* - Prebuilt Kraken2 databases are available [here](https://benlangmead.github.io/aws-indexes/k2). 11 | 12 | + *How can I use custom databases?* - If you want to run the workflow using your own Kraken2 database, you'll need to provide the database and an associated taxonomy dump. For a custom Minimap2 reference database, you'll need to provide a reference FASTA (or MMI) and an associated ref2taxid file. For a guide on how to build and use custom databases, take a look at our [article on how to run wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 
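The custom-database answer above says a Minimap2 reference must come with a matching ref2taxid file, and the troubleshooting notes warn that the two have to be consistent. As a minimal, hypothetical sketch of what that consistency check could look like (this is not part of the workflow; the two-column, tab-separated `sequence_id<TAB>taxid` layout of the ref2taxid file is an assumption):

```python
# Sketch only: flag reference sequences that have no entry in a ref2taxid
# mapping. The two-column tab-separated ref2taxid layout is an assumption,
# not something taken from the workflow's own code.

def fasta_ids(lines):
    """Yield sequence IDs from FASTA header lines."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].split()[0]

def unmapped_ids(fasta_lines, ref2taxid_lines):
    """Return reference IDs that have no entry in the ref2taxid mapping."""
    mapped = {row.split("\t")[0] for row in ref2taxid_lines if row.strip()}
    return [sid for sid in fasta_ids(fasta_lines) if sid not in mapped]

# Toy example: seqB has no taxid entry, so it would be reported.
fasta = [">seqA some description", "ACGT", ">seqB", "GGCC"]
ref2taxid = ["seqA\t1280"]
print(unmapped_ids(fasta, ref2taxid))  # -> ['seqB']
```

Any ID reported this way means the reference and ref2taxid files are inconsistent, which is exactly the situation a custom Minimap2 database needs to avoid before a run is launched.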
13 | 14 | + *How can I run the workflow with less memory?* - 15 | When running the Kraken2 pipeline, you can enable the `kraken2_memory_mapping` parameter if the available memory is smaller than the size of the database. 16 | 17 | + *How can I run the workflow offline?* - To run wf-16s offline, you can use the workflow to download the databases from the internet and prepare them for offline re-use later. If you want to use one of the databases supported out of the box by the workflow, you can run the workflow with your desired database and any input (for example, the test data). The database will be downloaded and prepared in a directory on your computer. Once the database has been prepared, it will be used automatically the next time you run the workflow without needing to be downloaded again. You can find advice on picking a suitable database in our [article on selecting databases for wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 18 | 19 | + *When and how are coverage and identity filters applied when using the minimap2 approach?* - With minimap2-based classification, coverage and identity filtering are applied using the `min_ref_coverage` and `min_percent_identity` options respectively. All reads that mapped to a reference but failed to pass these filters are relabelled as unclassified. If the `include_read_assignments` option is used, tables in the output will show read classifications after this filtering step. However, the output BAM file always contains the raw minimap2 alignment results. To read more about both filters, see [minimap2 Options](#minimap2-options). 20 | 21 | -------------------------------------------------------------------------------- /docs/07_outputs.md: -------------------------------------------------------------------------------- 1 | Output files may be aggregated, including information for all samples, or provided per sample. Per-sample files will be prefixed with their respective aliases, represented below as {{ alias }}. 
2 | 3 | | Title | File path | Description | Per sample or aggregated | 4 | |-------|-----------|-------------|--------------------------| 5 | | workflow report | wf-16s-report.html | Report for all samples. | aggregated | 6 | | Abundance table with counts per taxon | abundance_table_{{ taxonomic_rank }}.tsv | Per-taxon counts TSV, including all samples. | aggregated | 7 | | Bracken report file | bracken/{{ alias }}.kraken2_bracken.report | TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report). | per-sample | 8 | | Kraken2 report file (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.report.txt | Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format). | per-sample | 9 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.assignments.tsv | TSV file with the taxonomic assignment per read. More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format). | per-sample | 10 | | Host BAM file | host_bam/{{ alias }}.bam | BAM file generated from mapping filtered input reads to the host reference. | per-sample | 11 | | BAM index file of host reads | host_bam/{{ alias }}.bai | BAM index file generated from mapping filtered input reads to the host reference. | per-sample | 12 | | BAM file (minimap2) | bams/{{ alias }}.reference.bam | BAM file generated from mapping filtered input reads to the reference. | per-sample | 13 | | BAM index file (minimap2) | bams/{{ alias }}.reference.bam.bai | Index file generated from mapping filtered input reads to the reference. 
| per-sample | 14 | | BAM flagstat (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv | Mapping results per reference. | per-sample | 15 | | Minimap2 alignment statistics (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz | Per-read statistics after alignment. | per-sample | 16 | | Reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz | Reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 17 | | Index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.fai | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 18 | | GZI index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.gzi | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 19 | | JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference. | aggregated | 20 | | Taxonomic assignment per read | reads_assignments/{{ alias }}.*.assignments.tsv | TSV file with the taxonomic assignment per read. | per-sample | 21 | | FASTQ of the selected taxids | extracted/{{ alias }}.minimap2.extracted.fastq | FASTQ containing/excluding the reads of the selected taxids. | per-sample | 22 | | Unclassified FASTQ | unclassified/{{ alias }}.unclassified.fq.gz | FASTQ containing the reads that have not been classified against the database. | per-sample | 23 | | Alignment statistics TSV | alignment_tables/{{ alias }}.alignment-stats.tsv | Coverage and taxonomy of each reference. 
| per-sample | 24 | -------------------------------------------------------------------------------- /.gitlab-ci.yml: -------------------------------------------------------------------------------- 1 | # Include shared CI 2 | include: 3 | - project: "epi2melabs/ci-templates" 4 | file: "wf-containers.yaml" 5 | 6 | variables: 7 | NF_BEFORE_SCRIPT: mkdir -p ${CI_PROJECT_NAME}/data/ && wget -O ${CI_PROJECT_NAME}/data/wf-16s-demo.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz && tar -xzvf ${CI_PROJECT_NAME}/data/wf-16s-demo.tar.gz -C ${CI_PROJECT_NAME}/data/ 8 | NF_WORKFLOW_OPTS: "--fastq ${CI_PROJECT_NAME}/data/wf-16s-demo/test_data/ 9 | --classifier minimap2 10 | --minimap2_by_reference 11 | --database_set ncbi_16s_18s" 12 | PYTEST_CONTAINER_NAME: "wf-metagenomics" 13 | NF_IGNORE_PROCESSES: "rebatchFastq" 14 | GIT_SUBMODULE_STRATEGY: recursive 15 | CI_FLAVOUR: "new" 16 | CWG_AWS_ENV_NAME: "stack" 17 | 18 | aws-run: 19 | variables: 20 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --store_dir s3://$${XAWS_BUCKET}/${CI_PROJECT_NAME}/store" 21 | NF_IGNORE_PROCESSES: "rebatchFastq" 22 | artifacts: 23 | when: always 24 | paths: 25 | - ${CI_PROJECT_NAME} 26 | - .nextflow.log 27 | exclude: [] # give me everything pal 28 | allow_failure: false 29 | 30 | 31 | docker-run: 32 | 33 | # Remove this directive in downstream templates 34 | tags: 35 | - large_ram 36 | 37 | # Define a 1D job matrix to inject a variable named MATRIX_NAME into 38 | # the CI environment, we can use the value of MATRIX_NAME to determine 39 | # which options to apply as part of the rules block below 40 | # NOTE There is a slightly cleaner way to define this matrix to include 41 | # the variables, but it is broken when using long strings! 
See CW-756 42 | parallel: 43 | matrix: 44 | - MATRIX_NAME: [ 45 | "kraken2", "minimap2", "minimap2-sample-sheet", 46 | "kraken2-bam", "minimap2-bam"] 47 | 48 | rules: 49 | - if: ($CI_COMMIT_BRANCH == null || $CI_COMMIT_BRANCH == "dev-template") 50 | when: never 51 | - if: $MATRIX_NAME == "kraken2" 52 | variables: 53 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/kraken_pipeline.nf" 54 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --classifier kraken2 --include_read_assignments" 55 | NF_IGNORE_PROCESSES: "" 56 | AFTER_NEXTFLOW_CMD: > 57 | if [ ! -f $$PWD/$$CI_PROJECT_NAME/wf-16s-report.html ]; then (echo -e "Report not found" && exit 1); fi 58 | - if: $MATRIX_NAME == "minimap2" 59 | variables: 60 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 61 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --minimap2_by_reference --keep_bam" 62 | NF_IGNORE_PROCESSES: "extractMinimap2Reads" 63 | AFTER_NEXTFLOW_CMD: > 64 | if [ ! -f $$PWD/$$CI_PROJECT_NAME/wf-16s-report.html ]; then (echo -e "Report not found" && exit 1); fi 65 | - if: $MATRIX_NAME == "minimap2-sample-sheet" 66 | variables: 67 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 68 | NF_WORKFLOW_OPTS: "--fastq test_data/case02 --sample_sheet test_data/case02/sample_sheet.csv --taxonomic_rank G --n_taxa_barplot 5 --abundance_threshold 1" 69 | NF_IGNORE_PROCESSES: "extractMinimap2Reads,getAlignmentStats" 70 | # BAM INGRESS 71 | # Compare counts with case01_no_duplicateIDs, must be the same 72 | - if: $MATRIX_NAME == "kraken2-bam" 73 | variables: 74 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/kraken_pipeline.nf" 75 | NF_WORKFLOW_OPTS: "--bam test_data/case05_bam --include_read_assignments --abundance_threshold 1 --classifier kraken2" 76 | NF_IGNORE_PROCESSES: "" 77 | ## Regular test minimap2 - mapping stats 78 | - if: $MATRIX_NAME == "minimap2-bam" 79 | variables: 80 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 81 | NF_WORKFLOW_OPTS: "--bam 
test_data/case05_bam --minimap2_by_reference --database_set ncbi_16s_18s --classifier minimap2" 82 | NF_IGNORE_PROCESSES: "extractMinimap2Reads" 83 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.yml: -------------------------------------------------------------------------------- 1 | name: Bug Report 2 | description: File a bug report 3 | labels: ["triage"] 4 | body: 5 | - type: markdown 6 | attributes: 7 | value: | 8 | Thanks for taking the time to fill out this bug report! 9 | 10 | 11 | - type: markdown 12 | attributes: 13 | value: | 14 | # Background 15 | - type: dropdown 16 | id: os 17 | attributes: 18 | label: Operating System 19 | description: What operating system are you running? 20 | options: 21 | - Windows 10 22 | - Windows 11 23 | - macOS 24 | - Ubuntu 22.04 25 | - CentOS 7 26 | - Other Linux (please specify below) 27 | validations: 28 | required: true 29 | - type: input 30 | id: other-os 31 | attributes: 32 | label: Other Linux 33 | placeholder: e.g. Fedora 38 34 | - type: input 35 | id: version 36 | attributes: 37 | label: Workflow Version 38 | description: This is most easily found in the workflow output log 39 | placeholder: v1.2.3 40 | validations: 41 | required: true 42 | - type: dropdown 43 | id: execution 44 | attributes: 45 | label: Workflow Execution 46 | description: Where are you running the workflow? 47 | options: 48 | - EPI2ME Desktop (Local) 49 | - EPI2ME Desktop (Cloud) 50 | - Command line (Local) 51 | - Command line (Cluster) 52 | - Other (please describe) 53 | validations: 54 | required: true 55 | - type: input 56 | id: other-workflow-execution 57 | attributes: 58 | label: Other workflow execution 59 | description: If "Other", please describe 60 | placeholder: Tell us where / how you are running the workflow. 61 | 62 | - type: markdown 63 | attributes: 64 | value: | 65 | # EPI2ME Desktop Application 66 | If you are using the application please provide the following. 
67 | - type: input 68 | id: labs-version 69 | attributes: 70 | label: EPI2ME Version 71 | description: Available from the application settings page. 72 | placeholder: v5.1.1 73 | validations: 74 | required: false 75 | 76 | 77 | - type: markdown 78 | attributes: 79 | value: | 80 | # Command-line execution 81 | If you are using nextflow on a command-line, please provide the following. 82 | - type: textarea 83 | id: cli-command 84 | attributes: 85 | label: CLI command run 86 | description: Please tell us the command you are running 87 | placeholder: e.g. nextflow run epi2me-labs/wf-human-variations -profile standard --fastq my-reads/fastq 88 | validations: 89 | required: false 90 | - type: dropdown 91 | id: profile 92 | attributes: 93 | label: Workflow Execution - CLI Execution Profile 94 | description: Which execution profile are you using? If you are using a custom profile or nextflow configuration, please give details below. 95 | options: 96 | - standard (default) 97 | - singularity 98 | - custom 99 | validations: 100 | required: false 101 | 102 | 103 | - type: markdown 104 | attributes: 105 | value: | 106 | # Report details 107 | - type: textarea 108 | id: what-happened 109 | attributes: 110 | label: What happened? 111 | description: Also tell us, what did you expect to happen? 112 | placeholder: Tell us what you see! 113 | validations: 114 | required: true 115 | - type: textarea 116 | id: logs 117 | attributes: 118 | label: Relevant log output 119 | description: For CLI execution please include the full output from running nextflow. For execution from the EPI2ME application please copy the contents of the "Workflow logs" panel from the "Logs" tab corresponding to your workflow instance. (This will be automatically formatted into code, so no need for backticks). 
120 | render: shell 121 | validations: 122 | required: true 123 | - type: textarea 124 | id: activity-log 125 | attributes: 126 | label: Application activity log entry 127 | description: For use with the EPI2ME application please see the Settings > View Activity Log page, and copy the contents of any items listed in red using the Copy to clipboard button. 128 | render: shell 129 | validations: 130 | required: false 131 | - type: dropdown 132 | id: run-demo 133 | attributes: 134 | label: Were you able to successfully run the latest version of the workflow with the demo data? 135 | description: For CLI execution, were you able to successfully run the workflow using the demo data available in the [Install and run](./README.md#install-and-run) section of the `README.md`? For execution in the EPI2ME application, were you able to successfully run the workflow via the "Use demo data" button? 136 | options: 137 | - 'yes' 138 | - 'no' 139 | - other (please describe below) 140 | validations: 141 | required: true 142 | - type: textarea 143 | id: demo-other 144 | attributes: 145 | label: Other demo data information 146 | render: shell 147 | validations: 148 | required: false 149 | 150 | -------------------------------------------------------------------------------- /nextflow.config: -------------------------------------------------------------------------------- 1 | // 2 | // Notes to End Users. 3 | // 4 | // The workflow should run without editing this configuration file, 5 | // however there may be instances in which you wish to edit this 6 | // file for compute performance or other reasons. Please see: 7 | // 8 | // https://nextflow.io/docs/latest/config.html#configuration 9 | // 10 | // for further help editing this file. 
11 | 12 | 13 | params { 14 | help = false 15 | version = false 16 | fastq = null 17 | bam = null 18 | sample = null 19 | sample_sheet = null 20 | classifier = "minimap2" 21 | exclude_host = null 22 | // Advanced_options 23 | max_len = 2000 24 | min_len = 800 25 | min_read_qual = null 26 | threads = 4 27 | // Databases 28 | taxonomy = null 29 | reference = null 30 | ref2taxid = null 31 | database = null 32 | taxonomic_rank = 'G' 33 | // Minimap 34 | minimap2filter = null 35 | minimap2exclude = false 36 | keep_bam = false 37 | minimap2_by_reference = false 38 | min_percent_identity = 95 39 | min_ref_coverage = 90 40 | // Output 41 | store_dir = "store_dir" 42 | out_dir = "output" 43 | include_read_assignments = false 44 | // Extra features 45 | igv = false 46 | output_unclassified = false 47 | // Kraken 48 | bracken_length = null 49 | bracken_threshold = 10 50 | kraken2_memory_mapping = false 51 | kraken2_confidence = 0 52 | // Databases 53 | database_set = "ncbi_16s_18s" 54 | database_sets = [ 55 | 'ncbi_16s_18s': [ 56 | 'reference': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna', 57 | // database already includes kmer_dist_file 58 | 'database': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz', 59 | 'ref2taxid': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ref2taxid.targloci.tsv', 60 | 'taxonomy': 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2025-01-01.zip' 61 | ], 62 | 'ncbi_16s_18s_28s_ITS': [ 63 | 'reference': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna', 64 | 'database': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS_kraken2.tar.gz', 65 | 'ref2taxid': 
'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ref2taxid.ncbi_16s_18s_28s_ITS.tsv', 66 | 'taxonomy': 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2025-01-01.zip' 67 | ], 68 | 'SILVA_138_1': [ 69 | // It uses the taxids from the Silva database, which don't match the taxids from NCBI 70 | // Database created from scratch using the kraken2-build command. It automatically downloads the files. 71 | 'database': null 72 | ] 73 | ] 74 | // Report options 75 | abundance_threshold = 1 76 | n_taxa_barplot = 9 77 | // AMR options 78 | amr = false 79 | amr_db = "resfinder" 80 | amr_minid = 80 81 | amr_mincov = 80 82 | // AWS 83 | aws_image_prefix = null 84 | aws_queue = null 85 | // Other options 86 | disable_ping = false 87 | monochrome_logs = false 88 | validate_params = true 89 | show_hidden_params = false 90 | analyse_unclassified = false 91 | schema_ignore_params = 'show_hidden_params,validate_params,monochrome_logs,aws_queue,aws_image_prefix,wf,database_sets,amr,amr_db,amr_minid,amr_mincov' 92 | 93 | // Workflow images 94 | wf { 95 | example_cmd = [ 96 | "--fastq 'wf-16s-demo/test_data'", 97 | "--minimap2_by_reference" 98 | ] 99 | agent = null 100 | container_sha = "sha1d71a4d15f57a1c32aacdb94cacdeb268205548e" 101 | common_sha = "sha72f3517dd994984e0e2da0b97cb3f23f8540be4b" 102 | } 103 | } 104 | 105 | 106 | manifest { 107 | name = 'epi2me-labs/wf-16s' 108 | author = 'Oxford Nanopore Technologies' 109 | homePage = 'https://github.com/epi2me-labs/wf-16s' 110 | description = 'Taxonomic classification of 16S rRNA gene sequencing data.' 111 | mainScript = 'main.nf' 112 | nextflowVersion = '>=23.04.2' 113 | version = 'v1.6.0' 114 | } 115 | 116 | 117 | epi2melabs { 118 | tags = "wf-16s,targeted,16S,18S,ITS,bacteria,fungi,metagenomics" 119 | } 120 | 121 | // used by default for "standard" (docker) and singularity profiles, 122 | // other profiles may override. 
123 | process { 124 | withLabel:wfmetagenomics { 125 | container = "ontresearch/wf-metagenomics:${params.wf.container_sha}" 126 | } 127 | withLabel:wf_common { 128 | container = "ontresearch/wf-common:${params.wf.common_sha}" 129 | } 130 | shell = ['/bin/bash', '-euo', 'pipefail'] 131 | } 132 | 133 | 134 | profiles { 135 | // the "standard" profile is used implicitly by nextflow 136 | // if no other profile is given on the CLI 137 | standard { 138 | docker { 139 | enabled = true 140 | // this ensures container is run as host user and group, but 141 | // also adds host user to the within-container group 142 | runOptions = "--user \$(id -u):\$(id -g) --group-add 100" 143 | } 144 | } 145 | 146 | // using singularity instead of docker 147 | singularity { 148 | singularity { 149 | enabled = true 150 | autoMounts = true 151 | } 152 | } 153 | 154 | 155 | conda { 156 | conda.enabled = true 157 | } 158 | 159 | // Using AWS batch. 160 | // May need to set aws.region and aws.batch.cliPath 161 | awsbatch { 162 | process { 163 | executor = 'awsbatch' 164 | queue = "${params.aws_queue}" 165 | memory = '16G' 166 | withLabel:wfmetagenomics { 167 | container = "${params.aws_image_prefix}-wf-metagenomics:${params.wf.container_sha}" 168 | } 169 | withLabel:wf_common { 170 | container = "${params.aws_image_prefix}-wf-common:${params.wf.common_sha}" 171 | } 172 | shell = ['/bin/bash', '-euo', 'pipefail'] 173 | } 174 | } 175 | 176 | // local profile for simplified development testing 177 | local { 178 | process.executor = 'local' 179 | } 180 | } 181 | 182 | 183 | timeline { 184 | enabled = true 185 | overwrite = true 186 | file = "${params.out_dir}/execution/timeline.html" 187 | } 188 | report { 189 | enabled = true 190 | overwrite = true 191 | file = "${params.out_dir}/execution/report.html" 192 | } 193 | trace { 194 | enabled = true 195 | overwrite = true 196 | file = "${params.out_dir}/execution/trace.txt" 197 | } 198 | 199 | env { 200 | PYTHONNOUSERSITE = 1 201 | JAVA_TOOL_OPTIONS 
= "-Xlog:disable -Xlog:all=warning:stderr" 202 | } 203 | -------------------------------------------------------------------------------- /output_definition.json: -------------------------------------------------------------------------------- 1 | { 2 | "files": { 3 | "workflow-report": { 4 | "filepath": "wf-16s-report.html", 5 | "title": "workflow report", 6 | "description": "Report for all samples.", 7 | "mime-type": "text/html", 8 | "optional": false, 9 | "type": "aggregated" 10 | }, 11 | "abundance-table-rank": { 12 | "filepath": "abundance_table_{{ taxonomic_rank }}.tsv", 13 | "title": "Abundance table with counts per taxon", 14 | "description": "Per-taxon counts TSV, including all samples.", 15 | "mime-type": "text/tab-separated-values", 16 | "optional": false, 17 | "type": "aggregated" 18 | }, 19 | "bracken-report": { 20 | "filepath": "bracken/{{ alias }}.kraken2_bracken.report", 21 | "title": "Bracken report file", 22 | "description": "TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report).", 23 | "mime-type": "text/tab-separated-values", 24 | "optional": true, 25 | "type": "per-sample" 26 | }, 27 | "kraken-report": { 28 | "filepath": "kraken2/{{ alias }}.kraken2.report.txt", 29 | "title": "Kraken2 taxonomic assignment per read (Kraken2 pipeline)", 30 | "description": "Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format).", 31 | "mime-type": "text/txt", 32 | "optional": true, 33 | "type": "per-sample" 34 | }, 35 | "kraken-assignments": { 36 | "filepath": "kraken2/{{ alias }}.kraken2.assignments.tsv", 37 | "title": "Kraken2 taxonomic assignment per read (Kraken2 pipeline)", 38 | "description": "TSV file with the taxonomic assignment per read. 
More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format).", 39 | "mime-type": "text/tab-separated-values", 40 | "optional": true, 41 | "type": "per-sample" 42 | }, 43 | "host-bam": { 44 | "filepath": "host_bam/{{ alias }}.bam", 45 | "title": "Host BAM file", 46 | "description": "BAM file generated from mapping filtered input reads to the host reference.", 47 | "mime-type": "application/gzip", 48 | "optional": true, 49 | "type": "per-sample" 50 | }, 51 | "host-bai": { 52 | "filepath": "host_bam/{{ alias }}.bai", 53 | "title": "BAM index file of host reads", 54 | "description": "BAM index file generated from mapping filtered input reads to the host reference.", 55 | "mime-type": "application/octet-stream", 56 | "optional": true, 57 | "type": "per-sample" 58 | }, 59 | "minimap2-bam": { 60 | "filepath": "bams/{{ alias }}.reference.bam", 61 | "title": "BAM file (minimap2)", 62 | "description": "BAM file generated from mapping filtered input reads to the reference.", 63 | "mime-type": "application/gzip", 64 | "optional": true, 65 | "type": "per-sample" 66 | }, 67 | "minimap2-index": { 68 | "filepath": "bams/{{ alias }}.reference.bam.bai", 69 | "title": "BAM index file (minimap2)", 70 | "description": "Index file generated from mapping filtered input reads to the reference.", 71 | "mime-type": "application/octet-stream", 72 | "optional": true, 73 | "type": "per-sample" 74 | }, 75 | "minimap2-flagstats": { 76 | "filepath": "bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv", 77 | "title": "BAM flagstat (minimap2)", 78 | "description": "Mapping results per reference", 79 | "mime-type": "text/tab-separated-values", 80 | "optional": true, 81 | "type": "per-sample" 82 | }, 83 | "minimap2-bamreadstats": { 84 | "filepath": "bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz", 85 | "title": "Minimap2 alignment statistics (minimap2)", 86 | "description": "Per read stats after 
aligning", 87 | "mime-type": "application/gzip", 88 | "optional": true, 89 | "type": "per-sample" 90 | }, 91 | "reduced-reference": { 92 | "filepath": "igv_reference/reduced_reference.fasta.gz", 93 | "title": "Reduced reference FASTA file", 94 | "description": "Reference FASTA file containing only those sequences that have reads mapped against them.", 95 | "mime-type": "application/gzip", 96 | "optional": true, 97 | "type": "aggregated" 98 | }, 99 | "reduced-reference-index": { 100 | "filepath": "igv_reference/reduced_reference.fasta.gz.fai", 101 | "title": "Index of the reduced reference FASTA file", 102 | "description": "Index of the reference FASTA file containing only those sequences that have reads mapped against them.", 103 | "mime-type": "text/tab-separated-values", 104 | "optional": true, 105 | "type": "aggregated" 106 | }, 107 | "reduced-reference-gzi-index": { 108 | "filepath": "igv_reference/reduced_reference.fasta.gz.gzi", 109 | "title": "GZI index of the reduced reference FASTA file", 110 | "description": "Index of the reference FASTA file containing only those sequences that have reads mapped against them.", 111 | "mime-type": "application/octet-stream", 112 | "optional": true, 113 | "type": "aggregated" 114 | }, 115 | "igv-config": { 116 | "filepath": "igv.json", 117 | "title": "JSON configuration file for IGV browser", 118 | "description": "JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference.", 119 | "mime-type": "text/json", 120 | "optional": true, 121 | "type": "aggregated" 122 | }, 123 | "read-assignments": { 124 | "filepath": "reads_assignments/{{ alias }}.*.assignments.tsv", 125 | "title": "Taxonomic assignment per read.", 126 | "description": "TSV file with the taxonomic assignment per read.", 127 | "mime-type": "text/tab-separated-values", 128 | "optional": true, 129 | "type": "per-sample" 130 | }, 131 | "extracted-fastq": { 132 | "filepath": "extracted/{{ alias }}.minimap2.extracted.fastq", 
133 | "title": "FASTQ of the selected taxids.", 134 | "description": "FASTQ containing/excluding the reads of the selected taxids.", 135 | "mime-type": "text", 136 | "optional": true, 137 | "type": "per-sample" 138 | }, 139 | "unclassified-fastq": { 140 | "filepath": "unclassified/{{ alias }}.unclassified.fq.gz", 141 | "title": "Unclassified FASTQ.", 142 | "description": "FASTQ containing the reads that have not been classified against the database.", 143 | "mime-type": "application/gzip", 144 | "optional": true, 145 | "type": "per-sample" 146 | }, 147 | "alignment-table": { 148 | "filepath": "alignment_tables/{{ alias }}.alignment-stats.tsv", 149 | "title": "Alignment statistics TSV", 150 | "description": "Coverage and taxonomy of each reference.", 151 | "mime-type": "text/tab-separated-values", 152 | "optional": true, 153 | "type": "per-sample" 154 | } 155 | } 156 | } -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | 4 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), 5 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 6 | 7 | 8 | ## [v1.6.0] 9 | This release of wf-16s updates documentation to include guidance for analysis of ITS amplicons with the SQK-MAB114 kit. Additionally, this version of wf-16s fixes issues with missing files and division by zero, which were triggered when input data coverage was very low. This release removes the real time analysis options to simplify the workflow; new solutions for real time taxonomic classification are in development but users who wish to continue using this functionality will need to pin wf-16s to v1.5.0. 
10 | ### Changed 11 | - Update wf-metagenomics to [v2.14.0](https://github.com/epi2me-labs/wf-metagenomics/blob/master/CHANGELOG.md#v2140): 12 | - Update wf-template to v5.6.2, which changes: 13 | - Reduce verbosity of debug logging from fastcat which can occasionally occlude errors found in FASTQ files during ingress. 14 | - Log banner art to say "EPI2ME" instead of "EPI2ME Labs" to match current branding. This has no effect on the workflow outputs. 15 | - pre-commit configuration to resolve an internal dependency problem with flake8. This has no effect on the workflow. 16 | - Values in the diversity table appear as None if there are no reads in the sample. 17 | - Values in the abundance table are now integers instead of floats. 18 | - Samples with fewer than 50% of the median read count across all samples are excluded from the rarefaction table. This is to avoid the rest of the samples being rarefied to a very low number of reads, which would lead to a loss of information. 19 | ### Added 20 | - Section in the README about presets for analysing ITS sequencing. 21 | ### Fixed 22 | - Update to wf-metagenomics [v2.14.0](https://github.com/epi2me-labs/wf-metagenomics/blob/master/CHANGELOG.md#v2140): 23 | - Update wf-template to v5.6.2, which fixes: 24 | - Sequence summary read length N50 incorrectly displayed minimum read length, it now correctly shows the N50. 25 | - Sequence summary component alignment and coverage plots failed to plot under some conditions. 26 | - Missing output file containing per-read assignments after identity and coverage filters when using include_read_assignments with the minimap2 subworkflow; this table is now correctly published to {alias}_lineages.minimap2.assignments.tsv. 27 | - Missing output file(s) encountered in the prepare_databases:determine_bracken_length process when using the bracken_length option. 28 | - Missing output file(s) encountered in the minimap_pipeline:getAlignmentStats process when all reads are unclassified. 
29 | - pandas.errors.EmptyDataError encountered in the getAlignmentStats process when reference coverage does not reach 1x 30 | - ZeroDivisionError: division by zero encountered in the progressive_bracken process when there are no taxa identified at all. 31 | - Versions of some tools were not properly displayed in the report. 32 | - raise ValueError("All objects passed were None") caused by all samples containing zero classified reads after applying bracken threshold. 33 | 34 | ### Removed 35 | - Real time functionality has been removed to simplify the workflow. The following parameters have been removed as they are no longer required: `server_threads`, `kraken_clients`, `port`, `host`, `external_kraken2`, `batch_size`, `real_time`, `read_limit`. Using these parameters in v1.6.0 onwards will cause an error. 36 | - Update image to remove kraken2-server dependency as it was only required by the real time workflow. 37 | 38 | ## [v1.5.0] 39 | ### Changed 40 | - Bump to wf-metagenomics v2.13.0 41 | - NCBI Taxonomy database updated to the 2025-01-01 release 42 | - Reconciled workflow with wf-template v5.5.0. 43 | - Fix error: bracken-build: line 231: syntax error: unexpected end of file when using SILVA database. 44 | ### Added 45 | - `output_unclassified` parameter. When True, output unclassified FASTQ sequences for both minimap2 and kraken2 modes (default: False). 46 | - Table with alignment stats is now an output: alignment_tables/{{ alias }}.alignment-stats.tsv 47 | 48 | ## [v1.4.0] 49 | ### Changed 50 | - Bump to wf-metagenomics v2.12.0 51 | ### Added 52 | - `bracken_threshold` parameter to adjust bracken minimum read threshold, default 10. 53 | 54 | ## [v1.3.0] 55 | ### Fixed 56 | - Switch to markdown links in the outputs table in the README. 57 | - Exclude samples if all the reads are removed during host depletion. 58 | ### Added 59 | - `igv` option to enable IGV in the EPI2ME Desktop Application. 
60 | - `include_read_assignments` option to output a file with the taxonomy of each read. 61 | - `Reads` section in the report to track the number of reads after filtering and host depletion, as well as unclassified reads. 62 | ### Changed 63 | - Bump to wf-metagenomics v2.11.0 64 | - `keep_bam` is now only required to output BAM files. 65 | - `include_kraken2_assignments` has been replaced by `include_read_assignments`. 66 | - Update databases: 67 | - Taxonomy database to the one released 2024-09-01 68 | ### Removed 69 | - `split-prefix` parameter, as the workflow automatically enables this option for large reference genomes. 70 | - Plot showing number of reads per sample has been replaced by a new table in `Reads` section. 71 | 72 | ## [v1.2.0] 73 | ### Added 74 | - Output IGV configuration file if the `keep_bam` option is enabled and a custom reference is provided (in minimap2 mode). 75 | - Output reduced reference file if the `keep_bam` option is enabled (in minimap2 mode). 76 | - `abundance_threshold` reduces the number of references to be displayed in IGV. 77 | ### Fixed 78 | - `exclude-host` can accept a file in the EPI2ME Desktop Application. 79 | ### Changed 80 | - Bump to wf-metagenomics v2.10.0 81 | 82 | ## [v1.1.3] 83 | ### Added 84 | - Reads below the percent identity (`min_percent_identity`) and reference coverage (`min_ref_coverage`) thresholds are considered unclassified in the minimap2 approach. 85 | ### Fixed 86 | - Files that are empty following the fastcat filtering are discarded from downstream analyses. 
87 | ### Changed 88 | - Bump to wf-metagenomics v2.9.4 89 | - `bam` folder within output has been renamed to `bams` 90 | 91 | ## [v1.1.2] 92 | ### Fixed 93 | - "Can only use .dt accessor with datetimelike values" error in makeReport 94 | - "invalid literal for int() with base 10" error in makeReport 95 | ### Changed 96 | - Bump to wf-metagenomics v2.9.2 97 | 98 | ## [v1.1.1] 99 | ### Changed 100 | - Bump to wf-metagenomics v2.9.1 101 | 102 | ## [v1.1.0] 103 | ### Added 104 | - Workflow now accepts BAM or FASTQ files as input (using the `--bam` or `--fastq` parameters, respectively). 105 | ### Changed 106 | - Bump to wf-metagenomics v2.9.0 107 | - Default for `--n_taxa_barplot` increased from 8 to 9. 108 | 109 | ## [v1.0.0] 110 | ### Changed 111 | - Bump to wf-metagenomics v2.8.0 112 | - Update docs 113 | 114 | ## [v0.0.4] 115 | ### Changed 116 | - Bump to wf-metagenomics v2.7.0 117 | - Fixed CHANGELOG format 118 | 119 | ## [v0.0.3] 120 | ### Changed 121 | - Bump to wf-metagenomics v2.6.1 122 | 123 | ## [v0.0.2] 124 | ### Changed 125 | - Bump to wf-metagenomics v2.6.0 126 | 127 | ## [v0.0.1] 128 | - First release. -------------------------------------------------------------------------------- /docs/08_pipeline_overview.md: -------------------------------------------------------------------------------- 1 | 2 | ### Workflow defaults and parameters 3 | The workflow sets default values for parameters optimised for the analysis of full-length 16S rRNA gene amplicons, including `min_len`, `max_len`, `min_ref_coverage`, and `min_percent_identity`. 4 | Descriptions of the parameters and their defaults can be found in the [input parameters section](#input-parameters). 5 | 6 | #### Analysing ITS amplicons 7 | For analysis of ITS amplicons users should adjust the following parameters: 8 | - `min_len` should be decreased to 300, as ITS amplicons may be shorter than the current `min_len` default value which will cause them to be excluded. 
9 | - `database_set` should be changed to `ncbi_16s_18s_28s_ITS` or a [custom database](#faqs) containing the relevant ITS references. 10 | 11 | ### 1. Concatenate input files and generate per read stats 12 | 13 | [fastcat](https://github.com/epi2me-labs/fastcat) is used to concatenate input FASTQ files prior to downstream processing by the workflow. It will also output per-read stats including read lengths and average qualities. 14 | 15 | You may want to choose which reads are analysed by filtering them using the flags `max_len`, `min_len` and `min_read_qual`. 16 | 17 | ### 2. Remove host sequences (optional) 18 | 19 | We have included an optional filtering step to remove any host sequences that map (using [Minimap2](https://github.com/lh3/minimap2)) against a provided host reference (e.g. human), which can be a FASTA file or an MMI index. To use this option, provide the path to your host reference with the `exclude_host` parameter. The mapped reads are output in a BAM file and excluded from further analysis. 20 | 21 | ``` 22 | nextflow run epi2me-labs/wf-16s --fastq test_data/case04/reads.fastq.gz --exclude_host test_data/case04/host.fasta.gz 23 | ``` 24 | 25 | ### 3. Classify reads taxonomically 26 | 27 | There are two different approaches to taxonomic classification: 28 | 29 | #### 3.1 Using Minimap2 30 | 31 | [Minimap2](https://github.com/lh3/minimap2) provides better resolution but, depending on the reference database used, can take significantly more time. This is the default option. 32 | 33 | ``` 34 | nextflow run epi2me-labs/wf-16s --fastq test_data/case01 --classifier minimap2 35 | ``` 36 | 37 | The creation of alignment statistics plots can be enabled with the `minimap2_by_reference` flag. Using this option produces a table and scatter plot in the report showing sequencing depth and coverage of each reference. 
The report also contains a heatmap indicating the sequencing depth over relative genomic coordinates for the references with the highest coverage (references with a mean coverage of less than 1% of the one with the largest value are omitted). 38 | 39 | In addition, the user can output BAM files in a folder called `bams` by using the option `keep_bam`. If the user provides a custom database and uses the `igv` option, the workflow will also output the references with mapped reads, as well as an IGV configuration file. This configuration file allows the user to view the alignments in the EPI2ME Desktop Application in the Viewer tab. Note that the number of references can be reduced using the `abundance_threshold` option, which selects only those references with more aligned reads than this value. Please note that the alignment view is highly dependent on the selected reference. 40 | 41 | #### 3.2 Using Kraken2 42 | 43 | [Kraken2](https://github.com/DerrickWood/kraken2) provides the fastest method for taxonomic classification of the reads. [Bracken](https://github.com/jenniferlu717/Bracken) is then used to estimate abundance at the genus level (or the selected taxonomic rank) in the sample. 44 | 45 | ### 4. Output 46 | 47 | The main output of the wf-16s pipeline is the `wf-16s-report.html`, which can be found in the output directory. It contains a summary of read statistics, the taxonomic composition of the sample and some diversity metrics. The results shown in the report can also be customised with several options. For example, you can use `abundance_threshold` to remove all taxa less prevalent than the threshold from the abundance table. When setting this parameter to a natural number, taxa with fewer absolute counts are removed. You can also pass a decimal between 0.0 and 1.0 to drop taxa of lower relative abundance. 
Furthermore, `n_taxa_barplot` controls the number of taxa displayed in the bar plot and groups the rest under the category ‘Other’. 48 | 49 | You can use the flag `include_read_assignments` to include a per-sample TSV file that indicates how each input sequence was classified, as well as the taxon that has been assigned to each read. 50 | 51 | For more information about remaining workflow outputs, please see [minimap2 Options](#minimap2-options). 52 | 53 | ### 5. Diversity indices 54 | 55 | Species diversity refers to the taxonomic composition in a specific microbial community. There are some useful concepts to take into account: 56 | * Richness: the number of unique taxonomic groups present in the community. 57 | * Taxonomic group abundance: the number of individuals of a particular taxonomic group present in the community. 58 | * Evenness: the equitability of the different taxonomic groups in terms of their abundances. 59 | Two different communities can host the same number of different taxonomic groups (i.e. they have the same richness), but they can have different evenness. This is the case, for instance, when one taxon's abundance is much larger in one community than in the other. 60 | 61 | There are three types of biodiversity measures described over a spatial scale [1](https://doi.org/10.2307/1218190), [2](https://doi.org/10.1016/B978-0-12-384719-5.00036-8): alpha-, beta-, and gamma-diversity. 62 | * Alpha-diversity refers to the richness that occurs within a community in a given area of a region. 63 | * Beta-diversity, defined as the variation in the identities of species among sites, provides a direct link between biodiversity at local scales (alpha diversity) and the broader regional species pool (gamma diversity). 64 | * Gamma-diversity is the total observed richness within an entire region. 
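The distinction between richness and evenness can be made concrete with a small numeric sketch. This is illustrative Python only (not part of the workflow), computing the Shannon and Pielou indices defined later in this section for two hypothetical communities:

```python
import math

def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou(counts):
    """Pielou evenness J = H / ln(S), where S is the observed richness."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)

# Two communities with the same richness (4 taxa) but different evenness.
community_a = [25, 25, 25, 25]   # perfectly even
community_b = [85, 5, 5, 5]      # one dominant taxon

print(round(pielou(community_a), 2))  # 1.0
print(round(pielou(community_b), 2))  # 0.42
```

Both communities contain four taxa (identical richness), but the second scores much lower on evenness because a single taxon dominates.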
65 | 66 | To provide a quick overview of the alpha-diversity of the microbial community, we provide some of the most common diversity metrics calculated for a specific taxonomic rank [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4224527/), which can be chosen by the user with the `taxonomic_rank` parameter ('D'=Domain, 'P'=Phylum, 'C'=Class, 'O'=Order, 'F'=Family, 'G'=Genus, 'S'=Species). By default, the rank is 'G' (genus-level). Some of the included alpha diversity metrics are: 67 | 68 | * Shannon Diversity Index (H): Shannon entropy approaches zero if a community is almost entirely made up of a single taxon. 69 | 70 | ```math 71 | H = -\sum_{i=1}^{S}p_i*ln(p_i) 72 | ``` 73 | 74 | * Simpson's Diversity Index (D): As defined here, D ranges from 0 (high diversity) to 1 (low diversity, i.e. a single dominant taxon). 75 | 76 | ```math 77 | D = \sum_{i=1}^{S}p_i^2 78 | ``` 79 | 80 | * Pielou Index (J): The values range from 0 (presence of a dominant species) to 1 (maximum evenness). 81 | 82 | ```math 83 | J = H/ln(S) 84 | ``` 85 | 86 | * Berger-Parker dominance index (BP): expresses the proportional importance of the most abundant type, i.e., the ratio of the number of individuals of the most abundant species to the total number of individuals of all species in the sample. 87 | 88 | ```math 89 | BP = n_i/N 90 | ``` 91 | where $`n_i`$ refers to the counts of the most abundant taxon and N is the total number of counts. 92 | 93 | 94 | * Fisher’s alpha: Fisher (see Fisher, 1943[4](https://doi.org/10.2307/1411)) noticed that only a few species tend to be abundant while most are represented by only a few individuals ('rare biosphere'). These differences in species abundance can be incorporated into species diversity measurements such as Fisher’s alpha. This index is based upon the log-series distribution of the number of individuals of different species. 95 | 96 | ```math 97 | S = \alpha * ln(1 + N/\alpha) 98 | ``` 99 | where S is the total number of taxa and N is the total number of individuals in the sample. 
The value of Fisher's $`\alpha`$ is calculated by iteration. 100 | 101 | These indices are calculated by default using the original abundance table (see McMurdie and Holmes[5](https://pubmed.ncbi.nlm.nih.gov/24699258/), 2014 and Willis[6](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02407/full), 2019). If you want to calculate them from a rarefied abundance table (i.e. all samples have been subsampled to the same number of counts per sample, which is 95% of the minimum total count), you can download the rarefied table from the report. 102 | 103 | The report also includes a rarefaction curve per sample, which displays the mean species richness for a subsample of reads (the sample size). Generally, this curve grows rapidly at first, as the most abundant species are sequenced and add new taxa to the community, and then flattens because 'rare' species are less likely to be sampled, making further increases in the number of observed species harder to detect. 104 | 105 | > Note: Within each rank, each named taxon is a unique unit. The counts are the number of reads assigned to that taxon. All `Unknown` sequences are considered as a single taxon. 106 | -------------------------------------------------------------------------------- /docs/06_input_parameters.md: -------------------------------------------------------------------------------- 1 | ### Input Options 2 | 3 | | Nextflow parameter name | Type | Description | Help | Default | 4 | |--------------------------|------|-------------|------|---------| 5 | | fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. 
In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 6 | | bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 7 | | classifier | string | Kraken2 or Minimap2 workflow to be used for classification of reads. | Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info. | minimap2 | 8 | | analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False | 9 | | exclude_host | string | A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis. | | | 10 | 11 | 12 | ### Sample Options 13 | 14 | | Nextflow parameter name | Type | Description | Help | Default | 15 | |--------------------------|------|-------------|------|---------| 16 | | sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed. | | 17 | | sample | string | A single sample name for non-multiplexed data. 
Permissible if passing a single .fastq(.gz) file or a directory of .fastq(.gz) files. | | | 18 | 19 | 20 | ### Reference Options 21 | 22 | | Nextflow parameter name | Type | Description | Help | Default | 23 | |--------------------------|------|-------------|------|---------| 24 | | database_set | string | Sets the reference, database and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. The workflow requires available memory to be slightly higher than the size of the database. | This setting can be overridden by providing an explicit taxonomy, database or reference path in the other reference options. | ncbi_16s_18s | 25 | | database | string | Not required, but can be used to specifically override the Kraken2 database [.tar.gz or directory]. | By default, the database chosen with the `database_set` parameter is used. | | 26 | | taxonomy | string | Not required, but can be used to specifically override the taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory]. | By default, the NCBI taxonomy file will be downloaded and used. | | 27 | | reference | string | Override the FASTA reference file selected by the `database_set` parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index. | This option should be used in conjunction with the `ref2taxid` parameter to specify a custom database. | | 28 | | ref2taxid | string | Not required, but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row. | By default, the ref2taxid mapping for the option chosen with the `database_set` parameter is used. | | 29 | | taxonomic_rank | string | Returns results at the chosen taxonomic rank. In the Kraken2 pipeline, this sets the level at which Bracken will estimate abundance. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species). 
| | G | 30 | 31 | 32 | ### Kraken2 Options 33 | 34 | | Nextflow parameter name | Type | Description | Help | Default | 35 | |--------------------------|------|-------------|------|---------| 36 | | bracken_length | integer | Set the length value Bracken will use. | Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically: ncbi_16s_18s = 1000, ncbi_16s_18s_28s_ITS = 1000, PlusPF-8 = 300. | | 37 | | bracken_threshold | integer | Set the minimum read threshold Bracken will use to consider a taxon. | Bracken will only consider taxa with a read count greater than or equal to this value. | 10 | 38 | | kraken2_memory_mapping | boolean | Avoid loading the database into RAM. | Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. It may be useful if the available RAM is lower than the size of the chosen database. | False | 39 | | kraken2_confidence | number | Kraken2 confidence score threshold. Valid interval: 0-1. | Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works. | 0.0 | 40 | 41 | 42 | ### Minimap2 Options 43 | 44 | | Nextflow parameter name | Type | Description | Help | Default | 45 | |--------------------------|------|-------------|------|---------| 46 | | minimap2filter | string | Filter minimap2 output by taxids, including child nodes, e.g. "9606,1404". | Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs. | | 47 | | minimap2exclude | boolean | Invert minimap2filter and exclude the given taxids instead. | Exclude a list of taxids from analysis outputs. | False | 48 | | keep_bam | boolean | Copy BAM files into the output directory. 
| | False | 49 | | minimap2_by_reference | boolean | Add a table with the mean sequencing depth per reference, along with its standard deviation and coefficient of variation. Also adds to the report a scatterplot of sequencing depth vs. coverage and a heatmap showing the depth per percentile. | | False | 50 | | min_percent_identity | number | Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified. | | 95 | 51 | | min_ref_coverage | number | Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths. | | 90 | 52 | 53 | 54 | ### Report Options 55 | 56 | | Nextflow parameter name | Type | Description | Help | Default | 57 | |--------------------------|------|-------------|------|---------| 58 | | abundance_threshold | number | Remove taxa whose abundance is equal to or lower than the chosen value. | To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0 and 1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number greater than or equal to 1. | 1 | 59 | | n_taxa_barplot | integer | Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the "Other" category. | | 9 | 60 | 61 | 62 | ### Output Options 63 | 64 | | Nextflow parameter name | Type | Description | Help | Default | 65 | |--------------------------|------|-------------|------|---------| 66 | | out_dir | string | Directory for output of all user-facing files. | | output | 67 | | igv | boolean | Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. 
If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index. | | False | 68 | | include_read_assignments | boolean | Output a per-sample TSV file that indicates the taxonomy assigned to each sequence. These files are only output on completion of the workflow. | | False | 69 | | output_unclassified | boolean | Output a FASTQ of the unclassified reads. | | False | 70 | 71 | 72 | ### Advanced Options 73 | 74 | | Nextflow parameter name | Type | Description | Help | Default | 75 | |--------------------------|------|-------------|------|---------| 76 | | min_len | integer | Specify read length lower limit. | Any reads shorter than this limit will not be included in the analysis. | 800 | 77 | | min_read_qual | number | Specify read quality lower limit. | Any reads with a quality lower than this limit will not be included in the analysis. | | 78 | | max_len | integer | Specify read length upper limit. | Any reads longer than this limit will not be included in the analysis. | 2000 | 79 | | threads | integer | Maximum number of CPU threads to use in each parallel workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. 
| 4 | 80 | 81 | 82 | -------------------------------------------------------------------------------- /main.nf: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env nextflow 2 | 3 | import groovy.json.JsonBuilder 4 | nextflow.enable.dsl = 2 5 | 6 | include { fastq_ingress; xam_ingress } from './lib/ingress' 7 | include { getParams } from './lib/common' 8 | include { run_common } from './wf-metagenomics/subworkflows/common_pipeline' 9 | include { minimap_pipeline } from './wf-metagenomics/subworkflows/minimap_pipeline' 10 | // standard kraken2 11 | include { kraken_pipeline } from './wf-metagenomics/subworkflows/kraken_pipeline' 12 | 13 | // databases 14 | include { prepare_databases } from "./wf-metagenomics/modules/local/databases.nf" 15 | include { 16 | makeReport; 17 | getVersions; 18 | getVersionsCommon; 19 | } from "./wf-metagenomics/modules/local/common" 20 | 21 | OPTIONAL_FILE = file("$projectDir/data/OPTIONAL_FILE") 22 | nextflow.preview.recursion=true 23 | 24 | // entrypoint workflow 25 | WorkflowMain.initialise(workflow, params, log) 26 | workflow { 27 | Pinguscript.ping_start(nextflow, workflow, params) 28 | 29 | dataDir = projectDir + '/data' 30 | 31 | // Ready the optional file 32 | OPTIONAL = file("$projectDir/data/OPTIONAL_FILE") 33 | 34 | 35 | // Checking user parameters 36 | log.info("Checking inputs.") 37 | 38 | // Check maximum and minimum length 39 | ArrayList fastcat_extra_args = [] 40 | if (params.min_len) { fastcat_extra_args << "-a $params.min_len" } 41 | if (params.max_len) { fastcat_extra_args << "-b $params.max_len" } 42 | if (params.min_read_qual) { fastcat_extra_args << "-q $params.min_read_qual" } 43 | // If BAM files are output, keep runIDs in case they are reused in the wf to track them. 
44 | boolean keep_bam = (params.keep_bam || params.igv) 45 | if (keep_bam) {fastcat_extra_args << "-H"} 46 | 47 | // Check source param is valid 48 | sources = params.database_sets 49 | if (params.containsKey('include_kraken2_assignments')){ 50 | throw new Exception("`include_kraken2_assignments` is now deprecated in favour of `include_read_assignments`.") 51 | } 52 | 53 | // Stop the pipeline in case of invalid parameter combinations 54 | if (params.classifier == 'minimap2' && params.database) { 55 | throw new Exception("To use minimap2 with your custom database, you need to use `--reference` (instead of `--database`) and `--ref2taxid`.") 56 | } 57 | 58 | boolean output_igv = params.igv 59 | if (params.classifier == 'minimap2' && params.reference && params.igv) { 60 | ArrayList ref_exts = [".fa", ".fa.gz", ".fasta", ".fasta.gz", ".fna", ".fna.gz"] 61 | if (! ref_exts.any { ext -> file(params.reference).name.endsWith(ext) }) { 62 | output_igv = false 63 | log.info("The custom database reference must be a FASTA format file in order to view within IGV.") 64 | } else { 65 | output_igv = true 66 | } 67 | } 68 | 69 | if ((params.classifier == 'kraken2') && params.reference) { 70 | throw new Exception("To use kraken2 with your custom database, you need to use `--database` (instead of `--reference`) and include the `bracken_dist` within it.") 71 | } 72 | 73 | // If the user provides a custom reference/database, set the source name to 'custom' 74 | if (params.reference || params.database) { 75 | source_name = 'custom' 76 | // distinguish between taxonomy and database so that the default taxonomy db can still be used in some cases. 77 | // this is potentially risky, but can be justified if the reference and ref2taxid use NCBI taxids. 
78 | source_data_database = null 79 | source_name_taxonomy = params.database_set 80 | source_data_taxonomy = sources.get(source_name_taxonomy, false) 81 | log.info("Note: Reference/Database are custom.") 82 | log.info("Note: Memory available to the workflow must be slightly higher than the size of the database $source_name index.") 83 | if (params.classifier == "kraken2"){ 84 | log.info("Note: Alternatively, consider using `--kraken2_memory_mapping`.") 85 | } 86 | 87 | } 88 | if (params.taxonomy) { // custom taxonomy, e.g. an updated taxonomy file used with the default reference 89 | if (!(params.reference || params.database)) { 90 | source_name = params.database_set 91 | source_data_database = sources.get(source_name, false) 92 | } 93 | source_data_taxonomy = null 94 | log.info("Note: Taxonomy database is custom.") 95 | } else if (!(params.reference || params.database)) { 96 | source_name = params.database_set 97 | source_data_database = sources.get(source_name, false) 98 | source_data_taxonomy = sources.get(source_name, false) 99 | if (!sources.containsKey(source_name) || !source_data_database) { 100 | keys = sources.keySet() 101 | throw new Exception("Source $params.database_set is invalid, must be one of $keys") 102 | } 103 | } 104 | // Input data 105 | if (params.fastq) { 106 | ingress_samples = fastq_ingress([ 107 | "input":params.fastq, 108 | "sample": params.sample, 109 | "sample_sheet": params.sample_sheet, 110 | "analyse_unclassified":params.analyse_unclassified, 111 | "stats": true, 112 | "fastcat_extra_args": fastcat_extra_args.join(" "), 113 | "per_read_stats": false 114 | ]) 115 | } else { 116 | // if we didn't get a `--fastq`, there must have been a `--bam` (as is codified 117 | // by the schema) 118 | ingress_samples = xam_ingress([ 119 | "input":params.bam, 120 | "sample":params.sample, 121 | "sample_sheet":params.sample_sheet, 122 | "analyse_unclassified":params.analyse_unclassified, 123 | "return_fastq": true, 124 | "keep_unaligned": true, 125 | "stats": true, 126 | "per_read_stats": false 127 | ]) 128 | } 129 | 130 | 
131 | // Discard empty samples 132 | log.info( 133 | "Note: Files that are empty, or whose reads were all discarded by the read length and/or read quality filters, " + 134 | "will not appear in the report and will be excluded from subsequent analysis.") 135 | ingress_samples_filtered = ingress_samples 136 | | filter { meta, _seqs, _stats -> 137 | def valid = meta['n_seqs'] > 0 138 | if (!valid) { 139 | log.warn "Found empty file for sample '${meta["alias"]}'." 140 | } 141 | valid 142 | } 143 | 144 | // Set minimap2 common options 145 | ArrayList common_minimap2_opts = [ 146 | "-ax map-ont", 147 | "--cap-kalloc 100m", 148 | "--cap-sw-mem 50m", 149 | ] 150 | 151 | 152 | // Run common 153 | versions = getVersionsCommon(getVersions()) 154 | parameters = getParams() 155 | 156 | if (params.exclude_host) { 157 | host_reference = file(params.exclude_host, checkIfExists: true) 158 | samples = run_common(ingress_samples_filtered, host_reference, common_minimap2_opts).samples 159 | } else { 160 | samples = ingress_samples_filtered 161 | } 162 | 163 | if (params.classifier == "minimap2") { 164 | log.info("Minimap2 pipeline.") 165 | if (keep_bam) { 166 | common_minimap2_opts = common_minimap2_opts + ["-y"] 167 | } 168 | databases = prepare_databases( 169 | source_data_taxonomy, 170 | source_data_database 171 | ) 172 | results = minimap_pipeline( 173 | samples, 174 | databases.reference, 175 | databases.ref2taxid, 176 | databases.taxonomy, 177 | databases.taxonomic_rank, 178 | common_minimap2_opts, 179 | output_igv 180 | ) 181 | alignment_stats = results.alignment_reports 182 | } else { 183 | // Fetch kraken2 database files when the kraken2 classifier is selected 184 | log.info("Kraken2 pipeline.") 185 | alignment_stats = Channel.empty() 186 | databases = prepare_databases( 187 | source_data_taxonomy, 188 | source_data_database 189 | ) 190 | results = kraken_pipeline( 191 | samples, 192 | databases.taxonomy, 193 | databases.database, 194 | databases.bracken_length, 
195 | databases.taxonomic_rank, 196 | ) 197 | } 198 | 199 | // Use the initial read stats (after fastcat) for QC, 200 | // and those after host depletion, 201 | // but update meta after running the pipelines 202 | for_report = ingress_samples_filtered 203 | | map { meta, _path, stats -> 204 | [ meta.alias, stats ] } 205 | | combine( 206 | results.metadata_after_taxonomy, 207 | by: 0 ) // on alias 208 | | multiMap { _alias, stats, meta -> 209 | meta: meta 210 | stats: stats } 211 | // Reporting 212 | makeReport( 213 | workflow.manifest.version, 214 | for_report.meta.collect(), 215 | for_report.stats.collect(), 216 | results.abundance_table, 217 | alignment_stats.ifEmpty(OPTIONAL_FILE), 218 | results.lineages, 219 | versions, 220 | parameters, 221 | databases.taxonomic_rank, 222 | OPTIONAL_FILE 223 | ) 224 | } 225 | 226 | workflow.onComplete { 227 | Pinguscript.ping_complete(nextflow, workflow, params) 228 | } 229 | workflow.onError { 230 | Pinguscript.ping_error(nextflow, workflow, params) 231 | } 232 | -------------------------------------------------------------------------------- /test/test_fastq_ingress.py: -------------------------------------------------------------------------------- 1 | """Test `fastq_ingress` results of a previously run workflow.""" 2 | import argparse 3 | import json 4 | import os 5 | import pathlib 6 | import re 7 | import sys 8 | 9 | import pandas as pd 10 | import pysam 11 | import pytest 12 | 13 | 14 | FASTQ_EXTENSIONS = ["fastq", "fastq.gz", "fq", "fq.gz"] 15 | ROOT_DIR = pathlib.Path(__file__).resolve().parent.parent 16 | 17 | 18 | def is_fastq_file(fname): 19 | """Check if file is a FASTQ file.""" 20 | return any(map(lambda ext: fname.endswith(ext), FASTQ_EXTENSIONS)) 21 | 22 | 23 | def get_fastq_files(path): 24 | """Return a list of FASTQ files for a given path.""" 25 | return filter(is_fastq_file, os.listdir(path)) if os.path.isdir(path) else [path] 26 | 27 | 28 | def create_metadict(**kwargs): 29 | """Create dict from metadata and check if required 
values are present.""" 30 | if "alias" not in kwargs or kwargs["alias"] is None: 31 | raise ValueError("Meta data needs 'alias'.") 32 | defaults = dict(barcode=None, type="test_sample", run_ids=[]) 33 | if "run_ids" in kwargs: 34 | # cast to sorted list to compare to workflow output 35 | kwargs["run_ids"] = sorted(list(kwargs["run_ids"])) 36 | defaults.update(kwargs) 37 | defaults["alias"] = defaults["alias"].replace(" ", "_") 38 | return defaults 39 | 40 | 41 | def get_fastq_names_and_runids(fastq_file): 42 | """Create a dict of names and run_ids for entries in a FASTQ file.""" 43 | names = [] 44 | run_ids = set() 45 | with pysam.FastxFile(fastq_file) as f: 46 | for entry in f: 47 | names.append(entry.name) 48 | (run_id,) = re.findall(r"runid=([^\s]+)", entry.comment) or [None] 49 | if run_id: 50 | run_ids.add(run_id) 51 | return dict(names=names, run_ids=run_ids) 52 | 53 | 54 | def args(): 55 | """Parse and process input arguments. Use the workflow params for those missing.""" 56 | # get the path to the workflow output directory 57 | parser = argparse.ArgumentParser() 58 | parser.add_argument( 59 | "--wf-output-dir", 60 | default=ROOT_DIR / "output", 61 | help=( 62 | "path to the output directory where the workflow results have been " 63 | "published; defaults to 'output' in the root directory of the workflow if " 64 | "not provided" 65 | ), 66 | ) 67 | parser.add_argument( 68 | "--fastq", 69 | help=( 70 | "Path to FASTQ input file / directory with FASTQ files / sub-directories; " 71 | "will take input path from workflow output if not provided" 72 | ), 73 | ) 74 | parser.add_argument( 75 | "--sample_sheet", 76 | help=( 77 | "Path to sample sheet CSV file. If not provided, will take sample sheet " 78 | "path from workflow params (if available)." 
79 | ), 80 | ) 81 | args = parser.parse_args() 82 | 83 | wf_output_dir = pathlib.Path(args.wf_output_dir) 84 | fastq_ingress_results_dir = wf_output_dir / "fastq_ingress_results" 85 | 86 | # make sure that there are fastq_ingress results (i.e. that the workflow has been 87 | # run successfully and that the correct wf output path was provided) 88 | if not fastq_ingress_results_dir.exists(): 89 | raise ValueError( 90 | f"{fastq_ingress_results_dir} does not exist. Has `wf-template` been run?" 91 | ) 92 | 93 | # get the workflow params 94 | with open(wf_output_dir / "params.json", "r") as f: 95 | params = json.load(f) 96 | input_path = args.fastq if args.fastq is not None else ROOT_DIR / params["fastq"] 97 | sample_sheet = args.sample_sheet 98 | if sample_sheet is None and params["sample_sheet"] is not None: 99 | sample_sheet = ROOT_DIR / params["sample_sheet"] 100 | 101 | if not os.path.exists(input_path): 102 | raise ValueError(f"Input path '{input_path}' does not exist.") 103 | 104 | return input_path, sample_sheet, fastq_ingress_results_dir, params 105 | 106 | 107 | def get_valid_inputs(input_path, sample_sheet, params): 108 | """Get valid input paths and corresponding metadata.""" 109 | # find the valid inputs 110 | valid_inputs = [] 111 | if os.path.isfile(input_path): 112 | # handle file case 113 | fastq_entries = get_fastq_names_and_runids(input_path) 114 | valid_inputs.append( 115 | [ 116 | create_metadict( 117 | alias=params["sample"] 118 | if params["sample"] is not None 119 | else os.path.basename(input_path).split(".")[0], 120 | run_ids=fastq_entries["run_ids"], 121 | ), 122 | input_path, 123 | ] 124 | ) 125 | else: 126 | # is a directory --> check if fastq files in top-level dir or in sub-dirs 127 | tree = list(os.walk(input_path)) 128 | top_dir_has_fastq_files = any(map(is_fastq_file, tree[0][2])) 129 | subdirs_have_fastq_files = any( 130 | any(map(is_fastq_file, files)) for _, _, files in tree[1:] 131 | ) 132 | if top_dir_has_fastq_files and 
subdirs_have_fastq_files: 133 | raise ValueError( 134 | f"Input directory '{input_path}' cannot contain FASTQ " 135 | "files and sub-directories with FASTQ files." 136 | ) 137 | # make sure we only have fastq files in either (top-level dir or sub-dirs) and 138 | # not both 139 | if not top_dir_has_fastq_files and not subdirs_have_fastq_files: 140 | raise ValueError( 141 | f"Input directory '{input_path}' contains neither sub-directories " 142 | "nor FASTQ files." 143 | ) 144 | if top_dir_has_fastq_files: 145 | run_ids = set() 146 | for fastq_file in get_fastq_files(input_path): 147 | curr_fastq_entries = get_fastq_names_and_runids( 148 | pathlib.Path(input_path) / fastq_file 149 | ) 150 | run_ids.update(curr_fastq_entries["run_ids"]) 151 | valid_inputs.append( 152 | [ 153 | create_metadict( 154 | alias=params["sample"] 155 | if params["sample"] is not None 156 | else os.path.basename(input_path), 157 | run_ids=run_ids, 158 | ), 159 | input_path, 160 | ] 161 | ) 162 | else: 163 | # iterate over the sub-directories 164 | for subdir, subsubdirs, files in tree[1:]: 165 | # make sure we don't have sub-sub-directories containing fastq files 166 | if subsubdirs and any( 167 | is_fastq_file(file) 168 | for subsubdir in subsubdirs 169 | for file in os.listdir(pathlib.Path(subdir) / subsubdir) 170 | ): 171 | raise ValueError( 172 | f"Input directory '{input_path}' cannot contain more " 173 | "than one level of sub-directories with FASTQ files." 
174 | ) 175 | # handle unclassified 176 | if ( 177 | os.path.basename(subdir) == "unclassified" 178 | and not params["analyse_unclassified"] 179 | ): 180 | continue 181 | # only process further if sub-dir has fastq files 182 | if any(map(is_fastq_file, files)): 183 | run_ids = set() 184 | for fastq_file in get_fastq_files(subdir): 185 | curr_fastq_entries = get_fastq_names_and_runids( 186 | pathlib.Path(subdir) / fastq_file 187 | ) 188 | run_ids.update(curr_fastq_entries["run_ids"]) 189 | 190 | barcode = os.path.basename(subdir) 191 | valid_inputs.append( 192 | [ 193 | create_metadict( 194 | alias=barcode, 195 | barcode=barcode, 196 | run_ids=run_ids, 197 | ), 198 | subdir, 199 | ] 200 | ) 201 | # parse the sample sheet in case there was one 202 | if sample_sheet is not None: 203 | sample_sheet = pd.read_csv(sample_sheet).set_index( 204 | # set 'barcode' as index while also keeping the 'barcode' column in the df 205 | "barcode", 206 | drop=False, 207 | ) 208 | # now, get the corresponding inputs for each entry in the sample sheet (sample 209 | # sheet entries for which no input directory was found will have `None` as their 210 | # input path); we need a dict mapping barcodes to valid input paths for this 211 | valid_inputs_dict = {os.path.basename(path): path for _, path in valid_inputs} 212 | # reset `valid_inputs` 213 | valid_inputs = [] 214 | for barcode, meta in sample_sheet.iterrows(): 215 | path = valid_inputs_dict.get(barcode) 216 | run_ids = set() 217 | if path is not None: 218 | for fastq_file in get_fastq_files(path): 219 | curr_fastq_entries = get_fastq_names_and_runids( 220 | pathlib.Path(path) / fastq_file 221 | ) 222 | run_ids.update(curr_fastq_entries["run_ids"]) 223 | valid_inputs.append([create_metadict(**dict(meta), run_ids=run_ids), path]) 224 | return valid_inputs 225 | 226 | 227 | # prepare data for the tests 228 | @pytest.fixture(scope="module") 229 | def prepare(): 230 | """Prepare data for tests.""" 231 | input_path, sample_sheet, 
fastq_ingress_results_dir, params = args() 232 | valid_inputs = get_valid_inputs(input_path, sample_sheet, params) 233 | return fastq_ingress_results_dir, valid_inputs, params 234 | 235 | 236 | # define tests 237 | def test_result_subdirs(prepare): 238 | """ 239 | Test if workflow results dir contains all expected samples. 240 | 241 | Tests if the published sub-directories in `fastq_ingress_results_dir` contain all 242 | the samples we expect. 243 | """ 244 | fastq_ingress_results_dir, valid_inputs, _ = prepare 245 | _, subdirs, files = next(os.walk(fastq_ingress_results_dir)) 246 | assert not files, "Files found in top-level dir of fastq_ingress results" 247 | assert set(subdirs) == set([meta["alias"] for meta, _ in valid_inputs]) 248 | 249 | 250 | def test_fastq_entry_names(prepare): 251 | """ 252 | Test FASTQ entries. 253 | 254 | Tests if the concatenated sequences indeed contain all the FASTQ entries of the 255 | FASTQ files in the valid inputs. 256 | """ 257 | fastq_ingress_results_dir, valid_inputs, _ = prepare 258 | for meta, path in valid_inputs: 259 | if path is None: 260 | # this sample sheet entry had no input dir (or no reads) 261 | continue 262 | # get FASTQ entries in the result file produced by the workflow 263 | fastq_entries = get_fastq_names_and_runids( 264 | fastq_ingress_results_dir / meta["alias"] / "seqs.fastq.gz" 265 | ) 266 | # now collect the FASTQ entries from the individual input files 267 | exp_fastq_names = [] 268 | exp_fastq_runids = [] 269 | for fastq_file in get_fastq_files(path): 270 | curr_fastq_entries = get_fastq_names_and_runids( 271 | pathlib.Path(path) / fastq_file 272 | ) 273 | exp_fastq_names += curr_fastq_entries["names"] 274 | exp_fastq_runids += curr_fastq_entries["run_ids"] 275 | assert set(fastq_entries["names"]) == set(exp_fastq_names) 276 | assert set(fastq_entries["run_ids"]) == set(exp_fastq_runids) 277 | 278 | 279 | def test_stats_present(prepare): 280 | """Tests if the `fastcat` stats are present when they should 
be.""" 281 | fastq_ingress_results_dir, valid_inputs, params = prepare 282 | for meta, path in valid_inputs: 283 | if path is None: 284 | # this sample sheet entry had no input dir (or no reads) 285 | continue 286 | # we expect `fastcat` stats in two cases: (i) they were requested explicitly or 287 | # (ii) the input was a directory containing multiple FASTQ files 288 | expect_stats = ( 289 | params["wf"]["fastcat_stats"] 290 | or os.path.isdir(path) 291 | and len(list(filter(is_fastq_file, os.listdir(path)))) > 1 292 | ) 293 | stats_dir = fastq_ingress_results_dir / meta["alias"] / "fastcat_stats" 294 | # assert that stats are there when we expect them 295 | assert expect_stats == stats_dir.exists() 296 | # make sure that the per-file and per-read stats files are there 297 | if expect_stats: 298 | for fname in ("per-file-stats.tsv", "per-read-stats.tsv"): 299 | assert ( 300 | fastq_ingress_results_dir / meta["alias"] / "fastcat_stats" / fname 301 | ).is_file() 302 | 303 | 304 | def test_metamap(prepare): 305 | """Test if the metamap in the `fastq_ingress` results is as expected.""" 306 | fastq_ingress_results_dir, valid_inputs, params = prepare 307 | for meta, _ in valid_inputs: 308 | # if there were no fastcat stats, we can't expect run IDs in the metamap 309 | if not params["wf"]["fastcat_stats"]: 310 | meta["run_ids"] = [] 311 | with open(fastq_ingress_results_dir / meta["alias"] / "metamap.json", "r") as f: 312 | metamap = json.load(f) 313 | assert meta == metamap 314 | 315 | 316 | if __name__ == "__main__": 317 | # trigger pytest 318 | ret_code = pytest.main([os.path.realpath(__file__), "-vv"]) 319 | sys.exit(ret_code) 320 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Oxford Nanopore Technologies PLC. Public License Version 1.0 2 | ============================================================= 3 | 4 | 1. 
Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. "Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor’s Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. "Executable Form" 25 | means any form of the work other than Source Code Form. 26 | 27 | 1.6. "Larger Work" 28 | means a work that combines Covered Software with other material, in 29 | a separate file or files, that is not Covered Software. 30 | 31 | 1.7. "License" 32 | means this document. 33 | 34 | 1.8. "Licensable" 35 | means having the right to grant, to the maximum extent possible, 36 | whether at the time of the initial grant or subsequently, any and 37 | all of the rights conveyed by this License. 38 | 39 | 1.9. "Modifications" 40 | means any of the following: 41 | 42 | (a) any file in Source Code Form that results from an addition to, 43 | deletion from, or modification of the contents of Covered 44 | Software; or 45 | (b) any new file in Source Code Form that contains any Covered 46 | Software. 47 | 48 | 1.10. "Research Purposes" 49 | means use for internal research and not intended for or directed 50 | towards commercial advantages or monetary compensation; provided, 51 | however, that monetary compensation does not include sponsored 52 | research of research funded by grants. 
53 | 54 | 1.11 "Secondary License" 55 | means either the GNU General Public License, Version 2.0, the GNU 56 | Lesser General Public License, Version 2.1, the GNU Affero General 57 | Public License, Version 3.0, or any later versions of those 58 | licenses. 59 | 60 | 1.12. "Source Code Form" 61 | means the form of the work preferred for making modifications. 62 | 63 | 1.13. "You" (or "Your") 64 | means an individual or a legal entity exercising rights under this 65 | License. For legal entities, "You" includes any entity that 66 | controls, is controlled by, or is under common control with You. For 67 | purposes of this definition, "control" means (a) the power, direct 68 | or indirect, to cause the direction or management of such entity, 69 | whether by contract or otherwise, or (b) ownership of more than 70 | fifty percent (50%) of the outstanding shares or beneficial 71 | ownership of such entity. 72 | 73 | 2. License Grants and Conditions 74 | -------------------------------- 75 | 76 | 2.1. Grants 77 | 78 | Each Contributor hereby grants You a world-wide, royalty-free, 79 | non-exclusive license under Contributor copyrights Licensable by such 80 | Contributor to use, reproduce, make available, modify, display, 81 | perform, distribute, and otherwise exploit solely for Research Purposes 82 | its Contributions, either on an unmodified basis, with Modifications, 83 | or as part of a Larger Work. 84 | 85 | 2.2. Effective Date 86 | 87 | The licenses granted in Section 2.1 with respect to any Contribution 88 | become effective for each Contribution on the date the Contributor 89 | first distributes such Contribution. 90 | 91 | 2.3. Limitations on Grant Scope 92 | 93 | The licenses granted in this Section 2 are the only rights granted under 94 | this License. No additional rights or licenses will be implied from the 95 | distribution or licensing of Covered Software under this License. The 96 | License is incompatible with Secondary Licenses. 
Notwithstanding 97 | Section 2.1 above, no copyright license is granted: 98 | 99 | (a) for any code that a Contributor has removed from Covered Software; 100 | or 101 | 102 | (b) use of the Contributions or its Contributor Version other than for 103 | Research Purposes only; or 104 | 105 | (c) for infringements caused by: (i) Your and any other third party’s 106 | modifications of Covered Software, or (ii) the combination of its 107 | Contributions with other software (except as part of its Contributor 108 | Version). 109 | 110 | This License does not grant any rights in the patents, trademarks, 111 | service marks, or logos of any Contributor (except as may be necessary 112 | to comply with the notice requirements in Section 3.4). 113 | 114 | 2.4. Subsequent Licenses 115 | 116 | No Contributor makes additional grants as a result of Your choice to 117 | distribute the Covered Software under a subsequent version of this 118 | License (see Section 10.2) or under the terms of a Secondary License 119 | (if permitted under the terms of Section 3.3). 120 | 121 | 2.5. Representation 122 | 123 | Each Contributor represents that the Contributor believes its 124 | Contributions are its original creation(s) or it has sufficient rights 125 | to grant the rights to its Contributions conveyed by this License. 126 | 127 | 2.6. Fair Use 128 | 129 | This License is not intended to limit any rights You have under 130 | applicable copyright doctrines of fair use, fair dealing, or other 131 | equivalents. 132 | 133 | 2.7. Conditions 134 | 135 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 136 | in Section 2.1. 137 | 138 | 3. Responsibilities 139 | ------------------- 140 | 141 | 3.1. Distribution of Source Form 142 | 143 | All distribution of Covered Software in Source Code Form, including any 144 | Modifications that You create or to which You contribute, must be under 145 | the terms of this License. 
You must inform recipients that the Source 146 | Code Form of the Covered Software is governed by the terms of this 147 | License, and how they can obtain a copy of this License. You may not 148 | attempt to alter or restrict the recipients’ rights in the Source Code Form. 149 | 150 | 3.2. Distribution of Executable Form 151 | 152 | If You distribute Covered Software in Executable Form then: 153 | 154 | (a) such Covered Software must also be made available in Source Code 155 | Form, as described in Section 3.1, and You must inform recipients of 156 | the Executable Form how they can obtain a copy of such Source Code 157 | Form by reasonable means in a timely manner, at a charge no more 158 | than the cost of distribution to the recipient; and 159 | 160 | (b) You may distribute such Executable Form under the terms of this 161 | License. 162 | 163 | 3.3. Distribution of a Larger Work 164 | 165 | You may create and distribute a Larger Work under terms of Your choice, 166 | provided that You also comply with the requirements of this License for 167 | the Covered Software. The Larger Work may not be a combination of Covered 168 | Software with a work governed by one or more Secondary Licenses. 169 | 170 | 3.4. Notices 171 | 172 | You may not remove or alter the substance of any license notices 173 | (including copyright notices, patent notices, disclaimers of warranty, 174 | or limitations of liability) contained within the Source Code Form of 175 | the Covered Software, except that You may alter any license notices to 176 | the extent required to remedy known factual inaccuracies. 177 | 178 | 3.5. Application of Additional Terms 179 | 180 | You may not choose to offer, or charge a fee for use of the Covered 181 | Software or a fee for, warranty, support, indemnity or liability 182 | obligations to one or more recipients of Covered Software. 
You must 183 | make it absolutely clear that any such warranty, support, indemnity, or 184 | liability obligation is offered by You alone, and You hereby agree to 185 | indemnify every Contributor for any liability incurred by such 186 | Contributor as a result of warranty, support, indemnity or liability 187 | terms You offer. You may include additional disclaimers of warranty and 188 | limitations of liability specific to any jurisdiction. 189 | 190 | 4. Inability to Comply Due to Statute or Regulation 191 | --------------------------------------------------- 192 | 193 | If it is impossible for You to comply with any of the terms of this 194 | License with respect to some or all of the Covered Software due to 195 | statute, judicial order, or regulation then You must: (a) comply with 196 | the terms of this License to the maximum extent possible; and (b) 197 | describe the limitations and the code they affect. Such description must 198 | be placed in a text file included with all distributions of the Covered 199 | Software under this License. Except to the extent prohibited by statute 200 | or regulation, such description must be sufficiently detailed for a 201 | recipient of ordinary skill to be able to understand it. 202 | 203 | 5. Termination 204 | -------------- 205 | 206 | 5.1. The rights granted under this License will terminate automatically 207 | if You fail to comply with any of its terms. 208 | 209 | 5.2. If You initiate litigation against any entity by asserting an 210 | infringement claim (excluding declaratory judgment actions, 211 | counter-claims, and cross-claims) alleging that a Contributor Version 212 | directly or indirectly infringes, then the rights granted to 213 | You by any and all Contributors for the Covered Software under Section 214 | 2.1 of this License shall terminate. 215 | 216 | 5.3. 
In the event of termination under Sections 5.1 or 5.2 above, all 217 | end user license agreements (excluding distributors and resellers) which 218 | have been validly granted by You or Your distributors under this License 219 | prior to termination shall survive termination. 220 | 221 | ************************************************************************ 222 | * * 223 | * 6. Disclaimer of Warranty * 224 | * ------------------------- * 225 | * * 226 | * Covered Software is provided under this License on an "as is" * 227 | * basis, without warranty of any kind, either expressed, implied, or * 228 | * statutory, including, without limitation, warranties that the * 229 | * Covered Software is free of defects, merchantable, fit for a * 230 | * particular purpose or non-infringing. The entire risk as to the * 231 | * quality and performance of the Covered Software is with You. * 232 | * Should any Covered Software prove defective in any respect, You * 233 | * (not any Contributor) assume the cost of any necessary servicing, * 234 | * repair, or correction. This disclaimer of warranty constitutes an * 235 | * essential part of this License. No use of any Covered Software is * 236 | * authorized under this License except under this disclaimer. * 237 | * * 238 | ************************************************************************ 239 | 240 | ************************************************************************ 241 | * * 242 | * 7. 
Limitation of Liability * 243 | * -------------------------- * 244 | * * 245 | * Under no circumstances and under no legal theory, whether tort * 246 | * (including negligence), contract, or otherwise, shall any * 247 | * Contributor, or anyone who distributes Covered Software as * 248 | * permitted above, be liable to You for any direct, indirect, * 249 | * special, incidental, or consequential damages of any character * 250 | * including, without limitation, damages for lost profits, loss of * 251 | * goodwill, work stoppage, computer failure or malfunction, or any * 252 | * and all other commercial damages or losses, even if such party * 253 | * shall have been informed of the possibility of such damages. This * 254 | * limitation of liability shall not apply to liability for death or * 255 | * personal injury resulting from such party’s negligence to the * 256 | * extent applicable law prohibits such limitation, but in such event, * 257 | * and to the greatest extent permissible, damages will be limited to * 258 | * direct damages not to exceed one hundred dollars. Some * 259 | * jurisdictions do not allow the exclusion or limitation of * 260 | * incidental or consequential damages, so this exclusion and * 261 | * limitation may not apply to You. * 262 | * * 263 | ************************************************************************ 264 | 265 | 8. Litigation 266 | ------------- 267 | 268 | Any litigation relating to this License may be brought only in the 269 | courts of a jurisdiction where the defendant maintains its principal 270 | place of business and such litigation shall be governed by laws of that 271 | jurisdiction, without reference to its conflict-of-law provisions. 272 | Nothing in this Section shall prevent a party’s ability to bring 273 | cross-claims or counter-claims. 274 | 275 | 9. Miscellaneous 276 | ---------------- 277 | 278 | This License represents the complete agreement concerning the subject 279 | matter hereof. 
If any provision of this License is held to be 280 | unenforceable, such provision shall be reformed only to the extent 281 | necessary to make it enforceable. Any law or regulation which provides 282 | that the language of a contract shall be construed against the drafter 283 | shall not be used to construe this License against a Contributor. 284 | 285 | 10. Versions of the License 286 | --------------------------- 287 | 288 | 10.1. New Versions 289 | 290 | Oxford Nanopore Technologies PLC. is the license steward. Except as 291 | provided in Section 10.3, no one other than the license steward has the 292 | right to modify or publish new versions of this License. Each version 293 | will be given a distinguishing version number. 294 | 295 | 10.2. Effect of New Versions 296 | 297 | You may distribute the Covered Software under the terms of the version 298 | of the License under which You originally received the Covered Software, 299 | or under the terms of any subsequent version published by the license 300 | steward. 301 | 302 | 10.3. Modified Versions 303 | 304 | If you create software not governed by this License, and you want to 305 | create a new license for such software, you may create and use a 306 | modified version of this License if you rename the license and remove 307 | any references to the name of the license steward (except to note that 308 | such modified license differs from this License). 309 | 310 | Exhibit A - Source Code Form License Notice 311 | ------------------------------------------- 312 | 313 | This Source Code Form is subject to the terms of the Oxford Nanopore 314 | Technologies PLC. Public License, v. 1.0. The full licence can be 315 | obtained from support@nanoporetech.com. 316 | 317 | If it is not possible or desirable to put the notice in a particular 318 | file, then You may include the notice in a location (such as a LICENSE 319 | file in a relevant directory) where a recipient would be likely to look 320 | for such a notice.
321 | 322 | You may add additional accurate notices of copyright ownership. 323 | -------------------------------------------------------------------------------- /nextflow_schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema", 3 | "$id": "https://raw.githubusercontent.com/epi2me-labs/wf-16s/master/nextflow_schema.json", 4 | "title": "epi2me-labs/wf-16s", 5 | "workflow_title": "16S rRNA", 6 | "description": "Taxonomic classification of 16S rRNA gene sequencing data.", 7 | "demo_url": "https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz", 8 | "aws_demo_url": "https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo/aws.nextflow.config", 9 | "url": "https://github.com/epi2me-labs/wf-16s", 10 | "type": "object", 11 | "resources": { 12 | "recommended": { 13 | "cpus": 12, 14 | "memory": "32GB" 15 | }, 16 | "minimum": { 17 | "cpus": 6, 18 | "memory": "16GB" 19 | }, 20 | "run_time": "~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database.", 21 | "arm_support": true 22 | }, 23 | "definitions": { 24 | "input_options": { 25 | "title": "Input Options", 26 | "type": "object", 27 | "fa_icon": "fas fa-terminal", 28 | "description": "Define where the pipeline should find input data and save output data.", 29 | "properties": { 30 | "fastq": { 31 | "type": "string", 32 | "format": "path", 33 | "title": "FASTQ", 34 | "description": "FASTQ files to use in the analysis.", 35 | "help_text": "This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. 
In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.", 36 | "demo_data": "test_data" 37 | }, 38 | "bam": { 39 | "type": "string", 40 | "format": "path", 41 | "description": "BAM or unaligned BAM (uBAM) files to use in the analysis.", 42 | "help_text": "This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`." 43 | }, 44 | "classifier": { 45 | "type": "string", 46 | "default": "minimap2", 47 | "title": "Classification method", 48 | "description": "Kraken2 or Minimap2 workflow to be used for classification of reads.", 49 | "enum": [ 50 | "kraken2", 51 | "minimap2" 52 | ], 53 | "help_text": "Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info." 54 | }, 55 | "analyse_unclassified": { 56 | "type": "boolean", 57 | "default": false, 58 | "title": "Analyse unclassified reads", 59 | "description": "Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory.", 60 | "help_text": "If selected and if the input is a multiplex directory the workflow will also process the unclassified directory." 61 | }, 62 | "exclude_host": { 63 | "type": "string", 64 | "format": "file-path", 65 | "title": "Exclude host reads", 66 | "description": "A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis." 
67 | } 68 | }, 69 | "oneOf": [ 70 | { 71 | "required": [ 72 | "fastq" 73 | ] 74 | }, 75 | { 76 | "required": [ 77 | "bam" 78 | ] 79 | } 80 | ] 81 | }, 82 | "sample_options": { 83 | "title": "Sample Options", 84 | "type": "object", 85 | "default": "", 86 | "properties": { 87 | "sample_sheet": { 88 | "type": "string", 89 | "format": "file-path", 90 | "title": "Sample sheet", 91 | "description": "A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files.", 92 | "help_text": "The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed." 93 | }, 94 | "sample": { 95 | "type": "string", 96 | "title": "Sample name", 97 | "description": "A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files." 98 | } 99 | }, 100 | "description": "Parameters that relate to samples such as sample sheets and sample names." 101 | }, 102 | "reference_options": { 103 | "title": "Reference Options", 104 | "type": "object", 105 | "description": "Files will be downloaded as part of the first run of workflow and automatically stored for subsequent runs.", 106 | "default": "", 107 | "properties": { 108 | "database_set": { 109 | "type": "string", 110 | "default": "ncbi_16s_18s", 111 | "title": "Choose a database", 112 | "description": "Sets the reference, databases and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. Workflow will require memory available to be slightly higher than the size of the database.", 113 | "enum": [ 114 | "ncbi_16s_18s", 115 | "ncbi_16s_18s_28s_ITS", 116 | "SILVA_138_1" 117 | ], 118 | "help_text": "This setting is overridable by providing an explicit taxonomy, database or reference path in the other reference options." 
119 | }, 120 | "store_dir": { 121 | "type": "string", 122 | "format": "directory-path", 123 | "title": "Store directory name", 124 | "description": "Where to store the initial download of the database.", 125 | "help_text": "The database set selected will be downloaded as part of the workflow and saved in this location; on subsequent runs, this stored copy will be used as the database.", 126 | "hidden": true, 127 | "default": "store_dir" 128 | }, 129 | "database": { 130 | "type": "string", 131 | "format": "path", 132 | "title": "Kraken2 database", 133 | "description": "Not required but can be used to specifically override Kraken2 database [.tar.gz or Directory].", 134 | "help_text": "By default uses database chosen in database_set parameter.", 135 | "overrides": { 136 | "epi2mecloud": { 137 | "hidden": true 138 | } 139 | } 140 | }, 141 | "taxonomy": { 142 | "type": "string", 143 | "format": "path", 144 | "title": "Taxonomy database", 145 | "description": "Not required but can be used to specifically override taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory].", 146 | "help_text": "By default, the NCBI taxonomy file will be downloaded and used." 147 | }, 148 | "reference": { 149 | "type": "string", 150 | "format": "file-path", 151 | "title": "Minimap2 reference", 152 | "description": "Override the FASTA reference file selected by the database_set parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index.", 153 | "help_text": "This option should be used in conjunction with the database parameter to specify a custom database." 154 | }, 155 | "ref2taxid": { 156 | "type": "string", 157 | "format": "file-path", 158 | "title": "File linking reference IDs to specific taxids", 159 | "description": "Not required but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row.", 160 | "help_text": "By default uses ref2taxid for option chosen in database_set parameter."
161 | }, 162 | "taxonomic_rank": { 163 | "type": "string", 164 | "default": "G", 165 | "title": "Taxonomic rank", 166 | "description": "Returns results at the taxonomic rank chosen. In the Kraken2 pipeline, this sets the level that Bracken will estimate abundance at. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species).", 167 | "enum": [ 168 | "S", 169 | "G", 170 | "F", 171 | "O", 172 | "C", 173 | "P" 174 | ] 175 | } 176 | }, 177 | "dependencies": { 178 | "reference": [ 179 | "ref2taxid" 180 | ], 181 | "ref2taxid": [ 182 | "reference" 183 | ] 184 | } 185 | }, 186 | "kraken2_options": { 187 | "title": "Kraken2 Options", 188 | "type": "object", 189 | "fa_icon": "fas fa-university", 190 | "help_text": "Kraken2: It is possible to enable classification by Kraken2, disabling alignment, which is a faster but coarser method of classification reliant on the presence of a Kraken2 database.", 191 | "properties": { 192 | "bracken_length": { 193 | "type": "integer", 194 | "title": "Bracken length", 195 | "description": "Set the length value Bracken will use", 196 | "minimum": 1, 197 | "help_text": "Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically. ncbi_16s_18s = 1000 , ncbi_16s_18s_28s_ITS = 1000 , PlusPF-8 = 300" 198 | }, 199 | "bracken_threshold": { 200 | "type": "integer", 201 | "title": "Bracken minimum read threshold", 202 | "description": "Set the minimum read threshold Bracken will use to consider a taxon", 203 | "default": 10, 204 | "minimum": 0, 205 | "help_text": "Bracken will only consider taxa with a read count greater than or equal to this value." 
206 | }, 207 | "kraken2_memory_mapping": { 208 | "type": "boolean", 209 | "default": false, 210 | "title": "Enable memory mapping", 211 | "description": "Avoids loading database into RAM", 212 | "help_text": "Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. It may be useful if the available RAM memory is lower than the size of the chosen database." 213 | }, 214 | "kraken2_confidence": { 215 | "type": "number", 216 | "default": 0.0, 217 | "title": "Confidence score threshold", 218 | "description": "Kraken2 Confidence score threshold. Default: 0.0. Valid interval: 0-1", 219 | "help_text": "Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works." 220 | } 221 | }, 222 | "description": "Kraken2 classification options. Only relevant if classifier parameter is set to kraken2" 223 | }, 224 | "minimap2_options": { 225 | "title": "Minimap2 Options", 226 | "type": "object", 227 | "fa_icon": "fas fa-dna", 228 | "properties": { 229 | "minimap2filter": { 230 | "type": "string", 231 | "title": "Select reads belonging to the following taxonomy identifiers (taxids)", 232 | "description": "Filter output of minimap2 by taxids inc. child nodes, E.g. \"9606,1404\"", 233 | "help_text": "Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs." 234 | }, 235 | "minimap2exclude": { 236 | "type": "boolean", 237 | "default": false, 238 | "title": "Exclude reads from previous selected taxids", 239 | "description": "Invert minimap2filter and exclude the given taxids instead", 240 | "help_text": "Exclude a list of taxids from analysis outputs." 
241 | }, 242 | "keep_bam": { 243 | "type": "boolean", 244 | "title": "Enable keep BAM files", 245 | "default": false, 246 | "description": "Copy bam files into the output directory." 247 | }, 248 | "minimap2_by_reference": { 249 | "type": "boolean", 250 | "default": false, 251 | "title": "Compute coverage and sequencing depth of the references.", 252 | "description": "Add a table with the mean sequencing depth per reference, standard deviation and coefficient of variation. It adds a scatterplot of the sequencing depth vs. the coverage and a heatmap showing the depth per percentile to the report" 253 | }, 254 | "min_percent_identity": { 255 | "type": "number", 256 | "default": 95, 257 | "minimum": 0, 258 | "maximum": 100, 259 | "title": "Filter taxa based on the percent of identity with the references.", 260 | "description": "Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified." 261 | }, 262 | "min_ref_coverage": { 263 | "type": "number", 264 | "default": 90, 265 | "minimum": 0, 266 | "maximum": 100, 267 | "title": "Filter taxa based on the percent of coverage with the reference.", 268 | "description": "Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths." 269 | } 270 | }, 271 | "description": "Minimap2 classification options. Only relevant if classifier parameter is set to minimap2.", 272 | "help_text": "Minimap2: The default strategy uses minimap2 to perform full alignments against FASTA-formatted references sequences." 
273 | }, 274 | "report_options": { 275 | "title": "Report Options", 276 | "type": "object", 277 | "fa_icon": "fas fa-pills", 278 | "properties": { 279 | "abundance_threshold": { 280 | "type": "number", 281 | "default": 1, 282 | "title": "Abundance threshold", 283 | "description": "Remove those taxa whose abundance is equal to or lower than the chosen value.", 284 | "help_text": "To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0 and 1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number larger than or equal to 1." 285 | }, 286 | "n_taxa_barplot": { 287 | "type": "integer", 288 | "default": 9, 289 | "title": "Number of taxa to be displayed in the barplot", 290 | "description": "Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the \"Other\" category." 291 | } 292 | } 293 | }, 294 | "output_options": { 295 | "title": "Output Options", 296 | "type": "object", 297 | "description": "Parameters for saving and naming workflow outputs.", 298 | "default": "", 299 | "properties": { 300 | "out_dir": { 301 | "type": "string", 302 | "format": "directory-path", 303 | "default": "output", 304 | "title": "Output folder name", 305 | "description": "Directory for output of all user-facing files." 306 | }, 307 | "igv": { 308 | "type": "boolean", 309 | "default": false, 310 | "title": "IGV", 311 | "description": "Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index." 312 | }, 313 | "include_read_assignments": { 314 | "type": "boolean", 315 | "default": false, 316 | "title": "Include Kraken2/Minimap2 taxonomy per read.", 317 | "description": "A per-sample TSV file that indicates the taxonomy assigned to each sequence.
These will only be output on completion of the workflow." 318 | }, 319 | "output_unclassified": { 320 | "type": "boolean", 321 | "default": false, 322 | "title": "Output unclassified reads.", 323 | "description": "Output a FASTQ of the unclassified reads." 324 | } 325 | } 326 | }, 327 | "advanced_options": { 328 | "title": "Advanced Options", 329 | "type": "object", 330 | "description": "Advanced options for configuring processes inside the workflow.", 331 | "default": "", 332 | "properties": { 333 | "min_len": { 334 | "type": "integer", 335 | "default": 800, 336 | "title": "Minimum read length", 337 | "description": "Specify read length lower limit.", 338 | "help_text": "Any reads shorter than this limit will not be included in the analysis." 339 | }, 340 | "min_read_qual": { 341 | "type": "number", 342 | "title": "Minimum read quality", 343 | "description": "Specify read quality lower limit.", 344 | "help_text": "Any reads with a quality lower than this limit will not be included in the analysis." 345 | }, 346 | "max_len": { 347 | "type": "integer", 348 | "title": "Maximum read length", 349 | "default": 2000, 350 | "description": "Specify read length upper limit", 351 | "help_text": "Any reads longer than this limit will not be included in the analysis." 352 | }, 353 | "threads": { 354 | "type": "integer", 355 | "default": 4, 356 | "title": "Number of CPU threads per workflow task", 357 | "description": "Maximum number of CPU threads to use in each parallel workflow task.", 358 | "help_text": "Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes." 
359 | } 360 | } 361 | }, 362 | "miscellaneous_options": { 363 | "title": "Miscellaneous Options", 364 | "type": "object", 365 | "fa_icon": "fas fa-file-import", 366 | "description": "Everything else.", 367 | "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs. Typically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", 368 | "properties": { 369 | "disable_ping": { 370 | "type": "boolean", 371 | "default": false, 372 | "description": "Enable to prevent sending a workflow ping.", 373 | "overrides": { 374 | "epi2mecloud": { 375 | "hidden": true 376 | } 377 | } 378 | }, 379 | "help": { 380 | "type": "boolean", 381 | "title": "Display help text", 382 | "default": false, 383 | "fa_icon": "fas fa-question-circle", 384 | "hidden": true 385 | }, 386 | "version": { 387 | "type": "boolean", 388 | "title": "Display version", 389 | "default": false, 390 | "description": "Display version and exit.", 391 | "fa_icon": "fas fa-question-circle", 392 | "hidden": true 393 | } 394 | } 395 | } 396 | }, 397 | "allOf": [ 398 | { 399 | "$ref": "#/definitions/input_options" 400 | }, 401 | { 402 | "$ref": "#/definitions/sample_options" 403 | }, 404 | { 405 | "$ref": "#/definitions/reference_options" 406 | }, 407 | { 408 | "$ref": "#/definitions/kraken2_options" 409 | }, 410 | { 411 | "$ref": "#/definitions/minimap2_options" 412 | }, 413 | { 414 | "$ref": "#/definitions/output_options" 415 | }, 416 | { 417 | "$ref": "#/definitions/advanced_options" 418 | }, 419 | { 420 | "$ref": "#/definitions/miscellaneous_options" 421 | }, 422 | { 423 | "$ref": "#/definitions/report_options" 424 | } 425 | ], 426 | "properties": { 427 | "aws_image_prefix": { 428 | "type": "string", 429 | "title": "AWS image prefix", 430 | "hidden": true 431 | }, 432 | "aws_queue": { 433 | "type": "string", 434 | "title": "AWS queue", 435 | "hidden": true 436 | }, 437 | 
"monochrome_logs": { 438 | "type": "boolean" 439 | }, 440 | "validate_params": { 441 | "type": "boolean", 442 | "default": true 443 | }, 444 | "show_hidden_params": { 445 | "type": "boolean" 446 | } 447 | } 448 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 16S rRNA 2 | 3 | Taxonomic classification of 16S rRNA gene sequencing data. 4 | 5 | 6 | 7 | ## Introduction 8 | 9 | This workflow can be used for the following: 10 | 11 | + Taxonomic classification of 16S rRNA, 18S rRNA and ITS amplicons using [default or custom databases](#faqs). Default databases: 12 | - NCBI targeted loci: 16S rDNA, 18S rDNA, ITS (ncbi_16s_18s, ncbi_16s_18s_28s_ITS; see [here](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) for details). 13 | + Generate taxonomic profiles of one or more samples. 14 | 15 | The workflow default parameters are optimised for analysis of 16S rRNA gene amplicons. 16 | For ITS amplicons, it is strongly recommended that some parameters are changed from the defaults, please see the [ITS presets](#analysing-its-amplicons) section for more information. 17 | 18 | Additional features: 19 | + Two different approaches are available: `minimap2` (using alignment, default option) or `kraken2` (k-mer based). 20 | + Results include: 21 | - An abundance table with counts per taxa in all the samples. 22 | - Interactive sankey and sunburst plots to explore the different identified lineages. 23 | - A bar plot comparing the abundances of the most abundant taxa in all the samples. 24 | 25 | 26 | 27 | 28 | ## Compute requirements 29 | 30 | Recommended requirements: 31 | 32 | + CPUs = 12 33 | + Memory = 32GB 34 | 35 | Minimum requirements: 36 | 37 | + CPUs = 6 38 | + Memory = 16GB 39 | 40 | Approximate run time: ~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database. 
41 | 42 | ARM processor support: True 43 | 44 | 45 | 46 | 47 | ## Install and run 48 | 49 | 50 | These are instructions to install and run the workflow on the command line. 51 | You can also access the workflow via the 52 | [EPI2ME Desktop application](https://labs.epi2me.io/downloads/). 53 | 54 | The workflow uses [Nextflow](https://www.nextflow.io/) to manage 55 | compute and software resources; 56 | therefore, Nextflow will need to be 57 | installed before attempting to run the workflow. 58 | 59 | The workflow can currently be run using either 60 | [Docker](https://docs.docker.com/get-started/) 61 | or [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) 62 | to provide isolation of the required software. 63 | Both methods are automated out-of-the-box provided 64 | either Docker or Singularity is installed. 65 | This is controlled by the 66 | [`-profile`](https://www.nextflow.io/docs/latest/config.html#config-profiles) 67 | parameter as exemplified below. 68 | 69 | It is not required to clone or download the git repository 70 | in order to run the workflow. 71 | More information on running EPI2ME workflows can 72 | be found on our [website](https://labs.epi2me.io/wfindex). 73 | 74 | The following command can be used to obtain the workflow. 75 | This will pull the repository into the assets folder of 76 | Nextflow and provide a list of all parameters 77 | available for the workflow as well as an example command: 78 | 79 | ``` 80 | nextflow run epi2me-labs/wf-16s --help 81 | ``` 82 | To update a workflow to the latest version on the command line, use 83 | the following command: 84 | ``` 85 | nextflow pull epi2me-labs/wf-16s 86 | ``` 87 | 88 | A demo dataset is provided for testing of the workflow.
89 | It can be downloaded and unpacked using the following commands: 90 | ``` 91 | wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz 92 | tar -xzvf wf-16s-demo.tar.gz 93 | ``` 94 | The workflow can then be run with the downloaded demo data using: 95 | ``` 96 | nextflow run epi2me-labs/wf-16s \ 97 | --fastq 'wf-16s-demo/test_data' \ 98 | --minimap2_by_reference \ 99 | -profile standard 100 | ``` 101 | 102 | For further information about running a workflow on 103 | the command line see https://labs.epi2me.io/wfquickstart/ 104 | 105 | 106 | 107 | 108 | ## Related protocols 109 | 110 | This workflow is designed to take input sequences that have been produced by [Oxford Nanopore Technologies](https://nanoporetech.com/) devices using protocols associated with either of the kits listed below: 111 | 112 | - [SQK-MAB114.24](https://nanoporetech.com/document/microbial-amplicon-barcoding-sequencing-for-16s-and-its-sqk-mab114-24) 113 | - [SQK-16S114.24](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/rapid-sequencing-DNA-16s-barcoding-kit-v14-sqk-16114-24) 114 | 115 | Find related protocols in the [Nanopore community](https://community.nanoporetech.com/docs/). 116 | 117 | 118 | 119 | ## Input example 120 | 121 | This workflow accepts either FASTQ or BAM files as input. 122 | 123 | The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. 
124 | 125 | ``` 126 | (i) (ii) (iii) 127 | input_reads.fastq ─── input_directory ─── input_directory 128 | ├── reads0.fastq ├── barcode01 129 | └── reads1.fastq │ ├── reads0.fastq 130 | │ └── reads1.fastq 131 | ├── barcode02 132 | │ ├── reads0.fastq 133 | │ ├── reads1.fastq 134 | │ └── reads2.fastq 135 | └── barcode03 136 | └── reads0.fastq 137 | ``` 138 | 139 | 140 | 141 | ## Input parameters 142 | 143 | ### Input Options 144 | 145 | | Nextflow parameter name | Type | Description | Help | Default | 146 | |--------------------------|------|-------------|------|---------| 147 | | fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 148 | | bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 149 | | classifier | string | Kraken2 or Minimap2 workflow to be used for classification of reads. | Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info. 
| minimap2 | 150 | | analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False | 151 | | exclude_host | string | A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis. | | | 152 | 153 | 154 | ### Sample Options 155 | 156 | | Nextflow parameter name | Type | Description | Help | Default | 157 | |--------------------------|------|-------------|------|---------| 158 | | sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed. | | 159 | | sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. | | | 160 | 161 | 162 | ### Reference Options 163 | 164 | | Nextflow parameter name | Type | Description | Help | Default | 165 | |--------------------------|------|-------------|------|---------| 166 | | database_set | string | Sets the reference, databases and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. Workflow will require memory available to be slightly higher than the size of the database. | This setting is overridable by providing an explicit taxonomy, database or reference path in the other reference options. | ncbi_16s_18s | 167 | | database | string | Not required but can be used to specifically override Kraken2 database [.tar.gz or Directory]. | By default uses database chosen in database_set parameter. 
| | 168 | | taxonomy | string | Not required but can be used to specifically override taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory]. | By default NCBI taxonomy file will be downloaded and used. | | 169 | | reference | string | Override the FASTA reference file selected by the database_set parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index. | This option should be used in conjunction with the database parameter to specify a custom database. | | 170 | | ref2taxid | string | Not required but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row. | By default uses ref2taxid for option chosen in database_set parameter. | | 171 | | taxonomic_rank | string | Returns results at the taxonomic rank chosen. In the Kraken2 pipeline, this sets the level that Bracken will estimate abundance at. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species). | | G | 172 | 173 | 174 | ### Kraken2 Options 175 | 176 | | Nextflow parameter name | Type | Description | Help | Default | 177 | |--------------------------|------|-------------|------|---------| 178 | | bracken_length | integer | Set the length value Bracken will use | Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically. ncbi_16s_18s = 1000 , ncbi_16s_18s_28s_ITS = 1000 , PlusPF-8 = 300 | | 179 | | bracken_threshold | integer | Set the minimum read threshold Bracken will use to consider a taxon | Bracken will only consider taxa with a read count greater than or equal to this value. | 10 | 180 | | kraken2_memory_mapping | boolean | Avoids loading database into RAM | Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. 
It may be useful if the available RAM memory is lower than the size of the chosen database. | False | 181 | | kraken2_confidence | number | Kraken2 Confidence score threshold. Default: 0.0. Valid interval: 0-1 | Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works. | 0.0 | 182 | 183 | 184 | ### Minimap2 Options 185 | 186 | | Nextflow parameter name | Type | Description | Help | Default | 187 | |--------------------------|------|-------------|------|---------| 188 | | minimap2filter | string | Filter output of minimap2 by taxids inc. child nodes, E.g. "9606,1404" | Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs. | | 189 | | minimap2exclude | boolean | Invert minimap2filter and exclude the given taxids instead | Exclude a list of taxids from analysis outputs. | False | 190 | | keep_bam | boolean | Copy bam files into the output directory. | | False | 191 | | minimap2_by_reference | boolean | Add a table with the mean sequencing depth per reference, standard deviation and coefficient of variation. It adds a scatterplot of the sequencing depth vs. the coverage and a heatmap showing the depth per percentile to the report | | False | 192 | | min_percent_identity | number | Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified. | | 95 | 193 | | min_ref_coverage | number | Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths. 
| | 90 | 193 | 194 | 195 | 196 | ### Report Options 197 | 198 | | Nextflow parameter name | Type | Description | Help | Default | 199 | |--------------------------|------|-------------|------|---------| 200 | | abundance_threshold | number | Remove those taxa whose abundance is equal to or lower than the chosen value. | To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0-1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number larger than or equal to 1. | 1 | 201 | | n_taxa_barplot | integer | Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the "Other" category. | | 9 | 202 | 203 | 204 | ### Output Options 205 | 206 | | Nextflow parameter name | Type | Description | Help | Default | 207 | |--------------------------|------|-------------|------|---------| 208 | | out_dir | string | Directory for output of all user-facing files. | | output | 209 | | igv | boolean | Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index. | | False | 210 | | include_read_assignments | boolean | Output a per-sample TSV file that indicates the taxonomy assigned to each sequence. These files will only be output on completion of the workflow. | | False | 211 | | output_unclassified | boolean | Output a FASTQ of the unclassified reads. | | False | 212 | 213 | 214 | ### Advanced Options 215 | 216 | | Nextflow parameter name | Type | Description | Help | Default | 217 | |--------------------------|------|-------------|------|---------| 218 | | min_len | integer | Specify read length lower limit. | Any reads shorter than this limit will not be included in the analysis. | 800 | 219 | | min_read_qual | number | Specify read quality lower limit.
| Any reads with a quality lower than this limit will not be included in the analysis. | | 220 | | max_len | integer | Specify read length upper limit. | Any reads longer than this limit will not be included in the analysis. | 2000 | 221 | | threads | integer | Maximum number of CPU threads to use in each parallel workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. | 4 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | ## Outputs 229 | 230 | Output files may be aggregated, including information for all samples, or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}. 231 | 232 | | Title | File path | Description | Per sample or aggregated | 233 | |-------|-----------|-------------|--------------------------| 234 | | workflow report | wf-16s-report.html | Report for all samples. | aggregated | 235 | | Abundance table with counts per taxa | abundance_table_{{ taxonomic_rank }}.tsv | Per-taxa counts TSV, including all samples. | aggregated | 236 | | Bracken report file | bracken/{{ alias }}.kraken2_bracken.report | TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report). | per-sample | 237 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.report.txt | Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format). | per-sample | 238 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.assignments.tsv | TSV file with the taxonomic assignment per read. More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format).
| per-sample | 239 | | Host BAM file | host_bam/{{ alias }}.bam | BAM file generated from mapping filtered input reads to the host reference. | per-sample | 240 | | BAM index file of host reads | host_bam/{{ alias }}.bai | BAM index file generated from mapping filtered input reads to the host reference. | per-sample | 241 | | BAM file (minimap2) | bams/{{ alias }}.reference.bam | BAM file generated from mapping filtered input reads to the reference. | per-sample | 242 | | BAM index file (minimap2) | bams/{{ alias }}.reference.bam.bai | Index file generated from mapping filtered input reads to the reference. | per-sample | 243 | | BAM flagstat (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv | Mapping results per reference | per-sample | 244 | | Minimap2 alignment statistics (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz | Per read stats after aligning | per-sample | 245 | | Reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz | Reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 246 | | Index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.fai | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 247 | | GZI index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.gzi | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 248 | | JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference. | aggregated | 249 | | Taxonomic assignment per read. | reads_assignments/{{ alias }}.*.assignments.tsv | TSV file with the taxonomic assignment per read. | per-sample | 250 | | FASTQ of the selected taxids. 
| extracted/{{ alias }}.minimap2.extracted.fastq | FASTQ containing/excluding the reads of the selected taxids. | per-sample | 251 | | Unclassified FASTQ. | unclassified/{{ alias }}.unclassified.fq.gz | FASTQ containing the reads that have not been classified against the database. | per-sample | 252 | | Alignment statistics TSV | alignment_tables/{{ alias }}.alignment-stats.tsv | Coverage and taxonomy of each reference. | per-sample | 253 | 254 | 255 | 256 | 257 | ## Pipeline overview 258 | 259 | 260 | ### Workflow defaults and parameters 261 | The workflow sets default values for parameters optimised for the analysis of full-length 16S rRNA gene amplicons, including `min_len`, `max_len`, `min_ref_coverage`, and `min_percent_identity`. 262 | Descriptions of the parameters and their defaults can be found in the [input parameters section](#input-parameters). 263 | 264 | #### Analysing ITS amplicons 265 | For analysis of ITS amplicons users should adjust the following parameters: 266 | - `min_len` should be decreased to 300, as ITS amplicons may be shorter than the current `min_len` default value which will cause them to be excluded. 267 | - `database_set` should be changed to `ncbi_16s_18s_28s_ITS` or a [custom database](#faqs) containing the relevant ITS references. 268 | 269 | ### 1. Concatenate input files and generate per read stats 270 | 271 | [fastcat](https://github.com/epi2me-labs/fastcat) is used to concatenate input FASTQ files prior to downstream processing of the workflow. It will also output per-read stats including read lengths and average qualities. 272 | 273 | You may want to choose which reads are analysed by filtering them using the flags `max_len`, `min_len` and `min_read_qual`. 274 | 275 | ### 2. Remove host sequences (optional) 276 | 277 | We have included an optional filtering step to remove any host sequences that map (using [Minimap2](https://github.com/lh3/minimap2)) against a provided host reference (e.g. 
human), which can be a FASTA file or an MMI index. To use this option, provide the path to your host reference with the `exclude_host` parameter. The mapped reads are output in a BAM file and excluded from further analysis. 278 | 279 | ``` 280 | nextflow run epi2me-labs/wf-16s --fastq test_data/case04/reads.fastq.gz --exclude_host test_data/case04/host.fasta.gz 281 | ``` 282 | 283 | ### 3. Classify reads taxonomically 284 | 285 | There are two different approaches to taxonomic classification: 286 | 287 | #### 3.1 Using Minimap2 288 | 289 | [Minimap2](https://github.com/lh3/minimap2) provides better resolution but, depending on the reference database used, can take significantly more time. This is the default option. 290 | 291 | ``` 292 | nextflow run epi2me-labs/wf-16s --fastq test_data/case01 --classifier minimap2 293 | ``` 294 | 295 | The creation of alignment statistics plots can be enabled with the `minimap2_by_reference` flag. Using this option produces a table and scatter plot in the report showing sequencing depth and coverage of each reference. The report also contains a heatmap indicating the sequencing depth over relative genomic coordinates for the references with the highest coverage (references with a mean coverage of less than 1% of the one with the largest value are omitted). 296 | 297 | In addition, the user can output BAM files in a folder called `bams` by using the `keep_bam` option. If the user provides a custom database and uses the `igv` option, the workflow will also output the references with read mappings, as well as an IGV configuration file. This configuration file allows the user to view the alignments in the EPI2ME Desktop Application in the Viewer tab. Note that the number of references can be reduced with the `abundance_threshold` option, which selects only those references with more reads aligned than this value. Keep in mind that the alignment view is highly dependent on the reference selected.
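Putting these options together, BAM output, per-reference statistics and the IGV files could be requested with a command of the following shape (this example reuses the demo data from the install section; adjust the input path for your own data):

```
nextflow run epi2me-labs/wf-16s \
    --fastq wf-16s-demo/test_data \
    --classifier minimap2 \
    --keep_bam \
    --minimap2_by_reference \
    --igv
```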
298 | 299 | #### 3.2 Using Kraken2 300 | 301 | [Kraken2](https://github.com/DerrickWood/kraken2) provides the fastest method for the taxonomic classification of the reads. [Bracken](https://github.com/jenniferlu717/Bracken) is then used to estimate the abundance in the sample at genus level (or the selected taxonomic rank). 302 | 303 | ### 4. Output 304 | 305 | The main output of the wf-16s pipeline is the `wf-16s-report.html` file, which can be found in the output directory. It contains a summary of read statistics, the taxonomic composition of the sample and some diversity metrics. The results shown in the report can also be customised with several options. For example, you can use `abundance_threshold` to remove all taxa less prevalent than the threshold from the abundance table. When setting this parameter to a natural number, taxa with fewer absolute counts are removed. You can also pass a decimal between 0.0-1.0 to drop taxa of lower relative abundance. Furthermore, `n_taxa_barplot` controls the number of taxa displayed in the bar plot and groups the rest under the category ‘Other’. 306 | 307 | You can use the `include_read_assignments` flag to output a per-sample TSV file indicating how each input sequence was classified and the taxon assigned to each read. 308 | 309 | For more information about the remaining workflow outputs, please see [minimap2 Options](#minimap2-options). 310 | 311 | ### 5. Diversity indices 312 | 313 | Species diversity refers to the taxonomic composition of a specific microbial community. There are some useful concepts to take into account: 314 | * Richness: the number of unique taxonomic groups present in the community. 315 | * Taxonomic group abundance: the number of individuals of a particular taxonomic group present in the community. 316 | * Evenness: the equitability of the different taxonomic groups in terms of their abundances.
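To make richness and evenness concrete, here is a small Python sketch (the community counts are invented for illustration) that computes the Shannon diversity and Pielou evenness indices defined below for two communities sharing the same richness:

```python
import math

def shannon(counts):
    """Shannon diversity index H = -sum(p_i * ln(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou(counts):
    """Pielou evenness J = H / ln(S), where S is the richness."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)

# Two invented communities with identical richness (4 taxa each)
# but very different evenness.
even = [25, 25, 25, 25]   # J = 1.0, maximum evenness
uneven = [97, 1, 1, 1]    # J ~ 0.12, one dominant taxon
print(round(pielou(even), 2), round(pielou(uneven), 2))
```

Both communities contain four taxa, so their richness is identical, yet the dominant taxon in the second community drives its evenness towards zero.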
317 | Two different communities can host the same number of different taxonomic groups (i.e. they have the same richness) but still have different evenness, for instance if one taxon is much more abundant in one community than in the other. 318 | 319 | There are three types of biodiversity measures described over a spatial scale [1](https://doi.org/10.2307/1218190), [2](https://doi.org/10.1016/B978-0-12-384719-5.00036-8): alpha-, beta-, and gamma-diversity. 320 | * Alpha-diversity refers to the richness that occurs within a community in a given area within a region. 321 | * Beta-diversity, defined as the variation in the identities of species among sites, provides a direct link between biodiversity at local scales (alpha diversity) and the broader regional species pool (gamma diversity). 322 | * Gamma-diversity is the total observed richness within an entire region. 323 | 324 | To provide a quick overview of the alpha-diversity of the microbial community, we provide some of the most common diversity metrics calculated for a specific taxonomic rank [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4224527/), which can be chosen by the user with the `taxonomic_rank` parameter ('D'=Domain, 'P'=Phylum, 'C'=Class, 'O'=Order, 'F'=Family, 'G'=Genus, 'S'=Species). By default, the rank is 'G' (genus-level). Some of the included alpha diversity metrics are: 325 | 326 | * Shannon Diversity Index (H): Shannon entropy approaches zero if a community is almost entirely made up of a single taxon. 327 | 328 | ```math 329 | H = -\sum_{i=1}^{S}p_i*ln(p_i) 330 | ``` 331 | 332 | * Simpson's Diversity Index (D): with this definition, values range from 0 (high diversity) to 1 (low diversity, i.e. a single dominant taxon). 333 | 334 | ```math 335 | D = \sum_{i=1}^{S}p_i^2 336 | ``` 337 | 338 | * Pielou Index (J): the values range from 0 (presence of a dominant species) to 1 (maximum evenness).
339 | 340 | ```math 341 | J = H/ln(S) 342 | ``` 343 | 344 | * Berger-Parker dominance index (BP): expresses the proportional importance of the most abundant type, i.e., the ratio of the number of individuals of the most abundant species to the total number of individuals of all species in the sample. 345 | 346 | ```math 347 | BP = n_i/N 348 | ``` 349 | where $`n_i`$ refers to the counts of the most abundant taxon and N is the total number of counts. 350 | 351 | 352 | * Fisher’s alpha: Fisher (see Fisher, 1943[4](https://doi.org/10.2307/1411)) noticed that only a few species tend to be abundant while most are represented by only a few individuals ('rare biosphere'). These differences in species abundance can be incorporated into species diversity measurements such as Fisher’s alpha, an index based upon the log-series distribution of the number of individuals across species. 353 | 354 | ```math 355 | S = \alpha * ln(1 + N/\alpha) 356 | ``` 357 | where S is the total number of taxa and N is the total number of individuals in the sample. The value of Fisher's $`\alpha`$ is calculated by iteration. 358 | 359 | These indices are calculated by default using the original abundance table (see McMurdie and Holmes[5](https://pubmed.ncbi.nlm.nih.gov/24699258/), 2014 and Willis[6](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02407/full), 2019). If you want to calculate them from a rarefied abundance table (i.e. one in which all samples have been subsampled to contain the same number of counts per sample, which is 95% of the minimum number of total counts), you can download the rarefied table from the report. 360 | 361 | The report also includes the rarefaction curve per sample, which displays the mean species richness for a subsample of reads (sample size).
Generally, this curve grows rapidly at first, as the most abundant species are sequenced and add new taxa to the community, and then flattens because 'rare' species are harder to sample, so further increases in the number of observed species become less likely. 362 | 363 | > Note: Within each rank, each named taxon is a unique unit. The counts are the number of reads assigned to that taxon. All `Unknown` sequences are considered a single taxon. 364 | 365 | 366 | 367 | 368 | ## Troubleshooting 369 | 370 | + If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine whether the issue is related to the environment, the input parameters or a bug. 371 | + See how to interpret some common Nextflow exit codes [here](https://labs.epi2me.io/trouble-shooting/). 372 | + When using the Minimap2 pipeline with a custom database, you must make sure that the `ref2taxid` file, the reference and the taxonomy database are consistent with each other. 373 | + If your device doesn't have the resources to use large Kraken2 databases, you can enable `kraken2_memory_mapping` to reduce the amount of memory required. 374 | + To enable the IGV viewer with a custom reference, the reference must be a FASTA file and not a minimap2 MMI format index. 375 | 376 | 377 | 378 | 379 | ## FAQs 380 | 381 | If your question is not answered here, please report any issues or suggestions on the [GitHub issues](https://github.com/epi2me-labs/wf-16s/issues) page or start a discussion on the [community](https://community.nanoporetech.com/). 382 | 383 | + *Which database is used by default?* - By default, the workflow uses the NCBI 16S + 18S rRNA database. It will be downloaded the first time the workflow is run and re-used in subsequent runs.
384 | 385 | + *Are more databases available?* - Other 16S databases (listed below) can be selected with the `database_set` parameter, but the workflow can also be used with a custom database if required (see [here](https://labs.epi2me.io/how-to-meta-offline/) for details). 386 | * 16S, 18S, ITS 387 | * ncbi_16s_18s and ncbi_16s_18s_28s_ITS: Archaeal, bacterial and fungal 16S/18S and ITS data. There are two databases available using the data from [NCBI](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/). 388 | * SILVA_138_1: The [SILVA](https://www.arb-silva.de/) database (version 138) is also available. Note that SILVA uses its own set of taxids, which do not match the NCBI taxids. We provide the respective taxdump files, but if you prefer using the NCBI ones, you can create them from the SILVA files ([NCBI](https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/ncbi/)). As the SILVA database uses genus level, the last taxonomic rank at which the analysis is carried out is genus (`taxonomic_rank G`). 389 | 390 | + *How can I use Kraken2 indexes?* - There are different databases available [here](https://benlangmead.github.io/aws-indexes/k2). 391 | 392 | + *How can I use custom databases?* - If you want to run the workflow using your own Kraken2 database, you'll need to provide the database and an associated taxonomy dump. For a custom Minimap2 reference database, you'll need to provide a reference FASTA (or MMI) and an associated ref2taxid file. For a guide on how to build and use custom databases, take a look at our [article on how to run wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 393 | 394 | + *How can I run the workflow with less memory?* - 395 | When running in Kraken2 mode, you can set the `kraken2_memory_mapping` parameter if the available memory is smaller than the size of the database.
396 | 397 | + *How can I run the workflow offline?* - To run wf-16s offline you can use the workflow to download the databases from the internet and prepare them for offline re-use later. If you want to use one of the databases supported out of the box by the workflow, you can run the workflow with your desired database and any input (for example, the test data). The database will be downloaded and prepared in a directory on your computer. Once the database has been prepared, it will be used automatically the next time you run the workflow without needing to be downloaded again. You can find advice on picking a suitable database in our [article on selecting databases for wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 398 | 399 | + *When and how are coverage and identity filters applied when using the minimap2 approach?* - With minimap2-based classification, coverage and identity filtering is applied by using the `min_ref_coverage` and `min_percent_identity` options respectively. All reads that mapped to a reference, but failed to pass these filters, are relabelled as unclassified. If the `include_read_assignments` option is used, tables in the output will show read classifications after this filtering step. However, the output BAM file always contains the raw minimap2 alignment results. To read more about both filters, see [minimap2 Options](#minimap2-options). 400 | 401 | 402 | 403 | 404 | 405 | ## Related blog posts 406 | 407 | + [How to build and use databases to run wf-metagenomics and wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 408 | + [Selecting the correct databases in the wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 409 | + [How to evaluate unclassified sequences](https://epi2me.nanoporetech.com/post-meta-analysis/) 410 | 411 | See the [EPI2ME website](https://labs.epi2me.io/) for lots of other resources and blog posts. 
412 | 413 | 414 | 415 | 416 | --------------------------------------------------------------------------------