├── bin ├── lib ├── data ├── test_data ├── docs │ ├── 01_brief_description.md │ ├── 02_introduction.md │ ├── 03_compute_requirements.md │ ├── 04_install_and_run.md │ ├── 05_related_protocols.md │ ├── 06_input_example.md │ ├── 06_input_parameters.md │ ├── 07_outputs.md │ ├── 08_pipeline_overview.md │ ├── 09_troubleshooting.md │ ├── 10_FAQ.md │ └── 11_other.md ├── .gitignore ├── .gitmodules ├── .github │ ├── workflows │ │ └── issue-autoreply.yml │ └── ISSUE_TEMPLATE │ ├── config.yml │ ├── question.yml │ ├── feature_request.yml │ └── bug_report.yml ├── .pre-commit-config.yaml ├── test │ ├── run_fastq_ingress_test.sh │ └── test_fastq_ingress.py ├── .gitlab-ci.yml ├── nextflow.config ├── output_definition.json ├── CHANGELOG.md ├── main.nf ├── LICENSE ├── nextflow_schema.json └── README.md /bin: -------------------------------------------------------------------------------- 1 | wf-metagenomics/bin -------------------------------------------------------------------------------- /lib: -------------------------------------------------------------------------------- 1 | wf-metagenomics/lib -------------------------------------------------------------------------------- /data: -------------------------------------------------------------------------------- 1 | wf-metagenomics/data -------------------------------------------------------------------------------- /test_data: -------------------------------------------------------------------------------- 1 | wf-metagenomics/test_data -------------------------------------------------------------------------------- /docs/01_brief_description.md: -------------------------------------------------------------------------------- 1 | Taxonomic classification of 16S rRNA gene sequencing data. 
-------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | nextflow 2 | .nextflow* 3 | template-workflow 4 | .*.swp 5 | .*.swo 6 | *.pyc 7 | *.pyo 8 | .DS_store 9 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "wf-metagenomics"] 2 | path = wf-metagenomics 3 | url = https://github.com/epi2me-labs/wf-metagenomics 4 | -------------------------------------------------------------------------------- /.github/workflows/issue-autoreply.yml: -------------------------------------------------------------------------------- 1 | name: Issue Auto-Reply 2 | 3 | on: 4 | issues: 5 | types: [opened] 6 | 7 | jobs: 8 | auto-reply: 9 | uses: epi2me-labs/.github/.github/workflows/issue-autoreply.yml@main 10 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | blank_issues_enabled: false 2 | contact_links: 3 | - name: Nanopore customer support 4 | url: https://nanoporetech.com/contact 5 | about: For general support, including bioinformatics questions. 6 | -------------------------------------------------------------------------------- /docs/03_compute_requirements.md: -------------------------------------------------------------------------------- 1 | Recommended requirements: 2 | 3 | + CPUs = 12 4 | + Memory = 32GB 5 | 6 | Minimum requirements: 7 | 8 | + CPUs = 6 9 | + Memory = 16GB 10 | 11 | Approximate run time: ~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database. 
12 | 13 | ARM processor support: True 14 | -------------------------------------------------------------------------------- /docs/11_other.md: -------------------------------------------------------------------------------- 1 | + [How to build and use databases to run wf-metagenomics and wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 2 | + [Selecting the correct databases in wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 3 | + [How to evaluate unclassified sequences](https://epi2me.nanoporetech.com/post-meta-analysis/). 4 | 5 | See the [EPI2ME website](https://labs.epi2me.io/) for many other resources and blog posts. 6 | -------------------------------------------------------------------------------- /docs/05_related_protocols.md: -------------------------------------------------------------------------------- 1 | This workflow is designed to take input sequences that have been produced by [Oxford Nanopore Technologies](https://nanoporetech.com/) devices using protocols associated with either of the kits listed below: 2 | 3 | - [SQK-MAB114.24](https://nanoporetech.com/document/microbial-amplicon-barcoding-sequencing-for-16s-and-its-sqk-mab114-24) 4 | - [SQK-16S114.24](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/rapid-sequencing-DNA-16s-barcoding-kit-v14-sqk-16114-24) 5 | 6 | Find related protocols in the [Nanopore community](https://community.nanoporetech.com/docs/). -------------------------------------------------------------------------------- /docs/09_troubleshooting.md: -------------------------------------------------------------------------------- 1 | + If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine whether the issue is related to the environment, the input parameters, or a bug. 2 | + See how to interpret some common Nextflow exit codes [here](https://labs.epi2me.io/trouble-shooting/). 
3 | + When using the Minimap2 pipeline with a custom database, you must make sure that the `ref2taxid` file, the reference and the taxonomy database are consistent with one another. 4 | + If your device doesn't have the resources to use large Kraken2 databases, you can enable `kraken2_memory_mapping` to reduce the amount of memory required. 5 | + To enable the IGV viewer with a custom reference, the reference must be a FASTA file, not a minimap2 MMI format index. 6 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/question.yml: -------------------------------------------------------------------------------- 1 | name: Question 2 | description: Ask a generic question about this project unrelated to features or bugs. 3 | labels: ["question"] 4 | body: 5 | - type: markdown 6 | attributes: 7 | value: | 8 | Please reserve this form for issues not related to bugs or feature requests. If our developers deem your questions to be related to bugs or features, you will be asked to fill in the appropriate form. 9 | - type: textarea 10 | id: question1 11 | attributes: 12 | label: Ask away! 13 | placeholder: | 14 | Bad question: How do I use this workflow in my HPC cluster? 15 | Good question: My HPC cluster uses a GridEngine scheduler. Can you point me to documentation for how to use your workflows to efficiently submit jobs to my cluster? 16 | validations: 17 | required: true 18 | -------------------------------------------------------------------------------- /docs/02_introduction.md: -------------------------------------------------------------------------------- 1 | This workflow can be used for the following: 2 | 3 | + Taxonomic classification of 16S rRNA, 18S rRNA and ITS amplicons using [default or custom databases](#faqs). Default databases: 4 | - NCBI targeted loci: 16S rDNA, 18S rDNA, ITS (ncbi_16s_18s, ncbi_16s_18s_28s_ITS; see [here](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) for details). 
5 | + Generate taxonomic profiles of one or more samples. 6 | 7 | The workflow's default parameters are optimised for the analysis of 16S rRNA gene amplicons. 8 | For ITS amplicons, it is strongly recommended that some parameters are changed from the defaults; please see the [ITS presets](#analysing-its-amplicons) section for more information. 9 | 10 | Additional features: 11 | + Two different approaches are available: `minimap2` (alignment-based, the default option) or `kraken2` (k-mer based). 12 | + Results include: 13 | - An abundance table with counts per taxon across all samples. 14 | - Interactive Sankey and sunburst plots to explore the different identified lineages. 15 | - A bar plot comparing the abundances of the most abundant taxa across all samples. 16 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.yml: -------------------------------------------------------------------------------- 1 | name: Feature request 2 | description: Suggest an idea for this project 3 | labels: ["feature request"] 4 | body: 5 | 6 | - type: textarea 7 | id: question1 8 | attributes: 9 | label: Is your feature related to a problem? 10 | placeholder: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 11 | validations: 12 | required: true 13 | - type: textarea 14 | id: question2 15 | attributes: 16 | label: Describe the solution you'd like 17 | placeholder: A clear and concise description of what you want to happen. 18 | validations: 19 | required: true 20 | - type: textarea 21 | id: question3 22 | attributes: 23 | label: Describe alternatives you've considered 24 | placeholder: A clear and concise description of any alternative solutions or features you've considered. 25 | validations: 26 | required: true 27 | - type: textarea 28 | id: question4 29 | attributes: 30 | label: Additional context 31 | placeholder: Add any other context about the feature request here. 
32 | validations: 33 | required: false 34 | 35 | -------------------------------------------------------------------------------- /docs/06_input_example.md: -------------------------------------------------------------------------------- 1 | This workflow accepts either FASTQ or BAM files as input. 2 | 3 | The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. 4 | 5 | ``` 6 | (i) (ii) (iii) 7 | input_reads.fastq ─── input_directory ─── input_directory 8 | ├── reads0.fastq ├── barcode01 9 | └── reads1.fastq │ ├── reads0.fastq 10 | │ └── reads1.fastq 11 | ├── barcode02 12 | │ ├── reads0.fastq 13 | │ ├── reads1.fastq 14 | │ └── reads2.fastq 15 | └── barcode03 16 | └── reads0.fastq 17 | ``` -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: local 3 | hooks: 4 | - id: docs_readme 5 | name: docs_readme 6 | entry: parse_docs -p docs -e .md -s 01_brief_description 02_introduction 03_compute_requirements 04_install_and_run 05_related_protocols 06_input_example 06_input_parameters 07_outputs 08_pipeline_overview 09_troubleshooting 10_FAQ 11_other -ot README.md -od output_definition.json -ns nextflow_schema.json 7 | language: python 8 | always_run: true 9 | pass_filenames: false 10 | additional_dependencies: 11 | - epi2melabs==0.0.58 12 | - repo: https://github.com/pycqa/flake8 13 | rev: 5.0.4 14 | hooks: 15 
| - id: flake8 16 | pass_filenames: false 17 | additional_dependencies: 18 | - flake8-rst-docstrings 19 | - flake8-docstrings 20 | - flake8-import-order 21 | - flake8-forbid-visual-indent 22 | - pep8-naming 23 | - flake8-no-types 24 | - flake8-builtins 25 | - flake8-absolute-import 26 | - flake8-print 27 | # avoid snowballstemmer>=3.0 as it causes flake8-docstrings to stop working [CW-6098] 28 | - snowballstemmer==2.2.0 29 | args: [ 30 | "bin", 31 | "--import-order-style=google", 32 | "--statistics", 33 | "--max-line-length=88", 34 | "--per-file-ignores=bin/workflow_glue/models/*:NT001", 35 | ] 36 | -------------------------------------------------------------------------------- /test/run_fastq_ingress_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -exo pipefail 3 | 4 | get-test_data-from-aws () { 5 | # get aws-cli 6 | curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" 7 | unzip -q awscliv2.zip 8 | 9 | # get test data 10 | aws/dist/aws s3 cp --recursive --quiet \ 11 | "$S3_TEST_DATA" \ 12 | test_data_from_S3 13 | } 14 | 15 | fastq=$1 16 | wf_output_dir=$2 17 | sample_sheet=$3 18 | 19 | # `fastq` and `wf_output_dir` are required 20 | if ! [[ $# -eq 2 || $# -eq 3 ]]; then 21 | echo "Provide 2 or 3 arguments!" 
>&2 22 | exit 1 23 | fi 24 | 25 | # get test data from s3 if required 26 | if [[ $fastq =~ ^s3:// ]]; then 27 | get-test_data-from-aws 28 | fastq="$PWD/test_data_from_S3/${fastq#*test_data/}" 29 | [[ -n $sample_sheet ]] && 30 | sample_sheet="$PWD/test_data_from_S3/${sample_sheet#*test_data/}" 31 | fi 32 | 33 | # add CWD if paths are relative 34 | [[ ( $fastq != /* ) ]] && fastq="$PWD/$fastq" 35 | [[ ( $wf_output_dir != /* ) ]] && wf_output_dir="$PWD/$wf_output_dir" 36 | [[ ( -n $sample_sheet ) && ( $sample_sheet != /* ) ]] && 37 | sample_sheet="$PWD/$sample_sheet" 38 | 39 | # add flags to parameters (need an array for `fastq` here as there might be spaces in 40 | # the filename) 41 | fastq=("--fastq" "$fastq") 42 | wf_output_dir="--wf-output-dir $wf_output_dir" 43 | [[ -n $sample_sheet ]] && sample_sheet="--sample_sheet $sample_sheet" 44 | 45 | # get container hash from config 46 | img_hash=$(grep 'common_sha.\?=' nextflow.config | grep -oE 'sha[0-9,a-f,A-F]+') 47 | 48 | # run test 49 | docker run -v "$PWD":"$PWD" \ 50 | ontresearch/wf-common:"$img_hash" \ 51 | python "$PWD"/test/test_fastq_ingress.py "${fastq[@]}" $wf_output_dir $sample_sheet 52 | -------------------------------------------------------------------------------- /docs/04_install_and_run.md: -------------------------------------------------------------------------------- 1 | 2 | These are instructions for installing and running the workflow on the command line. 3 | You can also access the workflow via the 4 | [EPI2ME Desktop application](https://labs.epi2me.io/downloads/). 5 | 6 | The workflow uses [Nextflow](https://www.nextflow.io/) to manage 7 | compute and software resources; 8 | therefore Nextflow will need to be 9 | installed before attempting to run the workflow. 10 | 11 | The workflow can currently be run using either 12 | [Docker](https://docs.docker.com/get-started/) 13 | or [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) 14 | to provide isolation of the required software. 
15 | Both methods are automated out of the box provided 16 | either Docker or Singularity is installed. 17 | This is controlled by the 18 | [`-profile`](https://www.nextflow.io/docs/latest/config.html#config-profiles) 19 | parameter as exemplified below. 20 | 21 | It is not required to clone or download the git repository 22 | in order to run the workflow. 23 | More information on running EPI2ME workflows can 24 | be found on our [website](https://labs.epi2me.io/wfindex). 25 | 26 | The following command can be used to obtain the workflow. 27 | This will pull the repository into the assets folder of 28 | Nextflow and provide a list of all parameters 29 | available for the workflow as well as an example command: 30 | 31 | ``` 32 | nextflow run epi2me-labs/wf-16s --help 33 | ``` 34 | To update a workflow to the latest version on the command line, use 35 | the following command: 36 | ``` 37 | nextflow pull epi2me-labs/wf-16s 38 | ``` 39 | 40 | A demo dataset is provided for testing of the workflow. 41 | It can be downloaded and unpacked using the following commands: 42 | ``` 43 | wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz 44 | tar -xzvf wf-16s-demo.tar.gz 45 | ``` 46 | The workflow can then be run with the downloaded demo data using: 47 | ``` 48 | nextflow run epi2me-labs/wf-16s \ 49 | --fastq 'wf-16s-demo/test_data' \ 50 | --minimap2_by_reference \ 51 | -profile standard 52 | ``` 53 | 54 | For further information about running a workflow on 55 | the command line, see https://labs.epi2me.io/wfquickstart/ 56 | -------------------------------------------------------------------------------- /docs/10_FAQ.md: -------------------------------------------------------------------------------- 1 | If your question is not answered here, please report any issues or suggestions on the [GitHub issues](https://github.com/epi2me-labs/wf-16s/issues) page or start a discussion on the [community](https://community.nanoporetech.com/). 
2 | 3 | + *Which database is used by default?* - By default, the workflow uses the NCBI 16S + 18S rRNA database. It will be downloaded the first time the workflow is run and re-used in subsequent runs. 4 | 5 | + *Are more databases available?* - Other 16S databases (listed below) can be selected with the `database_set` parameter, but the workflow can also be used with a custom database if required (see [here](https://labs.epi2me.io/how-to-meta-offline/) for details). 6 | * 16S, 18S, ITS 7 | * ncbi_16s_18s and ncbi_16s_18s_28s_ITS: Archaeal, bacterial and fungal 16S/18S and ITS data. There are two databases available, built from [NCBI](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) data. 8 | * SILVA_138_1: The [SILVA](https://www.arb-silva.de/) database (version 138) is also available. Note that SILVA uses its own set of taxids, which do not match the NCBI taxids. We provide the respective taxdump files, but if you prefer using the NCBI ones, you can create them from the SILVA files ([NCBI](https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/ncbi/)). As the SILVA database only resolves taxonomy to genus level, the lowest taxonomic rank at which the analysis is carried out is genus (`taxonomic_rank G`). 9 | 10 | + *How can I use Kraken2 indexes?* - Prebuilt Kraken2 databases are available [here](https://benlangmead.github.io/aws-indexes/k2). 11 | 12 | + *How can I use custom databases?* - If you want to run the workflow using your own Kraken2 database, you'll need to provide the database and an associated taxonomy dump. For a custom Minimap2 reference database, you'll need to provide a reference FASTA (or MMI) and an associated ref2taxid file. For a guide on how to build and use custom databases, take a look at our [article on how to run wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 
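The custom-database answer above says a Minimap2 reference must come with a matching ref2taxid file, and the troubleshooting notes warn that the two have to be consistent. As a minimal, hypothetical sketch of what that consistency check could look like (this is not part of the workflow; the two-column, tab-separated `sequence_id<TAB>taxid` layout of the ref2taxid file is an assumption):

```python
# Sketch only: flag reference sequences that have no entry in a ref2taxid
# mapping. The two-column tab-separated ref2taxid layout is an assumption,
# not something taken from the workflow's own code.

def fasta_ids(lines):
    """Yield sequence IDs from FASTA header lines."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].split()[0]

def unmapped_ids(fasta_lines, ref2taxid_lines):
    """Return reference IDs that have no entry in the ref2taxid mapping."""
    mapped = {row.split("\t")[0] for row in ref2taxid_lines if row.strip()}
    return [sid for sid in fasta_ids(fasta_lines) if sid not in mapped]

# Toy example: seqB has no taxid entry, so it would be reported.
fasta = [">seqA some description", "ACGT", ">seqB", "GGCC"]
ref2taxid = ["seqA\t1280"]
print(unmapped_ids(fasta, ref2taxid))  # -> ['seqB']
```

Any ID reported this way means the reference and ref2taxid files are inconsistent, which is exactly the situation a custom Minimap2 database needs to avoid before a run is launched.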
13 | 14 | + *How can I run the workflow with less memory?* - 15 | When running the Kraken2 pipeline, you can enable the `kraken2_memory_mapping` parameter if the available memory is smaller than the size of the database. 16 | 17 | + *How can I run the workflow offline?* - To run wf-16s offline, you can use the workflow to download the databases from the internet and prepare them for offline re-use later. If you want to use one of the databases supported out of the box by the workflow, you can run the workflow with your desired database and any input (for example, the test data). The database will be downloaded and prepared in a directory on your computer. Once the database has been prepared, it will be used automatically the next time you run the workflow without needing to be downloaded again. You can find advice on picking a suitable database in our [article on selecting databases for wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 18 | 19 | + *When and how are coverage and identity filters applied when using the minimap2 approach?* - With minimap2-based classification, coverage and identity filtering are applied using the `min_ref_coverage` and `min_percent_identity` options respectively. All reads that mapped to a reference but failed to pass these filters are relabelled as unclassified. If the `include_read_assignments` option is used, tables in the output will show read classifications after this filtering step. However, the output BAM file always contains the raw minimap2 alignment results. To read more about both filters, see [minimap2 Options](#minimap2-options). 20 | 21 | -------------------------------------------------------------------------------- /docs/07_outputs.md: -------------------------------------------------------------------------------- 1 | Output files may be aggregated, including information for all samples, or provided per sample. Per-sample files will be prefixed with their respective aliases, represented below as {{ alias }}. 
2 | 3 | | Title | File path | Description | Per sample or aggregated | 4 | |-------|-----------|-------------|--------------------------| 5 | | workflow report | wf-16s-report.html | Report for all samples. | aggregated | 6 | | Abundance table with counts per taxon | abundance_table_{{ taxonomic_rank }}.tsv | Per-taxon counts TSV, including all samples. | aggregated | 7 | | Bracken report file | bracken/{{ alias }}.kraken2_bracken.report | TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report). | per-sample | 8 | | Kraken2 report file (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.report.txt | Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format). | per-sample | 9 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.assignments.tsv | TSV file with the taxonomic assignment per read. More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format). | per-sample | 10 | | Host BAM file | host_bam/{{ alias }}.bam | BAM file generated from mapping filtered input reads to the host reference. | per-sample | 11 | | BAM index file of host reads | host_bam/{{ alias }}.bai | BAM index file generated from mapping filtered input reads to the host reference. | per-sample | 12 | | BAM file (minimap2) | bams/{{ alias }}.reference.bam | BAM file generated from mapping filtered input reads to the reference. | per-sample | 13 | | BAM index file (minimap2) | bams/{{ alias }}.reference.bam.bai | Index file generated from mapping filtered input reads to the reference. 
| per-sample | 14 | | BAM flagstat (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv | Mapping results per reference. | per-sample | 15 | | Minimap2 alignment statistics (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz | Per-read statistics after alignment. | per-sample | 16 | | Reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz | Reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 17 | | Index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.fai | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 18 | | GZI index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.gzi | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 19 | | JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference. | aggregated | 20 | | Taxonomic assignment per read | reads_assignments/{{ alias }}.*.assignments.tsv | TSV file with the taxonomic assignment per read. | per-sample | 21 | | FASTQ of the selected taxids | extracted/{{ alias }}.minimap2.extracted.fastq | FASTQ containing/excluding the reads of the selected taxids. | per-sample | 22 | | Unclassified FASTQ | unclassified/{{ alias }}.unclassified.fq.gz | FASTQ containing the reads that have not been classified against the database. | per-sample | 23 | | Alignment statistics TSV | alignment_tables/{{ alias }}.alignment-stats.tsv | Coverage and taxonomy of each reference. 
| per-sample | 24 | -------------------------------------------------------------------------------- /.gitlab-ci.yml: -------------------------------------------------------------------------------- 1 | # Include shared CI 2 | include: 3 | - project: "epi2melabs/ci-templates" 4 | file: "wf-containers.yaml" 5 | 6 | variables: 7 | NF_BEFORE_SCRIPT: mkdir -p ${CI_PROJECT_NAME}/data/ && wget -O ${CI_PROJECT_NAME}/data/wf-16s-demo.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz && tar -xzvf ${CI_PROJECT_NAME}/data/wf-16s-demo.tar.gz -C ${CI_PROJECT_NAME}/data/ 8 | NF_WORKFLOW_OPTS: "--fastq ${CI_PROJECT_NAME}/data/wf-16s-demo/test_data/ 9 | --classifier minimap2 10 | --minimap2_by_reference 11 | --database_set ncbi_16s_18s" 12 | PYTEST_CONTAINER_NAME: "wf-metagenomics" 13 | NF_IGNORE_PROCESSES: "rebatchFastq" 14 | GIT_SUBMODULE_STRATEGY: recursive 15 | CI_FLAVOUR: "new" 16 | CWG_AWS_ENV_NAME: "stack" 17 | 18 | aws-run: 19 | variables: 20 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --store_dir s3://$${XAWS_BUCKET}/${CI_PROJECT_NAME}/store" 21 | NF_IGNORE_PROCESSES: "rebatchFastq" 22 | artifacts: 23 | when: always 24 | paths: 25 | - ${CI_PROJECT_NAME} 26 | - .nextflow.log 27 | exclude: [] # give me everything pal 28 | allow_failure: false 29 | 30 | 31 | docker-run: 32 | 33 | # Remove this directive in downstream templates 34 | tags: 35 | - large_ram 36 | 37 | # Define a 1D job matrix to inject a variable named MATRIX_NAME into 38 | # the CI environment, we can use the value of MATRIX_NAME to determine 39 | # which options to apply as part of the rules block below 40 | # NOTE There is a slightly cleaner way to define this matrix to include 41 | # the variables, but it is broken when using long strings! 
See CW-756 42 | parallel: 43 | matrix: 44 | - MATRIX_NAME: [ 45 | "kraken2", "minimap2", "minimap2-sample-sheet", 46 | "kraken2-bam", "minimap2-bam"] 47 | 48 | rules: 49 | - if: ($CI_COMMIT_BRANCH == null || $CI_COMMIT_BRANCH == "dev-template") 50 | when: never 51 | - if: $MATRIX_NAME == "kraken2" 52 | variables: 53 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/kraken_pipeline.nf" 54 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --classifier kraken2 --include_read_assignments" 55 | NF_IGNORE_PROCESSES: "" 56 | AFTER_NEXTFLOW_CMD: > 57 | if [ ! -f $$PWD/$$CI_PROJECT_NAME/wf-16s-report.html ]; then (echo -e "Report not found" && exit 1); fi 58 | - if: $MATRIX_NAME == "minimap2" 59 | variables: 60 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 61 | NF_WORKFLOW_OPTS: "--fastq test_data/case01 --minimap2_by_reference --keep_bam" 62 | NF_IGNORE_PROCESSES: "extractMinimap2Reads" 63 | AFTER_NEXTFLOW_CMD: > 64 | if [ ! -f $$PWD/$$CI_PROJECT_NAME/wf-16s-report.html ]; then (echo -e "Report not found" && exit 1); fi 65 | - if: $MATRIX_NAME == "minimap2-sample-sheet" 66 | variables: 67 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 68 | NF_WORKFLOW_OPTS: "--fastq test_data/case02 --sample_sheet test_data/case02/sample_sheet.csv --taxonomic_rank G --n_taxa_barplot 5 --abundance_threshold 1" 69 | NF_IGNORE_PROCESSES: "extractMinimap2Reads,getAlignmentStats" 70 | # BAM INGRESS 71 | # Compare counts with case01_no_duplicateIDs, must be the same 72 | - if: $MATRIX_NAME == "kraken2-bam" 73 | variables: 74 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/kraken_pipeline.nf" 75 | NF_WORKFLOW_OPTS: "--bam test_data/case05_bam --include_read_assignments --abundance_threshold 1 --classifier kraken2" 76 | NF_IGNORE_PROCESSES: "" 77 | ## Regular test minimap2 - mapping stats 78 | - if: $MATRIX_NAME == "minimap2-bam" 79 | variables: 80 | NF_PROCESS_FILES: "wf-metagenomics/subworkflows/minimap_pipeline.nf" 81 | NF_WORKFLOW_OPTS: "--bam 
test_data/case05_bam --minimap2_by_reference --database_set ncbi_16s_18s --classifier minimap2" 82 | NF_IGNORE_PROCESSES: "extractMinimap2Reads" 83 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.yml: -------------------------------------------------------------------------------- 1 | name: Bug Report 2 | description: File a bug report 3 | labels: ["triage"] 4 | body: 5 | - type: markdown 6 | attributes: 7 | value: | 8 | Thanks for taking the time to fill out this bug report! 9 | 10 | 11 | - type: markdown 12 | attributes: 13 | value: | 14 | # Background 15 | - type: dropdown 16 | id: os 17 | attributes: 18 | label: Operating System 19 | description: What operating system are you running? 20 | options: 21 | - Windows 10 22 | - Windows 11 23 | - macOS 24 | - Ubuntu 22.04 25 | - CentOS 7 26 | - Other Linux (please specify below) 27 | validations: 28 | required: true 29 | - type: input 30 | id: other-os 31 | attributes: 32 | label: Other Linux 33 | placeholder: e.g. Fedora 38 34 | - type: input 35 | id: version 36 | attributes: 37 | label: Workflow Version 38 | description: This is most easily found in the workflow output log 39 | placeholder: v1.2.3 40 | validations: 41 | required: true 42 | - type: dropdown 43 | id: execution 44 | attributes: 45 | label: Workflow Execution 46 | description: Where are you running the workflow? 47 | options: 48 | - EPI2ME Desktop (Local) 49 | - EPI2ME Desktop (Cloud) 50 | - Command line (Local) 51 | - Command line (Cluster) 52 | - Other (please describe) 53 | validations: 54 | required: true 55 | - type: input 56 | id: other-workflow-execution 57 | attributes: 58 | label: Other workflow execution 59 | description: If "Other", please describe 60 | placeholder: Tell us where / how you are running the workflow. 61 | 62 | - type: markdown 63 | attributes: 64 | value: | 65 | # EPI2ME Desktop Application 66 | If you are using the application please provide the following. 
67 | - type: input 68 | id: labs-version 69 | attributes: 70 | label: EPI2ME Version 71 | description: Available from the application settings page. 72 | placeholder: v5.1.1 73 | validations: 74 | required: false 75 | 76 | 77 | - type: markdown 78 | attributes: 79 | value: | 80 | # Command-line execution 81 | If you are using nextflow on a command-line, please provide the following. 82 | - type: textarea 83 | id: cli-command 84 | attributes: 85 | label: CLI command run 86 | description: Please tell us the command you are running 87 | placeholder: e.g. nextflow run epi2me-labs/wf-human-variations -profile standard --fastq my-reads/fastq 88 | validations: 89 | required: false 90 | - type: dropdown 91 | id: profile 92 | attributes: 93 | label: Workflow Execution - CLI Execution Profile 94 | description: Which execution profile are you using? If you are using a custom profile or nextflow configuration, please give details below. 95 | options: 96 | - standard (default) 97 | - singularity 98 | - custom 99 | validations: 100 | required: false 101 | 102 | 103 | - type: markdown 104 | attributes: 105 | value: | 106 | # Report details 107 | - type: textarea 108 | id: what-happened 109 | attributes: 110 | label: What happened? 111 | description: Also tell us, what did you expect to happen? 112 | placeholder: Tell us what you see! 113 | validations: 114 | required: true 115 | - type: textarea 116 | id: logs 117 | attributes: 118 | label: Relevant log output 119 | description: For CLI execution please include the full output from running nextflow. For execution from the EPI2ME application please copy the contents of the "Workflow logs" panel from the "Logs" tab corresponding to your workflow instance. (This will be automatically formatted into code, so no need for backticks). 
120 | render: shell 121 | validations: 122 | required: true 123 | - type: textarea 124 | id: activity-log 125 | attributes: 126 | label: Application activity log entry 127 | description: For use with the EPI2ME application please see the Settings > View Activity Log page, and copy the contents of any items listed in red using the Copy to clipboard button. 128 | render: shell 129 | validations: 130 | required: false 131 | - type: dropdown 132 | id: run-demo 133 | attributes: 134 | label: Were you able to successfully run the latest version of the workflow with the demo data? 135 | description: For CLI execution, were you able to successfully run the workflow using the demo data available in the [Install and run](./README.md#install-and-run) section of the `README.md`? For execution in the EPI2ME application, were you able to successfully run the workflow via the "Use demo data" button? 136 | options: 137 | - 'yes' 138 | - 'no' 139 | - other (please describe below) 140 | validations: 141 | required: true 142 | - type: textarea 143 | id: demo-other 144 | attributes: 145 | label: Other demo data information 146 | render: shell 147 | validations: 148 | required: false 149 | 150 | -------------------------------------------------------------------------------- /nextflow.config: -------------------------------------------------------------------------------- 1 | // 2 | // Notes to End Users. 3 | // 4 | // The workflow should run without editing this configuration file, 5 | // however there may be instances in which you wish to edit this 6 | // file for compute performance or other reasons. Please see: 7 | // 8 | // https://nextflow.io/docs/latest/config.html#configuration 9 | // 10 | // for further help editing this file. 
11 | 12 | 13 | params { 14 | help = false 15 | version = false 16 | fastq = null 17 | bam = null 18 | sample = null 19 | sample_sheet = null 20 | classifier = "minimap2" 21 | exclude_host = null 22 | // Advanced_options 23 | max_len = 2000 24 | min_len = 800 25 | min_read_qual = null 26 | threads = 4 27 | // Databases 28 | taxonomy = null 29 | reference = null 30 | ref2taxid = null 31 | database = null 32 | taxonomic_rank = 'G' 33 | // Minimap 34 | minimap2filter = null 35 | minimap2exclude = false 36 | keep_bam = false 37 | minimap2_by_reference = false 38 | min_percent_identity = 95 39 | min_ref_coverage = 90 40 | // Output 41 | store_dir = "store_dir" 42 | out_dir = "output" 43 | include_read_assignments = false 44 | // Extra features 45 | igv = false 46 | output_unclassified = false 47 | // Kraken 48 | bracken_length = null 49 | bracken_threshold = 10 50 | kraken2_memory_mapping = false 51 | kraken2_confidence = 0 52 | // Databases 53 | database_set = "ncbi_16s_18s" 54 | database_sets = [ 55 | 'ncbi_16s_18s': [ 56 | 'reference': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_16s_18s.fna', 57 | // database already includes kmer_dist_file 58 | 'database': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ncbi_targeted_loci_kraken2.tar.gz', 59 | 'ref2taxid': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s/ref2taxid.targloci.tsv', 60 | 'taxonomy': 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2025-01-01.zip' 61 | ], 62 | 'ncbi_16s_18s_28s_ITS': [ 63 | 'reference': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS.fna', 64 | 'database': 'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ncbi_16s_18s_28s_ITS_kraken2.tar.gz', 65 | 'ref2taxid': 
'https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-metagenomics/ncbi_16s_18s_28s_ITS/ref2taxid.ncbi_16s_18s_28s_ITS.tsv', 66 | 'taxonomy': 'https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2025-01-01.zip' 67 | ], 68 | 'SILVA_138_1': [ 69 | // It uses the taxids from the Silva database, which don't match the taxids from NCBI 70 | // Database created from scratch using the kraken2-build command. It automatically downloads the files. 71 | 'database': null 72 | ] 73 | ] 74 | // Report options 75 | abundance_threshold = 1 76 | n_taxa_barplot = 9 77 | // AMR options 78 | amr = false 79 | amr_db = "resfinder" 80 | amr_minid = 80 81 | amr_mincov = 80 82 | // AWS 83 | aws_image_prefix = null 84 | aws_queue = null 85 | // Other options 86 | disable_ping = false 87 | monochrome_logs = false 88 | validate_params = true 89 | show_hidden_params = false 90 | analyse_unclassified = false 91 | schema_ignore_params = 'show_hidden_params,validate_params,monochrome_logs,aws_queue,aws_image_prefix,wf,database_sets,amr,amr_db,amr_minid,amr_mincov' 92 | 93 | // Workflow images 94 | wf { 95 | example_cmd = [ 96 | "--fastq 'wf-16s-demo/test_data'", 97 | "--minimap2_by_reference" 98 | ] 99 | agent = null 100 | container_sha = "sha1d71a4d15f57a1c32aacdb94cacdeb268205548e" 101 | common_sha = "sha72f3517dd994984e0e2da0b97cb3f23f8540be4b" 102 | } 103 | } 104 | 105 | 106 | manifest { 107 | name = 'epi2me-labs/wf-16s' 108 | author = 'Oxford Nanopore Technologies' 109 | homePage = 'https://github.com/epi2me-labs/wf-16s' 110 | description = 'Taxonomic classification of 16S rRNA gene sequencing data.' 111 | mainScript = 'main.nf' 112 | nextflowVersion = '>=23.04.2' 113 | version = 'v1.6.0' 114 | } 115 | 116 | 117 | epi2melabs { 118 | tags = "wf-16s,targeted,16S,18S,ITS,bacteria,fungi,metagenomics" 119 | } 120 | 121 | // used by default for "standard" (docker) and singularity profiles, 122 | // other profiles may override. 
123 | process { 124 | withLabel:wfmetagenomics { 125 | container = "ontresearch/wf-metagenomics:${params.wf.container_sha}" 126 | } 127 | withLabel:wf_common { 128 | container = "ontresearch/wf-common:${params.wf.common_sha}" 129 | } 130 | shell = ['/bin/bash', '-euo', 'pipefail'] 131 | } 132 | 133 | 134 | profiles { 135 | // the "standard" profile is used implicitly by nextflow 136 | // if no other profile is given on the CLI 137 | standard { 138 | docker { 139 | enabled = true 140 | // this ensures container is run as host user and group, but 141 | // also adds host user to the within-container group 142 | runOptions = "--user \$(id -u):\$(id -g) --group-add 100" 143 | } 144 | } 145 | 146 | // using singularity instead of docker 147 | singularity { 148 | singularity { 149 | enabled = true 150 | autoMounts = true 151 | } 152 | } 153 | 154 | 155 | conda { 156 | conda.enabled = true 157 | } 158 | 159 | // Using AWS batch. 160 | // May need to set aws.region and aws.batch.cliPath 161 | awsbatch { 162 | process { 163 | executor = 'awsbatch' 164 | queue = "${params.aws_queue}" 165 | memory = '16G' 166 | withLabel:wfmetagenomics { 167 | container = "${params.aws_image_prefix}-wf-metagenomics:${params.wf.container_sha}" 168 | } 169 | withLabel:wf_common { 170 | container = "${params.aws_image_prefix}-wf-common:${params.wf.common_sha}" 171 | } 172 | shell = ['/bin/bash', '-euo', 'pipefail'] 173 | } 174 | } 175 | 176 | // local profile for simplified development testing 177 | local { 178 | process.executor = 'local' 179 | } 180 | } 181 | 182 | 183 | timeline { 184 | enabled = true 185 | overwrite = true 186 | file = "${params.out_dir}/execution/timeline.html" 187 | } 188 | report { 189 | enabled = true 190 | overwrite = true 191 | file = "${params.out_dir}/execution/report.html" 192 | } 193 | trace { 194 | enabled = true 195 | overwrite = true 196 | file = "${params.out_dir}/execution/trace.txt" 197 | } 198 | 199 | env { 200 | PYTHONNOUSERSITE = 1 201 | JAVA_TOOL_OPTIONS 
= "-Xlog:disable -Xlog:all=warning:stderr" 202 | } 203 | -------------------------------------------------------------------------------- /output_definition.json: -------------------------------------------------------------------------------- 1 | { 2 | "files": { 3 | "workflow-report": { 4 | "filepath": "wf-16s-report.html", 5 | "title": "workflow report", 6 | "description": "Report for all samples.", 7 | "mime-type": "text/html", 8 | "optional": false, 9 | "type": "aggregated" 10 | }, 11 | "abundance-table-rank": { 12 | "filepath": "abundance_table_{{ taxonomic_rank }}.tsv", 13 | "title": "Abundance table with counts per taxon", 14 | "description": "Per-taxon counts TSV, including all samples.", 15 | "mime-type": "text/tab-separated-values", 16 | "optional": false, 17 | "type": "aggregated" 18 | }, 19 | "bracken-report": { 20 | "filepath": "bracken/{{ alias }}.kraken2_bracken.report", 21 | "title": "Bracken report file", 22 | "description": "TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report).", 23 | "mime-type": "text/tab-separated-values", 24 | "optional": true, 25 | "type": "per-sample" 26 | }, 27 | "kraken-report": { 28 | "filepath": "kraken2/{{ alias }}.kraken2.report.txt", 29 | "title": "Kraken2 taxonomic assignment per read (Kraken2 pipeline)", 30 | "description": "Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format).", 31 | "mime-type": "text/txt", 32 | "optional": true, 33 | "type": "per-sample" 34 | }, 35 | "kraken-assignments": { 36 | "filepath": "kraken2/{{ alias }}.kraken2.assignments.tsv", 37 | "title": "Kraken2 taxonomic assignment per read (Kraken2 pipeline)", 38 | "description": "TSV file with the taxonomic assignment per read. 
More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format).", 39 | "mime-type": "text/tab-separated-values", 40 | "optional": true, 41 | "type": "per-sample" 42 | }, 43 | "host-bam": { 44 | "filepath": "host_bam/{{ alias }}.bam", 45 | "title": "Host BAM file", 46 | "description": "BAM file generated from mapping filtered input reads to the host reference.", 47 | "mime-type": "application/gzip", 48 | "optional": true, 49 | "type": "per-sample" 50 | }, 51 | "host-bai": { 52 | "filepath": "host_bam/{{ alias }}.bai", 53 | "title": "BAM index file of host reads", 54 | "description": "BAM index file generated from mapping filtered input reads to the host reference.", 55 | "mime-type": "application/octet-stream", 56 | "optional": true, 57 | "type": "per-sample" 58 | }, 59 | "minimap2-bam": { 60 | "filepath": "bams/{{ alias }}.reference.bam", 61 | "title": "BAM file (minimap2)", 62 | "description": "BAM file generated from mapping filtered input reads to the reference.", 63 | "mime-type": "application/gzip", 64 | "optional": true, 65 | "type": "per-sample" 66 | }, 67 | "minimap2-index": { 68 | "filepath": "bams/{{ alias }}.reference.bam.bai", 69 | "title": "BAM index file (minimap2)", 70 | "description": "Index file generated from mapping filtered input reads to the reference.", 71 | "mime-type": "application/octet-stream", 72 | "optional": true, 73 | "type": "per-sample" 74 | }, 75 | "minimap2-flagstats": { 76 | "filepath": "bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv", 77 | "title": "BAM flagstat (minimap2)", 78 | "description": "Mapping results per reference", 79 | "mime-type": "text/tab-separated-values", 80 | "optional": true, 81 | "type": "per-sample" 82 | }, 83 | "minimap2-bamreadstats": { 84 | "filepath": "bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz", 85 | "title": "Minimap2 alignment statistics (minimap2)", 86 | "description": "Per read stats after 
aligning", 87 | "mime-type": "application/gzip", 88 | "optional": true, 89 | "type": "per-sample" 90 | }, 91 | "reduced-reference": { 92 | "filepath": "igv_reference/reduced_reference.fasta.gz", 93 | "title": "Reduced reference FASTA file", 94 | "description": "Reference FASTA file containing only those sequences that have reads mapped against them.", 95 | "mime-type": "application/gzip", 96 | "optional": true, 97 | "type": "aggregated" 98 | }, 99 | "reduced-reference-index": { 100 | "filepath": "igv_reference/reduced_reference.fasta.gz.fai", 101 | "title": "Index of the reduced reference FASTA file", 102 | "description": "Index of the reference FASTA file containing only those sequences that have reads mapped against them.", 103 | "mime-type": "text/tab-separated-values", 104 | "optional": true, 105 | "type": "aggregated" 106 | }, 107 | "reduced-reference-gzi-index": { 108 | "filepath": "igv_reference/reduced_reference.fasta.gz.gzi", 109 | "title": "GZI index of the reduced reference FASTA file", 110 | "description": "Index of the reference FASTA file containing only those sequences that have reads mapped against them.", 111 | "mime-type": "application/octet-stream", 112 | "optional": true, 113 | "type": "aggregated" 114 | }, 115 | "igv-config": { 116 | "filepath": "igv.json", 117 | "title": "JSON configuration file for IGV browser", 118 | "description": "JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference.", 119 | "mime-type": "text/json", 120 | "optional": true, 121 | "type": "aggregated" 122 | }, 123 | "read-assignments": { 124 | "filepath": "reads_assignments/{{ alias }}.*.assignments.tsv", 125 | "title": "Taxonomic assignment per read.", 126 | "description": "TSV file with the taxonomic assignment per read.", 127 | "mime-type": "text/tab-separated-values", 128 | "optional": true, 129 | "type": "per-sample" 130 | }, 131 | "extracted-fastq": { 132 | "filepath": "extracted/{{ alias }}.minimap2.extracted.fastq", 
133 | "title": "FASTQ of the selected taxids.", 134 | "description": "FASTQ containing/excluding the reads of the selected taxids.", 135 | "mime-type": "text", 136 | "optional": true, 137 | "type": "per-sample" 138 | }, 139 | "unclassified-fastq": { 140 | "filepath": "unclassified/{{ alias }}.unclassified.fq.gz", 141 | "title": "Unclassified FASTQ.", 142 | "description": "FASTQ containing the reads that have not been classified against the database.", 143 | "mime-type": "application/gzip", 144 | "optional": true, 145 | "type": "per-sample" 146 | }, 147 | "alignment-table": { 148 | "filepath": "alignment_tables/{{ alias }}.alignment-stats.tsv", 149 | "title": "Alignment statistics TSV", 150 | "description": "Coverage and taxonomy of each reference.", 151 | "mime-type": "text/tab-separated-values", 152 | "optional": true, 153 | "type": "per-sample" 154 | } 155 | } 156 | } -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog 2 | All notable changes to this project will be documented in this file. 3 | 4 | The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), 5 | and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 6 | 7 | 8 | ## [v1.6.0] 9 | This release of wf-16s updates documentation to include guidance for analysis of ITS amplicons with the SQK-MAB114 kit. Additionally, this version of wf-16s fixes issues with missing files and division by zero, which were triggered when input data coverage was very low. This release removes the real time analysis options to simplify the workflow; new solutions for real time taxonomic classification are in development but users who wish to continue using this functionality will need to pin wf-16s to v1.5.0. 
10 | ### Changed 11 | - Update wf-metagenomics to [v2.14.0](https://github.com/epi2me-labs/wf-metagenomics/blob/master/CHANGELOG.md#v2140): 12 | - Update wf-template to v5.6.2, which changes: 13 | - Reduce verbosity of debug logging from fastcat which can occasionally occlude errors found in FASTQ files during ingress. 14 | - Log banner art to say "EPI2ME" instead of "EPI2ME Labs" to match current branding. This has no effect on the workflow outputs. 15 | - pre-commit configuration to resolve an internal dependency problem with flake8. This has no effect on the workflow. 16 | - Values in the diversity table appear as None if there are no reads in the sample. 17 | - Values in the abundance table are now integers instead of floats. 18 | - Samples with fewer than 50% of the median read count across all samples are excluded from the rarefaction table. This is to avoid the rest of the samples being rarefied to a very low number of reads, which would lead to a loss of information. 19 | ### Added 20 | - Section in the README about presets for analysing ITS sequencing. 21 | ### Fixed 22 | - Update to wf-metagenomics [v2.14.0](https://github.com/epi2me-labs/wf-metagenomics/blob/master/CHANGELOG.md#v2140): 23 | - Update wf-template to v5.6.2, which fixes: 24 | - Sequence summary read length N50 incorrectly displayed minimum read length, it now correctly shows the N50. 25 | - Sequence summary component alignment and coverage plots failed to plot under some conditions. 26 | - Missing output file containing per-read assignments after identity and coverage filters when using include_read_assignments with the minimap2 subworkflow; this table is now correctly published to {alias}_lineages.minimap2.assignments.tsv. 27 | - Missing output file(s) encountered in the prepare_databases:determine_bracken_length process when using the bracken_length option. 28 | - Missing output file(s) encountered in the minimap_pipeline:getAlignmentStats process when all reads are unclassified. 
29 | - pandas.errors.EmptyDataError encountered in the getAlignmentStats process when reference coverage does not reach 1x 30 | - ZeroDivisionError: division by zero encountered in the progressive_bracken process when there are no taxa identified at all. 31 | - Versions of some tools were not properly displayed in the report. 32 | - raise ValueError("All objects passed were None") caused by all samples containing zero classified reads after applying bracken threshold. 33 | 34 | ### Removed 35 | - Real time functionality has been removed to simplify the workflow. The following parameters have been removed as they are no longer required: `server_threads`, `kraken_clients`, `port`, `host`, `external_kraken2`, `batch_size`, `real_time`, `read_limit`. Using these parameters in v1.6.0 onwards will cause an error. 36 | - Update image to remove kraken2-server dependency as it was only required by the real time workflow. 37 | 38 | ## [v1.5.0] 39 | ### Changed 40 | - Bump to wf-metagenomics v2.13.0 41 | - NCBI Taxonomy database updated to the 2025-01-01 release 42 | - Reconciled workflow with wf-template v5.5.0. 43 | - Fix error: bracken-build: line 231: syntax error: unexpected end of file when using SILVA database. 44 | ### Added 45 | - `output_unclassified` parameter. When True, output unclassified FASTQ sequences for both minimap2 and kraken2 modes (default: False). 46 | - Table with alignment stats is now an output: alignment_tables/{{ alias }}.alignment-stats.tsv 47 | 48 | ## [v1.4.0] 49 | ### Changed 50 | - Bump to wf-metagenomics v2.12.0 51 | ### Added 52 | - `bracken_threshold` parameter to adjust bracken minimum read threshold, default 10. 53 | 54 | ## [v1.3.0] 55 | ### Fixed 56 | - Switch to markdown links in the outputs table in the README. 57 | - Exclude samples if all the reads are removed during host depletion. 58 | ### Added 59 | - `igv` option to enable IGV in the EPI2ME Desktop Application. 
60 | - `include_read_assignments` option to output a file with the taxonomy of each read. 61 | - `Reads` section in the report to track the number of reads after filtering and host depletion, as well as unclassified reads. 62 | ### Changed 63 | - Bump to wf-metagenomics v2.11.0 64 | - `keep_bam` is now only required to output BAM files. 65 | - `include_kraken2_assignments` has been replaced by `include_read_assignments`. 66 | - Update databases: 67 | - Taxonomy database to the one released 2024-09-01 68 | ### Removed 69 | - `split-prefix` parameter, as the workflow automatically enables this option for large reference genomes. 70 | - Plot showing number of reads per sample has been replaced by a new table in `Reads` section. 71 | 72 | ## [v1.2.0] 73 | ### Added 74 | - Output IGV configuration file if the `keep_bam` option is enabled and a custom reference is provided (in minimap2 mode). 75 | - Output reduced reference file if the `keep_bam` option is enabled (in minimap2 mode). 76 | - `abundance_threshold` reduces the number of references to be displayed in IGV. 77 | ### Fixed 78 | - `exclude-host` can accept a file in the EPI2ME Desktop Application. 79 | ### Changed 80 | - Bump to wf-metagenomics v2.10.0 81 | 82 | ## [v1.1.3] 83 | ### Added 84 | - Reads below the percent identity (`min_percent_identity`) and reference coverage (`min_ref_coverage`) thresholds are considered unclassified in the minimap2 approach. 85 | ### Fixed 86 | - Files that are empty following the fastcat filtering are discarded from downstream analyses. 
87 | ### Changed 88 | - Bump to wf-metagenomics v2.9.4 89 | - `bam` folder within output has been renamed to `bams` 90 | 91 | ## [v1.1.2] 92 | ### Fixed 93 | - "Can only use .dt accessor with datetimelike values" error in makeReport 94 | - "invalid literal for int() with base 10" error in makeReport 95 | ### Changed 96 | - Bump to wf-metagenomics v2.9.2 97 | 98 | ## [v1.1.1] 99 | ### Changed 100 | - Bump to wf-metagenomics v2.9.1 101 | 102 | ## [v1.1.0] 103 | ### Added 104 | - Workflow now accepts BAM or FASTQ files as input (using the `--bam` or `--fastq` parameters, respectively). 105 | ### Changed 106 | - Bump to wf-metagenomics v2.9.0 107 | - Default for `--n_taxa_barplot` increased from 8 to 9. 108 | 109 | ## [v1.0.0] 110 | ### Changed 111 | - Bump to wf-metagenomics v2.8.0 112 | - Update docs 113 | 114 | ## [v0.0.4] 115 | ### Changed 116 | - Bump to wf-metagenomics v2.7.0 117 | - Fixed CHANGELOG format 118 | 119 | ## [v0.0.3] 120 | ### Changed 121 | - Bump to wf-metagenomics v2.6.1 122 | 123 | ## [v0.0.2] 124 | ### Changed 125 | - Bump to wf-metagenomics v2.6.0 126 | 127 | ## [v0.0.1] 128 | - First release. -------------------------------------------------------------------------------- /docs/08_pipeline_overview.md: -------------------------------------------------------------------------------- 1 | 2 | ### Workflow defaults and parameters 3 | The workflow sets default values for parameters optimised for the analysis of full-length 16S rRNA gene amplicons, including `min_len`, `max_len`, `min_ref_coverage`, and `min_percent_identity`. 4 | Descriptions of the parameters and their defaults can be found in the [input parameters section](#input-parameters). 5 | 6 | #### Analysing ITS amplicons 7 | For analysis of ITS amplicons users should adjust the following parameters: 8 | - `min_len` should be decreased to 300, as ITS amplicons may be shorter than the current `min_len` default value which will cause them to be excluded. 
9 | - `database_set` should be changed to `ncbi_16s_18s_28s_ITS` or a [custom database](#faqs) containing the relevant ITS references. 10 | 11 | ### 1. Concatenate input files and generate per read stats 12 | 13 | [fastcat](https://github.com/epi2me-labs/fastcat) is used to concatenate input FASTQ files prior to downstream processing by the workflow. It will also output per-read stats including read lengths and average qualities. 14 | 15 | You may want to choose which reads are analysed by filtering them using the flags `max_len`, `min_len` and `min_read_qual`. 16 | 17 | ### 2. Remove host sequences (optional) 18 | 19 | We have included an optional filtering step to remove any host sequences that map (using [Minimap2](https://github.com/lh3/minimap2)) against a provided host reference (e.g. human), which can be a FASTA file or an MMI index. To use this option, provide the path to your host reference with the `exclude_host` parameter. The mapped reads are output in a BAM file and excluded from further analysis. 20 | 21 | ``` 22 | nextflow run epi2me-labs/wf-16s --fastq test_data/case04/reads.fastq.gz --exclude_host test_data/case04/host.fasta.gz 23 | ``` 24 | 25 | ### 3. Classify reads taxonomically 26 | 27 | There are two different approaches to taxonomic classification: 28 | 29 | #### 3.1 Using Minimap2 30 | 31 | [Minimap2](https://github.com/lh3/minimap2) provides better resolution but, depending on the reference database used, can take significantly more time. This is the default option. 32 | 33 | ``` 34 | nextflow run epi2me-labs/wf-16s --fastq test_data/case01 --classifier minimap2 35 | ``` 36 | 37 | The creation of alignment statistics plots can be enabled with the `minimap2_by_reference` flag. Using this option produces a table and scatter plot in the report showing sequencing depth and coverage of each reference. 
The report also contains a heatmap indicating the sequencing depth over relative genomic coordinates for the references with the highest coverage (references with a mean coverage of less than 1% of the one with the largest value are omitted). 38 | 39 | In addition, the user can output BAM files in a folder called `bams` by using the option `keep_bam`. If the user provides a custom database and uses the `igv` option, the workflow will also output the references with mapped reads, as well as an IGV configuration file. This configuration file allows the user to view the alignments in the EPI2ME Desktop Application in the Viewer tab. Note that the number of references can be reduced using the `abundance_threshold` option, which selects only those references with more aligned reads than this value. Please note that the alignment view is highly dependent on the selected reference. 40 | 41 | #### 3.2 Using Kraken2 42 | 43 | [Kraken2](https://github.com/DerrickWood/kraken2) provides the fastest method for taxonomic classification of the reads. [Bracken](https://github.com/jenniferlu717/Bracken) is then used to estimate abundance at the genus level (or the selected taxonomic rank) in the sample. 44 | 45 | ### 4. Output 46 | 47 | The main output of the wf-16s pipeline is the `wf-16s-report.html`, which can be found in the output directory. It contains a summary of read statistics, the taxonomic composition of the sample and some diversity metrics. The results shown in the report can also be customised with several options. For example, you can use `abundance_threshold` to remove all taxa less prevalent than the threshold from the abundance table. When setting this parameter to a natural number, taxa with fewer absolute counts are removed. You can also pass a decimal between 0.0 and 1.0 to drop taxa of lower relative abundance. 
Furthermore, `n_taxa_barplot` controls the number of taxa displayed in the bar plot and groups the rest under the category ‘Other’. 48 | 49 | You can use the flag `include_read_assignments` to include a per-sample TSV file that indicates how each input sequence was classified, as well as the taxon that has been assigned to each read. 50 | 51 | For more information about remaining workflow outputs, please see [minimap2 Options](#minimap2-options). 52 | 53 | ### 5. Diversity indices 54 | 55 | Species diversity refers to the taxonomic composition in a specific microbial community. There are some useful concepts to take into account: 56 | * Richness: the number of unique taxonomic groups present in the community. 57 | * Taxonomic group abundance: the number of individuals of a particular taxonomic group present in the community. 58 | * Evenness: the equitability of the different taxonomic groups in terms of their abundances. 59 | Two different communities can host the same number of different taxonomic groups (i.e. they have the same richness), but they can have different evenness. This is the case, for instance, when one taxon's abundance is much larger in one community than in the other. 60 | 61 | There are three types of biodiversity measures described over a spatial scale [1](https://doi.org/10.2307/1218190), [2](https://doi.org/10.1016/B978-0-12-384719-5.00036-8): alpha-, beta-, and gamma-diversity. 62 | * Alpha-diversity refers to the richness that occurs within a community in a given area of a region. 63 | * Beta-diversity, defined as the variation in the identities of species among sites, provides a direct link between biodiversity at local scales (alpha diversity) and the broader regional species pool (gamma diversity). 64 | * Gamma-diversity is the total observed richness within an entire region. 
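The distinction between richness and evenness can be made concrete with a small numeric sketch. This is illustrative Python only (not part of the workflow), computing the Shannon and Pielou indices defined later in this section for two hypothetical communities:

```python
import math

def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou(counts):
    """Pielou evenness J = H / ln(S), where S is the observed richness."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)

# Two communities with the same richness (4 taxa) but different evenness.
community_a = [25, 25, 25, 25]   # perfectly even
community_b = [85, 5, 5, 5]      # one dominant taxon

print(round(pielou(community_a), 2))  # 1.0
print(round(pielou(community_b), 2))  # 0.42
```

Both communities contain four taxa (identical richness), but the second scores much lower on evenness because a single taxon dominates.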
65 | 66 | To provide a quick overview of the alpha-diversity of the microbial community, we provide some of the most common diversity metrics calculated for a specific taxonomic rank [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4224527/), which can be chosen by the user with the `taxonomic_rank` parameter ('D'=Domain, 'P'=Phylum, 'C'=Class, 'O'=Order, 'F'=Family, 'G'=Genus, 'S'=Species). By default, the rank is 'G' (genus-level). Some of the included alpha diversity metrics are: 67 | 68 | * Shannon Diversity Index (H): Shannon entropy approaches zero if a community is almost entirely made up of a single taxon. 69 | 70 | ```math 71 | H = -\sum_{i=1}^{S}p_i*ln(p_i) 72 | ``` 73 | 74 | * Simpson's Diversity Index (D): As defined here, D ranges from 0 (high diversity) to 1 (low diversity, i.e. a single dominant taxon). 75 | 76 | ```math 77 | D = \sum_{i=1}^{S}p_i^2 78 | ``` 79 | 80 | * Pielou Index (J): The values range from 0 (presence of a dominant species) to 1 (maximum evenness). 81 | 82 | ```math 83 | J = H/ln(S) 84 | ``` 85 | 86 | * Berger-Parker dominance index (BP): expresses the proportional importance of the most abundant type, i.e., the ratio of the number of individuals of the most abundant species to the total number of individuals of all species in the sample. 87 | 88 | ```math 89 | BP = n_i/N 90 | ``` 91 | where $`n_i`$ refers to the counts of the most abundant taxon and N is the total number of counts. 92 | 93 | 94 | * Fisher’s alpha: Fisher (see Fisher, 1943[4](https://doi.org/10.2307/1411)) noticed that only a few species tend to be abundant while most are represented by only a few individuals ('rare biosphere'). These differences in species abundance can be incorporated into species diversity measurements such as Fisher’s alpha. This index is based upon the log-series distribution of the number of individuals of different species. 95 | 96 | ```math 97 | S = \alpha * ln(1 + N/\alpha) 98 | ``` 99 | where S is the total number of taxa and N is the total number of individuals in the sample. 
The value of Fisher's $`\alpha`$ is calculated by iteration. 100 | 101 | These indices are calculated by default using the original abundance table (see McMurdie and Holmes[5](https://pubmed.ncbi.nlm.nih.gov/24699258/), 2014 and Willis[6](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02407/full), 2019). If you want to calculate them from a rarefied abundance table (i.e. all samples have been subsampled to the same number of counts per sample, which is 95% of the minimum total count), you can download the rarefied table from the report. 102 | 103 | The report also includes a rarefaction curve per sample, which displays the mean species richness for a subsample of reads (the sample size). Generally, this curve grows rapidly at first, as the most abundant species are sequenced and add new taxa to the community, and then flattens because 'rare' species are less likely to be sampled, making further increases in the number of observed species harder to detect. 104 | 105 | > Note: Within each rank, each named taxon is a unique unit. The counts are the number of reads assigned to that taxon. All `Unknown` sequences are considered as a single taxon. 106 | -------------------------------------------------------------------------------- /docs/06_input_parameters.md: -------------------------------------------------------------------------------- 1 | ### Input Options 2 | 3 | | Nextflow parameter name | Type | Description | Help | Default | 4 | |--------------------------|------|-------------|------|---------| 5 | | fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. 
In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 6 | | bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 7 | | classifier | string | Kraken2 or Minimap2 workflow to be used for classification of reads. | Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info. | minimap2 | 8 | | analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False | 9 | | exclude_host | string | A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis. | | | 10 | 11 | 12 | ### Sample Options 13 | 14 | | Nextflow parameter name | Type | Description | Help | Default | 15 | |--------------------------|------|-------------|------|---------| 16 | | sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed. | | 17 | | sample | string | A single sample name for non-multiplexed data. 
Permissible if passing a single .fastq(.gz) file or a directory of .fastq(.gz) files. | | | 18 | 19 | 20 | ### Reference Options 21 | 22 | | Nextflow parameter name | Type | Description | Help | Default | 23 | |--------------------------|------|-------------|------|---------| 24 | | database_set | string | Sets the reference, database and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. The workflow requires available memory to be slightly higher than the size of the database. | This setting can be overridden by providing an explicit taxonomy, database or reference path in the other reference options. | ncbi_16s_18s | 25 | | database | string | Not required, but can be used to specifically override the Kraken2 database [.tar.gz or directory]. | By default, the database chosen with the `database_set` parameter is used. | | 26 | | taxonomy | string | Not required, but can be used to specifically override the taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory]. | By default, the NCBI taxonomy file will be downloaded and used. | | 27 | | reference | string | Override the FASTA reference file selected by the `database_set` parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index. | This option should be used in conjunction with the `ref2taxid` parameter to specify a custom database. | | 28 | | ref2taxid | string | Not required, but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row. | By default, the ref2taxid mapping for the option chosen with the `database_set` parameter is used. | | 29 | | taxonomic_rank | string | Returns results at the chosen taxonomic rank. In the Kraken2 pipeline, this sets the level at which Bracken will estimate abundance. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species). 
| | G | 30 | 31 | 32 | ### Kraken2 Options 33 | 34 | | Nextflow parameter name | Type | Description | Help | Default | 35 | |--------------------------|------|-------------|------|---------| 36 | | bracken_length | integer | Set the length value Bracken will use. | Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically: ncbi_16s_18s = 1000, ncbi_16s_18s_28s_ITS = 1000, PlusPF-8 = 300. | | 37 | | bracken_threshold | integer | Set the minimum read threshold Bracken will use to consider a taxon. | Bracken will only consider taxa with a read count greater than or equal to this value. | 10 | 38 | | kraken2_memory_mapping | boolean | Avoid loading the database into RAM. | Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. It may be useful if the available RAM is lower than the size of the chosen database. | False | 39 | | kraken2_confidence | number | Kraken2 confidence score threshold. Valid interval: 0-1. | Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works. | 0.0 | 40 | 41 | 42 | ### Minimap2 Options 43 | 44 | | Nextflow parameter name | Type | Description | Help | Default | 45 | |--------------------------|------|-------------|------|---------| 46 | | minimap2filter | string | Filter minimap2 output by taxids, including child nodes, e.g. "9606,1404". | Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs. | | 47 | | minimap2exclude | boolean | Invert minimap2filter and exclude the given taxids instead. | Exclude a list of taxids from analysis outputs. | False | 48 | | keep_bam | boolean | Copy BAM files into the output directory. 
| | False | 49 | | minimap2_by_reference | boolean | Add a table with the mean sequencing depth per reference, along with its standard deviation and coefficient of variation. Also adds to the report a scatterplot of sequencing depth vs. coverage and a heatmap showing the depth per percentile. | | False | 50 | | min_percent_identity | number | Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified. | | 95 | 51 | | min_ref_coverage | number | Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths. | | 90 | 52 | 53 | 54 | ### Report Options 55 | 56 | | Nextflow parameter name | Type | Description | Help | Default | 57 | |--------------------------|------|-------------|------|---------| 58 | | abundance_threshold | number | Remove taxa whose abundance is equal to or lower than the chosen value. | To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0 and 1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number greater than or equal to 1. | 1 | 59 | | n_taxa_barplot | integer | Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the "Other" category. | | 9 | 60 | 61 | 62 | ### Output Options 63 | 64 | | Nextflow parameter name | Type | Description | Help | Default | 65 | |--------------------------|------|-------------|------|---------| 66 | | out_dir | string | Directory for output of all user-facing files. | | output | 67 | | igv | boolean | Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. 
If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index. | | False | 68 | | include_read_assignments | boolean | Output a per-sample TSV file that indicates the taxonomy assigned to each sequence. These files are only output on completion of the workflow. | | False | 69 | | output_unclassified | boolean | Output a FASTQ of the unclassified reads. | | False | 70 | 71 | 72 | ### Advanced Options 73 | 74 | | Nextflow parameter name | Type | Description | Help | Default | 75 | |--------------------------|------|-------------|------|---------| 76 | | min_len | integer | Specify read length lower limit. | Any reads shorter than this limit will not be included in the analysis. | 800 | 77 | | min_read_qual | number | Specify read quality lower limit. | Any reads with a quality lower than this limit will not be included in the analysis. | | 78 | | max_len | integer | Specify read length upper limit. | Any reads longer than this limit will not be included in the analysis. | 2000 | 79 | | threads | integer | Maximum number of CPU threads to use in each parallel workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. 
| 4 | 80 | 81 | 82 | -------------------------------------------------------------------------------- /main.nf: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env nextflow 2 | 3 | import groovy.json.JsonBuilder 4 | nextflow.enable.dsl = 2 5 | 6 | include { fastq_ingress; xam_ingress } from './lib/ingress' 7 | include { getParams } from './lib/common' 8 | include { run_common } from './wf-metagenomics/subworkflows/common_pipeline' 9 | include { minimap_pipeline } from './wf-metagenomics/subworkflows/minimap_pipeline' 10 | // standard kraken2 11 | include { kraken_pipeline } from './wf-metagenomics/subworkflows/kraken_pipeline' 12 | 13 | // databases 14 | include { prepare_databases } from "./wf-metagenomics/modules/local/databases.nf" 15 | include { 16 | makeReport; 17 | getVersions; 18 | getVersionsCommon; 19 | } from "./wf-metagenomics/modules/local/common" 20 | 21 | OPTIONAL_FILE = file("$projectDir/data/OPTIONAL_FILE") 22 | nextflow.preview.recursion=true 23 | 24 | // entrypoint workflow 25 | WorkflowMain.initialise(workflow, params, log) 26 | workflow { 27 | Pinguscript.ping_start(nextflow, workflow, params) 28 | 29 | dataDir = projectDir + '/data' 30 | 31 | // Ready the optional file 32 | OPTIONAL = file("$projectDir/data/OPTIONAL_FILE") 33 | 34 | 35 | // Checking user parameters 36 | log.info("Checking inputs.") 37 | 38 | // Check maximum and minimum length 39 | ArrayList fastcat_extra_args = [] 40 | if (params.min_len) { fastcat_extra_args << "-a $params.min_len" } 41 | if (params.max_len) { fastcat_extra_args << "-b $params.max_len" } 42 | if (params.min_read_qual) { fastcat_extra_args << "-q $params.min_read_qual" } 43 | // If BAM files are output, keep runIDs in case they are reused in the wf to track them. 
44 | boolean keep_bam = (params.keep_bam || params.igv) 45 | if (keep_bam) {fastcat_extra_args << "-H"} 46 | 47 | // Check source param is valid 48 | sources = params.database_sets 49 | if (params.containsKey('include_kraken2_assignments')){ 50 | throw new Exception("`include_kraken2_assignments` is now deprecated in favour of `include_read_assignments`.") 51 | } 52 | 53 | // Stop the pipeline in case of invalid parameter combinations 54 | if (params.classifier == 'minimap2' && params.database) { 55 | throw new Exception("To use minimap2 with your custom database, you need to use `--reference` (instead of `--database`) and `--ref2taxid`.") 56 | } 57 | 58 | boolean output_igv = params.igv 59 | if (params.classifier == 'minimap2' && params.reference && params.igv) { 60 | ArrayList ref_exts = [".fa", ".fa.gz", ".fasta", ".fasta.gz", ".fna", ".fna.gz"] 61 | if (! ref_exts.any { ext -> file(params.reference).name.endsWith(ext) }) { 62 | output_igv = false 63 | log.info("The custom database reference must be a FASTA format file in order to view within IGV.") 64 | } else { 65 | output_igv = true 66 | } 67 | } 68 | 69 | if ((params.classifier == 'kraken2') && params.reference) { 70 | throw new Exception("To use kraken2 with your custom database, you need to use `--database` (instead of `--reference`) and include the `bracken_dist` within it.") 71 | } 72 | 73 | // If the user provides a custom reference/database, set the source name to 'custom' 74 | if (params.reference || params.database) { 75 | source_name = 'custom' 76 | // distinguish between taxonomy and database so that the default taxonomy db can still be used in some cases. 77 | // this is potentially risky, but can be justified if the reference and ref2taxid use NCBI taxids. 
78 | source_data_database = null 79 | source_name_taxonomy = params.database_set 80 | source_data_taxonomy = sources.get(source_name_taxonomy, false) 81 | log.info("Note: Reference/Database are custom.") 82 | log.info("Note: Memory available to the workflow must be slightly higher than the size of the database $source_name index.") 83 | if (params.classifier == "kraken2"){ 84 | log.info("Note: Alternatively, consider using `--kraken2_memory_mapping`.") 85 | } 86 | 87 | } 88 | if (params.taxonomy) { // custom taxonomy, e.g. an updated taxonomy file used with the default reference 89 | if (!(params.reference || params.database)) { 90 | source_name = params.database_set 91 | source_data_database = sources.get(source_name, false) 92 | } 93 | source_data_taxonomy = null 94 | log.info("Note: Taxonomy database is custom.") 95 | } else if (!(params.reference || params.database)) { 96 | source_name = params.database_set 97 | source_data_database = sources.get(source_name, false) 98 | source_data_taxonomy = sources.get(source_name, false) 99 | if (!sources.containsKey(source_name) || !source_data_database) { 100 | keys = sources.keySet() 101 | throw new Exception("Source $params.database_set is invalid, must be one of $keys") 102 | } 103 | } 104 | // Input data 105 | if (params.fastq) { 106 | ingress_samples = fastq_ingress([ 107 | "input":params.fastq, 108 | "sample": params.sample, 109 | "sample_sheet": params.sample_sheet, 110 | "analyse_unclassified":params.analyse_unclassified, 111 | "stats": true, 112 | "fastcat_extra_args": fastcat_extra_args.join(" "), 113 | "per_read_stats": false 114 | ]) 115 | } else { 116 | // if we didn't get a `--fastq`, there must have been a `--bam` (as is codified 117 | // by the schema) 118 | ingress_samples = xam_ingress([ 119 | "input":params.bam, 120 | "sample":params.sample, 121 | "sample_sheet":params.sample_sheet, 122 | "analyse_unclassified":params.analyse_unclassified, 123 | "return_fastq": true, 124 | "keep_unaligned": true, 125 | "stats": true, 126 | "per_read_stats": false 127 | ]) 128 | } 129 | 130 | 
131 | // Discard empty samples 132 | log.info( 133 | "Note: Files that are empty, or whose reads were all discarded by the read length and/or read quality filters, " + 134 | "will not appear in the report and will be excluded from subsequent analysis.") 135 | ingress_samples_filtered = ingress_samples 136 | | filter { meta, _seqs, _stats -> 137 | def valid = meta['n_seqs'] > 0 138 | if (!valid) { 139 | log.warn "Found empty file for sample '${meta["alias"]}'." 140 | } 141 | valid 142 | } 143 | 144 | // Set minimap2 common options 145 | ArrayList common_minimap2_opts = [ 146 | "-ax map-ont", 147 | "--cap-kalloc 100m", 148 | "--cap-sw-mem 50m", 149 | ] 150 | 151 | 152 | // Run common 153 | versions = getVersionsCommon(getVersions()) 154 | parameters = getParams() 155 | 156 | if (params.exclude_host) { 157 | host_reference = file(params.exclude_host, checkIfExists: true) 158 | samples = run_common(ingress_samples_filtered, host_reference, common_minimap2_opts).samples 159 | } else { 160 | samples = ingress_samples_filtered 161 | } 162 | 163 | if (params.classifier == "minimap2") { 164 | log.info("Minimap2 pipeline.") 165 | if (keep_bam) { 166 | common_minimap2_opts = common_minimap2_opts + ["-y"] 167 | } 168 | databases = prepare_databases( 169 | source_data_taxonomy, 170 | source_data_database 171 | ) 172 | results = minimap_pipeline( 173 | samples, 174 | databases.reference, 175 | databases.ref2taxid, 176 | databases.taxonomy, 177 | databases.taxonomic_rank, 178 | common_minimap2_opts, 179 | output_igv 180 | ) 181 | alignment_stats = results.alignment_reports 182 | } else { 183 | // Fetch kraken2 database files when the kraken2 classifier is selected 184 | log.info("Kraken2 pipeline.") 185 | alignment_stats = Channel.empty() 186 | databases = prepare_databases( 187 | source_data_taxonomy, 188 | source_data_database 189 | ) 190 | results = kraken_pipeline( 191 | samples, 192 | databases.taxonomy, 193 | databases.database, 194 | databases.bracken_length, 
195 | databases.taxonomic_rank, 196 | ) 197 | } 198 | 199 | // Use the initial read stats (after fastcat) for QC, 200 | // and those after host depletion, 201 | // but update meta after running the pipelines 202 | for_report = ingress_samples_filtered 203 | | map { meta, _path, stats -> 204 | [ meta.alias, stats ] } 205 | | combine( 206 | results.metadata_after_taxonomy, 207 | by: 0 ) // on alias 208 | | multiMap { _alias, stats, meta -> 209 | meta: meta 210 | stats: stats } 211 | // Reporting 212 | makeReport( 213 | workflow.manifest.version, 214 | for_report.meta.collect(), 215 | for_report.stats.collect(), 216 | results.abundance_table, 217 | alignment_stats.ifEmpty(OPTIONAL_FILE), 218 | results.lineages, 219 | versions, 220 | parameters, 221 | databases.taxonomic_rank, 222 | OPTIONAL_FILE 223 | ) 224 | } 225 | 226 | workflow.onComplete { 227 | Pinguscript.ping_complete(nextflow, workflow, params) 228 | } 229 | workflow.onError { 230 | Pinguscript.ping_error(nextflow, workflow, params) 231 | } 232 | -------------------------------------------------------------------------------- /test/test_fastq_ingress.py: -------------------------------------------------------------------------------- 1 | """Test `fastq_ingress` results of a previously run workflow.""" 2 | import argparse 3 | import json 4 | import os 5 | import pathlib 6 | import re 7 | import sys 8 | 9 | import pandas as pd 10 | import pysam 11 | import pytest 12 | 13 | 14 | FASTQ_EXTENSIONS = ["fastq", "fastq.gz", "fq", "fq.gz"] 15 | ROOT_DIR = pathlib.Path(__file__).resolve().parent.parent 16 | 17 | 18 | def is_fastq_file(fname): 19 | """Check if file is a FASTQ file.""" 20 | return any(map(lambda ext: fname.endswith(ext), FASTQ_EXTENSIONS)) 21 | 22 | 23 | def get_fastq_files(path): 24 | """Return a list of FASTQ files for a given path.""" 25 | return filter(is_fastq_file, os.listdir(path)) if os.path.isdir(path) else [path] 26 | 27 | 28 | def create_metadict(**kwargs): 29 | """Create dict from metadata and check if required 
values are present.""" 30 | if "alias" not in kwargs or kwargs["alias"] is None: 31 | raise ValueError("Meta data needs 'alias'.") 32 | defaults = dict(barcode=None, type="test_sample", run_ids=[]) 33 | if "run_ids" in kwargs: 34 | # cast to sorted list to compare to workflow output 35 | kwargs["run_ids"] = sorted(list(kwargs["run_ids"])) 36 | defaults.update(kwargs) 37 | defaults["alias"] = defaults["alias"].replace(" ", "_") 38 | return defaults 39 | 40 | 41 | def get_fastq_names_and_runids(fastq_file): 42 | """Create a dict of names and run_ids for entries in a FASTQ file.""" 43 | names = [] 44 | run_ids = set() 45 | with pysam.FastxFile(fastq_file) as f: 46 | for entry in f: 47 | names.append(entry.name) 48 | (run_id,) = re.findall(r"runid=([^\s]+)", entry.comment) or [None] 49 | if run_id: 50 | run_ids.add(run_id) 51 | return dict(names=names, run_ids=run_ids) 52 | 53 | 54 | def args(): 55 | """Parse and process input arguments. Use the workflow params for those missing.""" 56 | # get the path to the workflow output directory 57 | parser = argparse.ArgumentParser() 58 | parser.add_argument( 59 | "--wf-output-dir", 60 | default=ROOT_DIR / "output", 61 | help=( 62 | "path to the output directory where the workflow results have been " 63 | "published; defaults to 'output' in the root directory of the workflow if " 64 | "not provided" 65 | ), 66 | ) 67 | parser.add_argument( 68 | "--fastq", 69 | help=( 70 | "Path to FASTQ input file / directory with FASTQ files / sub-directories; " 71 | "will take input path from workflow output if not provided" 72 | ), 73 | ) 74 | parser.add_argument( 75 | "--sample_sheet", 76 | help=( 77 | "Path to sample sheet CSV file. If not provided, will take sample sheet " 78 | "path from workflow params (if available)." 
79 | ), 80 | ) 81 | args = parser.parse_args() 82 | 83 | wf_output_dir = pathlib.Path(args.wf_output_dir) 84 | fastq_ingress_results_dir = wf_output_dir / "fastq_ingress_results" 85 | 86 | # make sure that there are fastq_ingress results (i.e. that the workflow has been 87 | # run successfully and that the correct wf output path was provided) 88 | if not fastq_ingress_results_dir.exists(): 89 | raise ValueError( 90 | f"{fastq_ingress_results_dir} does not exist. Has `wf-template` been run?" 91 | ) 92 | 93 | # get the workflow params 94 | with open(wf_output_dir / "params.json", "r") as f: 95 | params = json.load(f) 96 | input_path = args.fastq if args.fastq is not None else ROOT_DIR / params["fastq"] 97 | sample_sheet = args.sample_sheet 98 | if sample_sheet is None and params["sample_sheet"] is not None: 99 | sample_sheet = ROOT_DIR / params["sample_sheet"] 100 | 101 | if not os.path.exists(input_path): 102 | raise ValueError(f"Input path '{input_path}' does not exist.") 103 | 104 | return input_path, sample_sheet, fastq_ingress_results_dir, params 105 | 106 | 107 | def get_valid_inputs(input_path, sample_sheet, params): 108 | """Get valid input paths and corresponding metadata.""" 109 | # find the valid inputs 110 | valid_inputs = [] 111 | if os.path.isfile(input_path): 112 | # handle file case 113 | fastq_entries = get_fastq_names_and_runids(input_path) 114 | valid_inputs.append( 115 | [ 116 | create_metadict( 117 | alias=params["sample"] 118 | if params["sample"] is not None 119 | else os.path.basename(input_path).split(".")[0], 120 | run_ids=fastq_entries["run_ids"], 121 | ), 122 | input_path, 123 | ] 124 | ) 125 | else: 126 | # is a directory --> check if fastq files in top-level dir or in sub-dirs 127 | tree = list(os.walk(input_path)) 128 | top_dir_has_fastq_files = any(map(is_fastq_file, tree[0][2])) 129 | subdirs_have_fastq_files = any( 130 | any(map(is_fastq_file, files)) for _, _, files in tree[1:] 131 | ) 132 | if top_dir_has_fastq_files and 
subdirs_have_fastq_files: 133 | raise ValueError( 134 | f"Input directory '{input_path}' cannot contain FASTQ " 135 | "files and sub-directories with FASTQ files." 136 | ) 137 | # make sure we only have fastq files in either (top-level dir or sub-dirs) and 138 | # not both 139 | if not top_dir_has_fastq_files and not subdirs_have_fastq_files: 140 | raise ValueError( 141 | f"Input directory '{input_path}' contains neither sub-directories " 142 | "nor FASTQ files." 143 | ) 144 | if top_dir_has_fastq_files: 145 | run_ids = set() 146 | for fastq_file in get_fastq_files(input_path): 147 | curr_fastq_entries = get_fastq_names_and_runids( 148 | pathlib.Path(input_path) / fastq_file 149 | ) 150 | run_ids.update(curr_fastq_entries["run_ids"]) 151 | valid_inputs.append( 152 | [ 153 | create_metadict( 154 | alias=params["sample"] 155 | if params["sample"] is not None 156 | else os.path.basename(input_path), 157 | run_ids=run_ids, 158 | ), 159 | input_path, 160 | ] 161 | ) 162 | else: 163 | # iterate over the sub-directories 164 | for subdir, subsubdirs, files in tree[1:]: 165 | # make sure we don't have sub-sub-directories containing fastq files 166 | if subsubdirs and any( 167 | is_fastq_file(file) 168 | for subsubdir in subsubdirs 169 | for file in os.listdir(pathlib.Path(subdir) / subsubdir) 170 | ): 171 | raise ValueError( 172 | f"Input directory '{input_path}' cannot contain more " 173 | "than one level of sub-directories with FASTQ files." 
174 | ) 175 | # handle unclassified 176 | if ( 177 | os.path.basename(subdir) == "unclassified" 178 | and not params["analyse_unclassified"] 179 | ): 180 | continue 181 | # only process further if sub-dir has fastq files 182 | if any(map(is_fastq_file, files)): 183 | run_ids = set() 184 | for fastq_file in get_fastq_files(subdir): 185 | curr_fastq_entries = get_fastq_names_and_runids( 186 | pathlib.Path(subdir) / fastq_file 187 | ) 188 | run_ids.update(curr_fastq_entries["run_ids"]) 189 | 190 | barcode = os.path.basename(subdir) 191 | valid_inputs.append( 192 | [ 193 | create_metadict( 194 | alias=barcode, 195 | barcode=barcode, 196 | run_ids=run_ids, 197 | ), 198 | subdir, 199 | ] 200 | ) 201 | # parse the sample sheet in case there was one 202 | if sample_sheet is not None: 203 | sample_sheet = pd.read_csv(sample_sheet).set_index( 204 | # set 'barcode' as index while also keeping the 'barcode' column in the df 205 | "barcode", 206 | drop=False, 207 | ) 208 | # now, get the corresponding inputs for each entry in the sample sheet (sample 209 | # sheet entries for which no input directory was found will have `None` as their 210 | # input path); we need a dict mapping barcodes to valid input paths for this 211 | valid_inputs_dict = {os.path.basename(path): path for _, path in valid_inputs} 212 | # reset `valid_inputs` 213 | valid_inputs = [] 214 | for barcode, meta in sample_sheet.iterrows(): 215 | path = valid_inputs_dict.get(barcode) 216 | run_ids = set() 217 | if path is not None: 218 | for fastq_file in get_fastq_files(path): 219 | curr_fastq_entries = get_fastq_names_and_runids( 220 | pathlib.Path(path) / fastq_file 221 | ) 222 | run_ids.update(curr_fastq_entries["run_ids"]) 223 | valid_inputs.append([create_metadict(**dict(meta), run_ids=run_ids), path]) 224 | return valid_inputs 225 | 226 | 227 | # prepare data for the tests 228 | @pytest.fixture(scope="module") 229 | def prepare(): 230 | """Prepare data for tests.""" 231 | input_path, sample_sheet, 
fastq_ingress_results_dir, params = args() 232 | valid_inputs = get_valid_inputs(input_path, sample_sheet, params) 233 | return fastq_ingress_results_dir, valid_inputs, params 234 | 235 | 236 | # define tests 237 | def test_result_subdirs(prepare): 238 | """ 239 | Test if workflow results dir contains all expected samples. 240 | 241 | Tests if the published sub-directories in `fastq_ingress_results_dir` contain all 242 | the samples we expect. 243 | """ 244 | fastq_ingress_results_dir, valid_inputs, _ = prepare 245 | _, subdirs, files = next(os.walk(fastq_ingress_results_dir)) 246 | assert not files, "Files found in top-level dir of fastq_ingress results" 247 | assert set(subdirs) == set([meta["alias"] for meta, _ in valid_inputs]) 248 | 249 | 250 | def test_fastq_entry_names(prepare): 251 | """ 252 | Test FASTQ entries. 253 | 254 | Tests if the concatenated sequences indeed contain all the FASTQ entries of the 255 | FASTQ files in the valid inputs. 256 | """ 257 | fastq_ingress_results_dir, valid_inputs, _ = prepare 258 | for meta, path in valid_inputs: 259 | if path is None: 260 | # this sample sheet entry had no input dir (or no reads) 261 | continue 262 | # get FASTQ entries in the result file produced by the workflow 263 | fastq_entries = get_fastq_names_and_runids( 264 | fastq_ingress_results_dir / meta["alias"] / "seqs.fastq.gz" 265 | ) 266 | # now collect the FASTQ entries from the individual input files 267 | exp_fastq_names = [] 268 | exp_fastq_runids = [] 269 | for fastq_file in get_fastq_files(path): 270 | curr_fastq_entries = get_fastq_names_and_runids( 271 | pathlib.Path(path) / fastq_file 272 | ) 273 | exp_fastq_names += curr_fastq_entries["names"] 274 | exp_fastq_runids += curr_fastq_entries["run_ids"] 275 | assert set(fastq_entries["names"]) == set(exp_fastq_names) 276 | assert set(fastq_entries["run_ids"]) == set(exp_fastq_runids) 277 | 278 | 279 | def test_stats_present(prepare): 280 | """Tests if the `fastcat` stats are present when they should 
be.""" 281 | fastq_ingress_results_dir, valid_inputs, params = prepare 282 | for meta, path in valid_inputs: 283 | if path is None: 284 | # this sample sheet entry had no input dir (or no reads) 285 | continue 286 | # we expect `fastcat` stats in two cases: (i) they were requested explicitly or 287 | # (ii) the input was a directory containing multiple FASTQ files 288 | expect_stats = ( 289 | params["wf"]["fastcat_stats"] 290 | or os.path.isdir(path) 291 | and len(list(filter(is_fastq_file, os.listdir(path)))) > 1 292 | ) 293 | stats_dir = fastq_ingress_results_dir / meta["alias"] / "fastcat_stats" 294 | # assert that stats are there when we expect them 295 | assert expect_stats == stats_dir.exists() 296 | # make sure that the per-file and per-read stats files are there 297 | if expect_stats: 298 | for fname in ("per-file-stats.tsv", "per-read-stats.tsv"): 299 | assert ( 300 | fastq_ingress_results_dir / meta["alias"] / "fastcat_stats" / fname 301 | ).is_file() 302 | 303 | 304 | def test_metamap(prepare): 305 | """Test if the metamap in the `fastq_ingress` results is as expected.""" 306 | fastq_ingress_results_dir, valid_inputs, params = prepare 307 | for meta, _ in valid_inputs: 308 | # if there were no fastcat stats, we can't expect run IDs in the metamap 309 | if not params["wf"]["fastcat_stats"]: 310 | meta["run_ids"] = [] 311 | with open(fastq_ingress_results_dir / meta["alias"] / "metamap.json", "r") as f: 312 | metamap = json.load(f) 313 | assert meta == metamap 314 | 315 | 316 | if __name__ == "__main__": 317 | # trigger pytest 318 | ret_code = pytest.main([os.path.realpath(__file__), "-vv"]) 319 | sys.exit(ret_code) 320 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Oxford Nanopore Technologies PLC. Public License Version 1.0 2 | ============================================================= 3 | 4 | 1. 
Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. "Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor’s Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. "Executable Form" 25 | means any form of the work other than Source Code Form. 26 | 27 | 1.6. "Larger Work" 28 | means a work that combines Covered Software with other material, in 29 | a separate file or files, that is not Covered Software. 30 | 31 | 1.7. "License" 32 | means this document. 33 | 34 | 1.8. "Licensable" 35 | means having the right to grant, to the maximum extent possible, 36 | whether at the time of the initial grant or subsequently, any and 37 | all of the rights conveyed by this License. 38 | 39 | 1.9. "Modifications" 40 | means any of the following: 41 | 42 | (a) any file in Source Code Form that results from an addition to, 43 | deletion from, or modification of the contents of Covered 44 | Software; or 45 | (b) any new file in Source Code Form that contains any Covered 46 | Software. 47 | 48 | 1.10. "Research Purposes" 49 | means use for internal research and not intended for or directed 50 | towards commercial advantages or monetary compensation; provided, 51 | however, that monetary compensation does not include sponsored 52 | research of research funded by grants. 
53 | 54 | 1.11 "Secondary License" 55 | means either the GNU General Public License, Version 2.0, the GNU 56 | Lesser General Public License, Version 2.1, the GNU Affero General 57 | Public License, Version 3.0, or any later versions of those 58 | licenses. 59 | 60 | 1.12. "Source Code Form" 61 | means the form of the work preferred for making modifications. 62 | 63 | 1.13. "You" (or "Your") 64 | means an individual or a legal entity exercising rights under this 65 | License. For legal entities, "You" includes any entity that 66 | controls, is controlled by, or is under common control with You. For 67 | purposes of this definition, "control" means (a) the power, direct 68 | or indirect, to cause the direction or management of such entity, 69 | whether by contract or otherwise, or (b) ownership of more than 70 | fifty percent (50%) of the outstanding shares or beneficial 71 | ownership of such entity. 72 | 73 | 2. License Grants and Conditions 74 | -------------------------------- 75 | 76 | 2.1. Grants 77 | 78 | Each Contributor hereby grants You a world-wide, royalty-free, 79 | non-exclusive license under Contributor copyrights Licensable by such 80 | Contributor to use, reproduce, make available, modify, display, 81 | perform, distribute, and otherwise exploit solely for Research Purposes 82 | its Contributions, either on an unmodified basis, with Modifications, 83 | or as part of a Larger Work. 84 | 85 | 2.2. Effective Date 86 | 87 | The licenses granted in Section 2.1 with respect to any Contribution 88 | become effective for each Contribution on the date the Contributor 89 | first distributes such Contribution. 90 | 91 | 2.3. Limitations on Grant Scope 92 | 93 | The licenses granted in this Section 2 are the only rights granted under 94 | this License. No additional rights or licenses will be implied from the 95 | distribution or licensing of Covered Software under this License. The 96 | License is incompatible with Secondary Licenses. 
Notwithstanding 97 | Section 2.1 above, no copyright license is granted: 98 | 99 | (a) for any code that a Contributor has removed from Covered Software; 100 | or 101 | 102 | (b) use of the Contributions or its Contributor Version other than for 103 | Research Purposes only; or 104 | 105 | (c) for infringements caused by: (i) Your and any other third party’s 106 | modifications of Covered Software, or (ii) the combination of its 107 | Contributions with other software (except as part of its Contributor 108 | Version). 109 | 110 | This License does not grant any rights in the patents, trademarks, 111 | service marks, or logos of any Contributor (except as may be necessary 112 | to comply with the notice requirements in Section 3.4). 113 | 114 | 2.4. Subsequent Licenses 115 | 116 | No Contributor makes additional grants as a result of Your choice to 117 | distribute the Covered Software under a subsequent version of this 118 | License (see Section 10.2) or under the terms of a Secondary License 119 | (if permitted under the terms of Section 3.3). 120 | 121 | 2.5. Representation 122 | 123 | Each Contributor represents that the Contributor believes its 124 | Contributions are its original creation(s) or it has sufficient rights 125 | to grant the rights to its Contributions conveyed by this License. 126 | 127 | 2.6. Fair Use 128 | 129 | This License is not intended to limit any rights You have under 130 | applicable copyright doctrines of fair use, fair dealing, or other 131 | equivalents. 132 | 133 | 2.7. Conditions 134 | 135 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 136 | in Section 2.1. 137 | 138 | 3. Responsibilities 139 | ------------------- 140 | 141 | 3.1. Distribution of Source Form 142 | 143 | All distribution of Covered Software in Source Code Form, including any 144 | Modifications that You create or to which You contribute, must be under 145 | the terms of this License. 
You must inform recipients that the Source 146 | Code Form of the Covered Software is governed by the terms of this 147 | License, and how they can obtain a copy of this License. You may not 148 | attempt to alter or restrict the recipients’ rights in the Source Code Form. 149 | 150 | 3.2. Distribution of Executable Form 151 | 152 | If You distribute Covered Software in Executable Form then: 153 | 154 | (a) such Covered Software must also be made available in Source Code 155 | Form, as described in Section 3.1, and You must inform recipients of 156 | the Executable Form how they can obtain a copy of such Source Code 157 | Form by reasonable means in a timely manner, at a charge no more 158 | than the cost of distribution to the recipient; and 159 | 160 | (b) You may distribute such Executable Form under the terms of this 161 | License. 162 | 163 | 3.3. Distribution of a Larger Work 164 | 165 | You may create and distribute a Larger Work under terms of Your choice, 166 | provided that You also comply with the requirements of this License for 167 | the Covered Software. The Larger Work may not be a combination of Covered 168 | Software with a work governed by one or more Secondary Licenses. 169 | 170 | 3.4. Notices 171 | 172 | You may not remove or alter the substance of any license notices 173 | (including copyright notices, patent notices, disclaimers of warranty, 174 | or limitations of liability) contained within the Source Code Form of 175 | the Covered Software, except that You may alter any license notices to 176 | the extent required to remedy known factual inaccuracies. 177 | 178 | 3.5. Application of Additional Terms 179 | 180 | You may not choose to offer, or charge a fee for use of the Covered 181 | Software or a fee for, warranty, support, indemnity or liability 182 | obligations to one or more recipients of Covered Software. 
You must 183 | make it absolutely clear that any such warranty, support, indemnity, or 184 | liability obligation is offered by You alone, and You hereby agree to 185 | indemnify every Contributor for any liability incurred by such 186 | Contributor as a result of warranty, support, indemnity or liability 187 | terms You offer. You may include additional disclaimers of warranty and 188 | limitations of liability specific to any jurisdiction. 189 | 190 | 4. Inability to Comply Due to Statute or Regulation 191 | --------------------------------------------------- 192 | 193 | If it is impossible for You to comply with any of the terms of this 194 | License with respect to some or all of the Covered Software due to 195 | statute, judicial order, or regulation then You must: (a) comply with 196 | the terms of this License to the maximum extent possible; and (b) 197 | describe the limitations and the code they affect. Such description must 198 | be placed in a text file included with all distributions of the Covered 199 | Software under this License. Except to the extent prohibited by statute 200 | or regulation, such description must be sufficiently detailed for a 201 | recipient of ordinary skill to be able to understand it. 202 | 203 | 5. Termination 204 | -------------- 205 | 206 | 5.1. The rights granted under this License will terminate automatically 207 | if You fail to comply with any of its terms. 208 | 209 | 5.2. If You initiate litigation against any entity by asserting an 210 | infringement claim (excluding declaratory judgment actions, 211 | counter-claims, and cross-claims) alleging that a Contributor Version 212 | directly or indirectly infringes, then the rights granted to 213 | You by any and all Contributors for the Covered Software under Section 214 | 2.1 of this License shall terminate. 215 | 216 | 5.3. 
In the event of termination under Sections 5.1 or 5.2 above, all 217 | end user license agreements (excluding distributors and resellers) which 218 | have been validly granted by You or Your distributors under this License 219 | prior to termination shall survive termination. 220 | 221 | ************************************************************************ 222 | * * 223 | * 6. Disclaimer of Warranty * 224 | * ------------------------- * 225 | * * 226 | * Covered Software is provided under this License on an "as is" * 227 | * basis, without warranty of any kind, either expressed, implied, or * 228 | * statutory, including, without limitation, warranties that the * 229 | * Covered Software is free of defects, merchantable, fit for a * 230 | * particular purpose or non-infringing. The entire risk as to the * 231 | * quality and performance of the Covered Software is with You. * 232 | * Should any Covered Software prove defective in any respect, You * 233 | * (not any Contributor) assume the cost of any necessary servicing, * 234 | * repair, or correction. This disclaimer of warranty constitutes an * 235 | * essential part of this License. No use of any Covered Software is * 236 | * authorized under this License except under this disclaimer. * 237 | * * 238 | ************************************************************************ 239 | 240 | ************************************************************************ 241 | * * 242 | * 7. 
Limitation of Liability * 243 | * -------------------------- * 244 | * * 245 | * Under no circumstances and under no legal theory, whether tort * 246 | * (including negligence), contract, or otherwise, shall any * 247 | * Contributor, or anyone who distributes Covered Software as * 248 | * permitted above, be liable to You for any direct, indirect, * 249 | * special, incidental, or consequential damages of any character * 250 | * including, without limitation, damages for lost profits, loss of * 251 | * goodwill, work stoppage, computer failure or malfunction, or any * 252 | * and all other commercial damages or losses, even if such party * 253 | * shall have been informed of the possibility of such damages. This * 254 | * limitation of liability shall not apply to liability for death or * 255 | * personal injury resulting from such party’s negligence to the * 256 | * extent applicable law prohibits such limitation, but in such event, * 257 | * and to the greatest extent permissible, damages will be limited to * 258 | * direct damages not to exceed one hundred dollars. Some * 259 | * jurisdictions do not allow the exclusion or limitation of * 260 | * incidental or consequential damages, so this exclusion and * 261 | * limitation may not apply to You. * 262 | * * 263 | ************************************************************************ 264 | 265 | 8. Litigation 266 | ------------- 267 | 268 | Any litigation relating to this License may be brought only in the 269 | courts of a jurisdiction where the defendant maintains its principal 270 | place of business and such litigation shall be governed by laws of that 271 | jurisdiction, without reference to its conflict-of-law provisions. 272 | Nothing in this Section shall prevent a party’s ability to bring 273 | cross-claims or counter-claims. 274 | 275 | 9. Miscellaneous 276 | ---------------- 277 | 278 | This License represents the complete agreement concerning the subject 279 | matter hereof. 
If any provision of this License is held to be 280 | unenforceable, such provision shall be reformed only to the extent 281 | necessary to make it enforceable. Any law or regulation which provides 282 | that the language of a contract shall be construed against the drafter 283 | shall not be used to construe this License against a Contributor. 284 | 285 | 10. Versions of the License 286 | --------------------------- 287 | 288 | 10.1. New Versions 289 | 290 | Oxford Nanopore Technologies PLC. is the license steward. Except as 291 | provided in Section 10.3, no one other than the license steward has the 292 | right to modify or publish new versions of this License. Each version 293 | will be given a distinguishing version number. 294 | 295 | 10.2. Effect of New Versions 296 | 297 | You may distribute the Covered Software under the terms of the version 298 | of the License under which You originally received the Covered Software, 299 | or under the terms of any subsequent version published by the license 300 | steward. 301 | 302 | 10.3. Modified Versions 303 | 304 | If you create software not governed by this License, and you want to 305 | create a new license for such software, you may create and use a 306 | modified version of this License if you rename the license and remove 307 | any references to the name of the license steward (except to note that 308 | such modified license differs from this License). 309 | 310 | Exhibit A - Source Code Form License Notice 311 | ------------------------------------------- 312 | 313 | This Source Code Form is subject to the terms of the Oxford Nanopore 314 | Technologies PLC. Public License, v. 1.0. The full licence can be 315 | obtained from support@nanoporetech.com. 316 | 317 | If it is not possible or desirable to put the notice in a particular 318 | file, then You may include the notice in a location (such as a LICENSE 319 | file in a relevant directory) where a recipient would be likely to look 320 | for such a notice.
321 | 322 | You may add additional accurate notices of copyright ownership. 323 | -------------------------------------------------------------------------------- /nextflow_schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "$schema": "http://json-schema.org/draft-07/schema", 3 | "$id": "https://raw.githubusercontent.com/epi2me-labs/wf-16s/master/nextflow_schema.json", 4 | "title": "epi2me-labs/wf-16s", 5 | "workflow_title": "16S rRNA", 6 | "description": "Taxonomic classification of 16S rRNA gene sequencing data.", 7 | "demo_url": "https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz", 8 | "aws_demo_url": "https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo/aws.nextflow.config", 9 | "url": "https://github.com/epi2me-labs/wf-16s", 10 | "type": "object", 11 | "resources": { 12 | "recommended": { 13 | "cpus": 12, 14 | "memory": "32GB" 15 | }, 16 | "minimum": { 17 | "cpus": 6, 18 | "memory": "16GB" 19 | }, 20 | "run_time": "~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database.", 21 | "arm_support": true 22 | }, 23 | "definitions": { 24 | "input_options": { 25 | "title": "Input Options", 26 | "type": "object", 27 | "fa_icon": "fas fa-terminal", 28 | "description": "Define where the pipeline should find input data and save output data.", 29 | "properties": { 30 | "fastq": { 31 | "type": "string", 32 | "format": "path", 33 | "title": "FASTQ", 34 | "description": "FASTQ files to use in the analysis.", 35 | "help_text": "This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. 
In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`.", 36 | "demo_data": "test_data" 37 | }, 38 | "bam": { 39 | "type": "string", 40 | "format": "path", 41 | "description": "BAM or unaligned BAM (uBAM) files to use in the analysis.", 42 | "help_text": "This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`." 43 | }, 44 | "classifier": { 45 | "type": "string", 46 | "default": "minimap2", 47 | "title": "Classification method", 48 | "description": "Kraken2 or Minimap2 workflow to be used for classification of reads.", 49 | "enum": [ 50 | "kraken2", 51 | "minimap2" 52 | ], 53 | "help_text": "Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info." 54 | }, 55 | "analyse_unclassified": { 56 | "type": "boolean", 57 | "default": false, 58 | "title": "Analyse unclassified reads", 59 | "description": "Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory.", 60 | "help_text": "If selected and if the input is a multiplex directory the workflow will also process the unclassified directory." 61 | }, 62 | "exclude_host": { 63 | "type": "string", 64 | "format": "file-path", 65 | "title": "Exclude host reads", 66 | "description": "A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis." 
67 | } 68 | }, 69 | "oneOf": [ 70 | { 71 | "required": [ 72 | "fastq" 73 | ] 74 | }, 75 | { 76 | "required": [ 77 | "bam" 78 | ] 79 | } 80 | ] 81 | }, 82 | "sample_options": { 83 | "title": "Sample Options", 84 | "type": "object", 85 | "default": "", 86 | "properties": { 87 | "sample_sheet": { 88 | "type": "string", 89 | "format": "file-path", 90 | "title": "Sample sheet", 91 | "description": "A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files.", 92 | "help_text": "The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed." 93 | }, 94 | "sample": { 95 | "type": "string", 96 | "title": "Sample name", 97 | "description": "A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files." 98 | } 99 | }, 100 | "description": "Parameters that relate to samples such as sample sheets and sample names." 101 | }, 102 | "reference_options": { 103 | "title": "Reference Options", 104 | "type": "object", 105 | "description": "Files will be downloaded as part of the first run of workflow and automatically stored for subsequent runs.", 106 | "default": "", 107 | "properties": { 108 | "database_set": { 109 | "type": "string", 110 | "default": "ncbi_16s_18s", 111 | "title": "Choose a database", 112 | "description": "Sets the reference, databases and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. Workflow will require memory available to be slightly higher than the size of the database.", 113 | "enum": [ 114 | "ncbi_16s_18s", 115 | "ncbi_16s_18s_28s_ITS", 116 | "SILVA_138_1" 117 | ], 118 | "help_text": "This setting is overridable by providing an explicit taxonomy, database or reference path in the other reference options." 
119 | }, 120 | "store_dir": { 121 | "type": "string", 122 | "format": "directory-path", 123 | "title": "Store directory name", 124 | "description": "Where to store the initial download of the database.", 125 | "help_text": "The database set selected will be downloaded as part of the workflow and saved in this location; on subsequent runs, this stored copy will be used as the database.", 126 | "hidden": true, 127 | "default": "store_dir" 128 | }, 129 | "database": { 130 | "type": "string", 131 | "format": "path", 132 | "title": "Kraken2 database", 133 | "description": "Not required but can be used to specifically override Kraken2 database [.tar.gz or Directory].", 134 | "help_text": "By default uses database chosen in database_set parameter.", 135 | "overrides": { 136 | "epi2mecloud": { 137 | "hidden": true 138 | } 139 | } 140 | }, 141 | "taxonomy": { 142 | "type": "string", 143 | "format": "path", 144 | "title": "Taxonomy database", 145 | "description": "Not required but can be used to specifically override taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory].", 146 | "help_text": "By default, the NCBI taxonomy file will be downloaded and used." 147 | }, 148 | "reference": { 149 | "type": "string", 150 | "format": "file-path", 151 | "title": "Minimap2 reference", 152 | "description": "Override the FASTA reference file selected by the database_set parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index.", 153 | "help_text": "This option should be used in conjunction with the database parameter to specify a custom database." 154 | }, 155 | "ref2taxid": { 156 | "type": "string", 157 | "format": "file-path", 158 | "title": "File linking reference IDs to specific taxids", 159 | "description": "Not required but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row.", 160 | "help_text": "By default uses ref2taxid for option chosen in database_set parameter."
161 | }, 162 | "taxonomic_rank": { 163 | "type": "string", 164 | "default": "G", 165 | "title": "Taxonomic rank", 166 | "description": "Returns results at the taxonomic rank chosen. In the Kraken2 pipeline, this sets the level that Bracken will estimate abundance at. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species).", 167 | "enum": [ 168 | "S", 169 | "G", 170 | "F", 171 | "O", 172 | "C", 173 | "P" 174 | ] 175 | } 176 | }, 177 | "dependencies": { 178 | "reference": [ 179 | "ref2taxid" 180 | ], 181 | "ref2taxid": [ 182 | "reference" 183 | ] 184 | } 185 | }, 186 | "kraken2_options": { 187 | "title": "Kraken2 Options", 188 | "type": "object", 189 | "fa_icon": "fas fa-university", 190 | "help_text": "Kraken2: It is possible to enable classification by Kraken2, disabling alignment, which is a faster but coarser method of classification reliant on the presence of a Kraken2 database.", 191 | "properties": { 192 | "bracken_length": { 193 | "type": "integer", 194 | "title": "Bracken length", 195 | "description": "Set the length value Bracken will use", 196 | "minimum": 1, 197 | "help_text": "Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically. ncbi_16s_18s = 1000 , ncbi_16s_18s_28s_ITS = 1000 , PlusPF-8 = 300" 198 | }, 199 | "bracken_threshold": { 200 | "type": "integer", 201 | "title": "Bracken minimum read threshold", 202 | "description": "Set the minimum read threshold Bracken will use to consider a taxon", 203 | "default": 10, 204 | "minimum": 0, 205 | "help_text": "Bracken will only consider taxa with a read count greater than or equal to this value." 
206 | }, 207 | "kraken2_memory_mapping": { 208 | "type": "boolean", 209 | "default": false, 210 | "title": "Enable memory mapping", 211 | "description": "Avoids loading database into RAM", 212 | "help_text": "Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. It may be useful if the available RAM memory is lower than the size of the chosen database." 213 | }, 214 | "kraken2_confidence": { 215 | "type": "number", 216 | "default": 0.0, 217 | "title": "Confidence score threshold", 218 | "description": "Kraken2 Confidence score threshold. Default: 0.0. Valid interval: 0-1", 219 | "help_text": "Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works." 220 | } 221 | }, 222 | "description": "Kraken2 classification options. Only relevant if classifier parameter is set to kraken2" 223 | }, 224 | "minimap2_options": { 225 | "title": "Minimap2 Options", 226 | "type": "object", 227 | "fa_icon": "fas fa-dna", 228 | "properties": { 229 | "minimap2filter": { 230 | "type": "string", 231 | "title": "Select reads belonging to the following taxonomy identifiers (taxids)", 232 | "description": "Filter output of minimap2 by taxids inc. child nodes, E.g. \"9606,1404\"", 233 | "help_text": "Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs." 234 | }, 235 | "minimap2exclude": { 236 | "type": "boolean", 237 | "default": false, 238 | "title": "Exclude reads from previous selected taxids", 239 | "description": "Invert minimap2filter and exclude the given taxids instead", 240 | "help_text": "Exclude a list of taxids from analysis outputs." 
241 | }, 242 | "keep_bam": { 243 | "type": "boolean", 244 | "title": "Enable keep BAM files", 245 | "default": false, 246 | "description": "Copy bam files into the output directory." 247 | }, 248 | "minimap2_by_reference": { 249 | "type": "boolean", 250 | "default": false, 251 | "title": "Compute coverage and sequencing depth of the references.", 252 | "description": "Add a table with the mean sequencing depth per reference, standard deviation and coefficient of variation. It adds a scatterplot of the sequencing depth vs. the coverage and a heatmap showing the depth per percentile to the report" 253 | }, 254 | "min_percent_identity": { 255 | "type": "number", 256 | "default": 95, 257 | "minimum": 0, 258 | "maximum": 100, 259 | "title": "Filter taxa based on the percent of identity with the references.", 260 | "description": "Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified." 261 | }, 262 | "min_ref_coverage": { 263 | "type": "number", 264 | "default": 90, 265 | "minimum": 0, 266 | "maximum": 100, 267 | "title": "Filter taxa based on the percent of coverage with the reference.", 268 | "description": "Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths." 269 | } 270 | }, 271 | "description": "Minimap2 classification options. Only relevant if classifier parameter is set to minimap2.", 272 | "help_text": "Minimap2: The default strategy uses minimap2 to perform full alignments against FASTA-formatted references sequences." 
273 | }, 274 | "report_options": { 275 | "title": "Report Options", 276 | "type": "object", 277 | "fa_icon": "fas fa-pills", 278 | "properties": { 279 | "abundance_threshold": { 280 | "type": "number", 281 | "default": 1, 282 | "title": "Abundance threshold", 283 | "description": "Remove those taxa whose abundance is equal to or lower than the chosen value.", 284 | "help_text": "To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0 and 1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number larger than or equal to 1." 285 | }, 286 | "n_taxa_barplot": { 287 | "type": "integer", 288 | "default": 9, 289 | "title": "Number of taxa to be displayed in the barplot", 290 | "description": "Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the \"Other\" category." 291 | } 292 | } 293 | }, 294 | "output_options": { 295 | "title": "Output Options", 296 | "type": "object", 297 | "description": "Parameters for saving and naming workflow outputs.", 298 | "default": "", 299 | "properties": { 300 | "out_dir": { 301 | "type": "string", 302 | "format": "directory-path", 303 | "default": "output", 304 | "title": "Output folder name", 305 | "description": "Directory for output of all user-facing files." 306 | }, 307 | "igv": { 308 | "type": "boolean", 309 | "default": false, 310 | "title": "IGV", 311 | "description": "Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index." 312 | }, 313 | "include_read_assignments": { 314 | "type": "boolean", 315 | "default": false, 316 | "title": "Include Kraken2/Minimap2 taxonomy per read.", 317 | "description": "A per-sample TSV file that indicates the taxonomy assigned to each sequence.
These will only be output on completion of the workflow." 318 | }, 319 | "output_unclassified": { 320 | "type": "boolean", 321 | "default": false, 322 | "title": "Output unclassified reads.", 323 | "description": "Output a FASTQ of the unclassified reads." 324 | } 325 | } 326 | }, 327 | "advanced_options": { 328 | "title": "Advanced Options", 329 | "type": "object", 330 | "description": "Advanced options for configuring processes inside the workflow.", 331 | "default": "", 332 | "properties": { 333 | "min_len": { 334 | "type": "integer", 335 | "default": 800, 336 | "title": "Minimum read length", 337 | "description": "Specify read length lower limit.", 338 | "help_text": "Any reads shorter than this limit will not be included in the analysis." 339 | }, 340 | "min_read_qual": { 341 | "type": "number", 342 | "title": "Minimum read quality", 343 | "description": "Specify read quality lower limit.", 344 | "help_text": "Any reads with a quality lower than this limit will not be included in the analysis." 345 | }, 346 | "max_len": { 347 | "type": "integer", 348 | "title": "Maximum read length", 349 | "default": 2000, 350 | "description": "Specify read length upper limit", 351 | "help_text": "Any reads longer than this limit will not be included in the analysis." 352 | }, 353 | "threads": { 354 | "type": "integer", 355 | "default": 4, 356 | "title": "Number of CPU threads per workflow task", 357 | "description": "Maximum number of CPU threads to use in each parallel workflow task.", 358 | "help_text": "Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes." 
359 | } 360 | } 361 | }, 362 | "miscellaneous_options": { 363 | "title": "Miscellaneous Options", 364 | "type": "object", 365 | "fa_icon": "fas fa-file-import", 366 | "description": "Everything else.", 367 | "help_text": "These options are common to all nf-core pipelines and allow you to customise some of the core preferences for how the pipeline runs. Typically these options would be set in a Nextflow config file loaded for all pipeline runs, such as `~/.nextflow/config`.", 368 | "properties": { 369 | "disable_ping": { 370 | "type": "boolean", 371 | "default": false, 372 | "description": "Enable to prevent sending a workflow ping.", 373 | "overrides": { 374 | "epi2mecloud": { 375 | "hidden": true 376 | } 377 | } 378 | }, 379 | "help": { 380 | "type": "boolean", 381 | "title": "Display help text", 382 | "default": false, 383 | "fa_icon": "fas fa-question-circle", 384 | "hidden": true 385 | }, 386 | "version": { 387 | "type": "boolean", 388 | "title": "Display version", 389 | "default": false, 390 | "description": "Display version and exit.", 391 | "fa_icon": "fas fa-question-circle", 392 | "hidden": true 393 | } 394 | } 395 | } 396 | }, 397 | "allOf": [ 398 | { 399 | "$ref": "#/definitions/input_options" 400 | }, 401 | { 402 | "$ref": "#/definitions/sample_options" 403 | }, 404 | { 405 | "$ref": "#/definitions/reference_options" 406 | }, 407 | { 408 | "$ref": "#/definitions/kraken2_options" 409 | }, 410 | { 411 | "$ref": "#/definitions/minimap2_options" 412 | }, 413 | { 414 | "$ref": "#/definitions/output_options" 415 | }, 416 | { 417 | "$ref": "#/definitions/advanced_options" 418 | }, 419 | { 420 | "$ref": "#/definitions/miscellaneous_options" 421 | }, 422 | { 423 | "$ref": "#/definitions/report_options" 424 | } 425 | ], 426 | "properties": { 427 | "aws_image_prefix": { 428 | "type": "string", 429 | "title": "AWS image prefix", 430 | "hidden": true 431 | }, 432 | "aws_queue": { 433 | "type": "string", 434 | "title": "AWS queue", 435 | "hidden": true 436 | }, 437 | 
"monochrome_logs": { 438 | "type": "boolean" 439 | }, 440 | "validate_params": { 441 | "type": "boolean", 442 | "default": true 443 | }, 444 | "show_hidden_params": { 445 | "type": "boolean" 446 | } 447 | } 448 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 16S rRNA 2 | 3 | Taxonomic classification of 16S rRNA gene sequencing data. 4 | 5 | 6 | 7 | ## Introduction 8 | 9 | This workflow can be used for the following: 10 | 11 | + Taxonomic classification of 16S rRNA, 18S rRNA and ITS amplicons using [default or custom databases](#faqs). Default databases: 12 | - NCBI targeted loci: 16S rDNA, 18S rDNA, ITS (ncbi_16s_18s, ncbi_16s_18s_28s_ITS; see [here](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/) for details). 13 | + Generate taxonomic profiles of one or more samples. 14 | 15 | The workflow default parameters are optimised for analysis of 16S rRNA gene amplicons. 16 | For ITS amplicons, it is strongly recommended that some parameters are changed from the defaults, please see the [ITS presets](#analysing-its-amplicons) section for more information. 17 | 18 | Additional features: 19 | + Two different approaches are available: `minimap2` (using alignment, default option) or `kraken2` (k-mer based). 20 | + Results include: 21 | - An abundance table with counts per taxa in all the samples. 22 | - Interactive sankey and sunburst plots to explore the different identified lineages. 23 | - A bar plot comparing the abundances of the most abundant taxa in all the samples. 24 | 25 | 26 | 27 | 28 | ## Compute requirements 29 | 30 | Recommended requirements: 31 | 32 | + CPUs = 12 33 | + Memory = 32GB 34 | 35 | Minimum requirements: 36 | 37 | + CPUs = 6 38 | + Memory = 16GB 39 | 40 | Approximate run time: ~40min for 1 million reads in total (24 barcodes) using Minimap2 and the ncbi_16s_18s database. 
41 | 42 | ARM processor support: True 43 | 44 | 45 | 46 | 47 | ## Install and run 48 | 49 | 50 | These are instructions to install and run the workflow on the command line. 51 | You can also access the workflow via the 52 | [EPI2ME Desktop application](https://labs.epi2me.io/downloads/). 53 | 54 | The workflow uses [Nextflow](https://www.nextflow.io/) to manage 55 | compute and software resources; 56 | therefore, Nextflow will need to be 57 | installed before attempting to run the workflow. 58 | 59 | The workflow can currently be run using either 60 | [Docker](https://docs.docker.com/get-started/) 61 | or [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) 62 | to provide isolation of the required software. 63 | Both methods are automated out-of-the-box provided 64 | either Docker or Singularity is installed. 65 | This is controlled by the 66 | [`-profile`](https://www.nextflow.io/docs/latest/config.html#config-profiles) 67 | parameter as exemplified below. 68 | 69 | It is not required to clone or download the git repository 70 | in order to run the workflow. 71 | More information on running EPI2ME workflows can 72 | be found on our [website](https://labs.epi2me.io/wfindex). 73 | 74 | The following command can be used to obtain the workflow. 75 | This will pull the repository into the assets folder of 76 | Nextflow and provide a list of all parameters 77 | available for the workflow as well as an example command: 78 | 79 | ``` 80 | nextflow run epi2me-labs/wf-16s --help 81 | ``` 82 | To update a workflow to the latest version on the command line, use 83 | the following command: 84 | ``` 85 | nextflow pull epi2me-labs/wf-16s 86 | ``` 87 | 88 | A demo dataset is provided for testing of the workflow.
89 | It can be downloaded and unpacked using the following commands: 90 | ``` 91 | wget https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-16s/wf-16s-demo.tar.gz 92 | tar -xzvf wf-16s-demo.tar.gz 93 | ``` 94 | The workflow can then be run with the downloaded demo data using: 95 | ``` 96 | nextflow run epi2me-labs/wf-16s \ 97 | --fastq 'wf-16s-demo/test_data' \ 98 | --minimap2_by_reference \ 99 | -profile standard 100 | ``` 101 | 102 | For further information about running a workflow on 103 | the command line see https://labs.epi2me.io/wfquickstart/ 104 | 105 | 106 | 107 | 108 | ## Related protocols 109 | 110 | This workflow is designed to take input sequences that have been produced by [Oxford Nanopore Technologies](https://nanoporetech.com/) devices using protocols associated with either of the kits listed below: 111 | 112 | - [SQK-MAB114.24](https://nanoporetech.com/document/microbial-amplicon-barcoding-sequencing-for-16s-and-its-sqk-mab114-24) 113 | - [SQK-16S114.24](https://community.nanoporetech.com/docs/prepare/library_prep_protocols/rapid-sequencing-DNA-16s-barcoding-kit-v14-sqk-16114-24) 114 | 115 | Find related protocols in the [Nanopore community](https://community.nanoporetech.com/docs/). 116 | 117 | 118 | 119 | ## Input example 120 | 121 | This workflow accepts either FASTQ or BAM files as input. 122 | 123 | The FASTQ or BAM input parameters for this workflow accept one of three cases: (i) the path to a single FASTQ or BAM file; (ii) the path to a top-level directory containing FASTQ or BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ or BAM files. In the first and second cases (i and ii), a sample name can be supplied with `--sample`. In the last case (iii), the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. 
124 | 125 | ``` 126 | (i) (ii) (iii) 127 | input_reads.fastq ─── input_directory ─── input_directory 128 | ├── reads0.fastq ├── barcode01 129 | └── reads1.fastq │ ├── reads0.fastq 130 | │ └── reads1.fastq 131 | ├── barcode02 132 | │ ├── reads0.fastq 133 | │ ├── reads1.fastq 134 | │ └── reads2.fastq 135 | └── barcode03 136 | └── reads0.fastq 137 | ``` 138 | 139 | 140 | 141 | ## Input parameters 142 | 143 | ### Input Options 144 | 145 | | Nextflow parameter name | Type | Description | Help | Default | 146 | |--------------------------|------|-------------|------|---------| 147 | | fastq | string | FASTQ files to use in the analysis. | This accepts one of three cases: (i) the path to a single FASTQ file; (ii) the path to a top-level directory containing FASTQ files; (iii) the path to a directory containing one level of sub-directories which in turn contain FASTQ files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 148 | | bam | string | BAM or unaligned BAM (uBAM) files to use in the analysis. | This accepts one of three cases: (i) the path to a single BAM file; (ii) the path to a top-level directory containing BAM files; (iii) the path to a directory containing one level of sub-directories which in turn contain BAM files. In the first and second case, a sample name can be supplied with `--sample`. In the last case, the data is assumed to be multiplexed with the names of the sub-directories as barcodes. In this case, a sample sheet can be provided with `--sample_sheet`. | | 149 | | classifier | string | Kraken2 or Minimap2 workflow to be used for classification of reads. | Use Kraken2 for fast classification and minimap2 for finer resolution, see Readme for further info. 
| minimap2 | 150 | | analyse_unclassified | boolean | Analyse unclassified reads from input directory. By default the workflow will not process reads in the unclassified directory. | If selected and if the input is a multiplex directory the workflow will also process the unclassified directory. | False | 151 | | exclude_host | string | A FASTA or MMI file of the host reference. Reads that align with this reference will be excluded from the analysis. | | | 152 | 153 | 154 | ### Sample Options 155 | 156 | | Nextflow parameter name | Type | Description | Help | Default | 157 | |--------------------------|------|-------------|------|---------| 158 | | sample_sheet | string | A CSV file used to map barcodes to sample aliases. The sample sheet can be provided when the input data is a directory containing sub-directories with FASTQ files. | The sample sheet is a CSV file with, minimally, columns named `barcode`,`alias`. Extra columns are allowed. | | 159 | | sample | string | A single sample name for non-multiplexed data. Permissible if passing a single .fastq(.gz) file or directory of .fastq(.gz) files. | | | 160 | 161 | 162 | ### Reference Options 163 | 164 | | Nextflow parameter name | Type | Description | Help | Default | 165 | |--------------------------|------|-------------|------|---------| 166 | | database_set | string | Sets the reference, databases and taxonomy datasets that will be used for classifying reads. Choices: ['ncbi_16s_18s','ncbi_16s_18s_28s_ITS', 'SILVA_138_1']. Workflow will require memory available to be slightly higher than the size of the database. | This setting is overridable by providing an explicit taxonomy, database or reference path in the other reference options. | ncbi_16s_18s | 167 | | database | string | Not required but can be used to specifically override Kraken2 database [.tar.gz or Directory]. | By default uses database chosen in database_set parameter. 
| | 168 | | taxonomy | string | Not required but can be used to specifically override taxonomy database. Change the default to use a different taxonomy file [.tar.gz or directory]. | By default NCBI taxonomy file will be downloaded and used. | | 169 | | reference | string | Override the FASTA reference file selected by the database_set parameter. It can be a FASTA format reference sequence collection or a minimap2 MMI format index. | This option should be used in conjunction with the database parameter to specify a custom database. | | 170 | | ref2taxid | string | Not required but can be used to specify a ref2taxid mapping. Format is .tsv (refname taxid), no header row. | By default uses ref2taxid for option chosen in database_set parameter. | | 171 | | taxonomic_rank | string | Returns results at the taxonomic rank chosen. In the Kraken2 pipeline, this sets the level that Bracken will estimate abundance at. Default: G (genus). Other possible options are P (phylum), C (class), O (order), F (family), and S (species). | | G | 172 | 173 | 174 | ### Kraken2 Options 175 | 176 | | Nextflow parameter name | Type | Description | Help | Default | 177 | |--------------------------|------|-------------|------|---------| 178 | | bracken_length | integer | Set the length value Bracken will use | Should be set to the length used to generate the kmer distribution file supplied in the Kraken database input directory. For the default datasets these will be set automatically. ncbi_16s_18s = 1000 , ncbi_16s_18s_28s_ITS = 1000 , PlusPF-8 = 300 | | 179 | | bracken_threshold | integer | Set the minimum read threshold Bracken will use to consider a taxon | Bracken will only consider taxa with a read count greater than or equal to this value. | 10 | 180 | | kraken2_memory_mapping | boolean | Avoids loading database into RAM | Kraken 2 will by default load the database into process-local RAM; this flag will avoid doing so. 
It may be useful if the available RAM memory is lower than the size of the chosen database. | False | 181 | | kraken2_confidence | number | Kraken2 Confidence score threshold. Default: 0.0. Valid interval: 0-1 | Apply a threshold to determine if a sequence is classified or unclassified. See the [kraken2 manual section on confidence scoring](https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring) for further details about how it works. | 0.0 | 182 | 183 | 184 | ### Minimap2 Options 185 | 186 | | Nextflow parameter name | Type | Description | Help | Default | 187 | |--------------------------|------|-------------|------|---------| 188 | | minimap2filter | string | Filter output of minimap2 by taxids inc. child nodes, E.g. "9606,1404" | Provide a list of taxids if you are only interested in certain ones in your minimap2 analysis outputs. | | 189 | | minimap2exclude | boolean | Invert minimap2filter and exclude the given taxids instead | Exclude a list of taxids from analysis outputs. | False | 190 | | keep_bam | boolean | Copy bam files into the output directory. | | False | 191 | | minimap2_by_reference | boolean | Add a table with the mean sequencing depth per reference, standard deviation and coefficient of variation. It adds a scatterplot of the sequencing depth vs. the coverage and a heatmap showing the depth per percentile to the report | | False | 192 | | min_percent_identity | number | Minimum percentage of identity with the matched reference to define a sequence as classified; sequences with a value lower than this are defined as unclassified. | | 95 | 193 | | min_ref_coverage | number | Minimum coverage value to define a sequence as classified; sequences with a coverage value lower than this are defined as unclassified. Use this option if you expect reads whose lengths are similar to the references' lengths. 
| | 90 | 193 | 194 | 195 | 196 | ### Report Options 197 | 198 | | Nextflow parameter name | Type | Description | Help | Default | 199 | |--------------------------|------|-------------|------|---------| 200 | | abundance_threshold | number | Remove those taxa whose abundance is equal to or lower than the chosen value. | To remove taxa with abundances lower than or equal to a relative value (compared to the total number of reads), use a decimal between 0-1 (1 not inclusive). To remove taxa with abundances lower than or equal to an absolute value, provide a number larger than or equal to 1. | 1 | 201 | | n_taxa_barplot | integer | Number of most abundant taxa to be displayed in the barplot. The remaining taxa will be grouped under the "Other" category. | | 9 | 202 | 203 | 204 | ### Output Options 205 | 206 | | Nextflow parameter name | Type | Description | Help | Default | 207 | |--------------------------|------|-------------|------|---------| 208 | | out_dir | string | Directory for output of all user-facing files. | | output | 209 | | igv | boolean | Enable IGV visualisation in the EPI2ME Desktop Application by creating the required files. This will cause the workflow to emit the BAM files as well. If using a custom reference, this must be a FASTA file and not a minimap2 MMI format index. | | False | 210 | | include_read_assignments | boolean | Output a per-sample TSV file that indicates the taxonomy assigned to each sequence. These files will only be output on completion of the workflow. | | False | 211 | | output_unclassified | boolean | Output a FASTQ of the unclassified reads. | | False | 212 | 213 | 214 | ### Advanced Options 215 | 216 | | Nextflow parameter name | Type | Description | Help | Default | 217 | |--------------------------|------|-------------|------|---------| 218 | | min_len | integer | Specify read length lower limit. | Any reads shorter than this limit will not be included in the analysis. | 800 | 219 | | min_read_qual | number | Specify read quality lower limit.
| Any reads with a quality lower than this limit will not be included in the analysis. | | 220 | | max_len | integer | Specify read length upper limit. | Any reads longer than this limit will not be included in the analysis. | 2000 | 221 | | threads | integer | Maximum number of CPU threads to use in each parallel workflow task. | Several tasks in this workflow benefit from using multiple CPU threads. This option sets the number of CPU threads for all such processes. | 4 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | ## Outputs 229 | 230 | Output files may be aggregated, including information for all samples, or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}. 231 | 232 | | Title | File path | Description | Per sample or aggregated | 233 | |-------|-----------|-------------|--------------------------| 234 | | workflow report | wf-16s-report.html | Report for all samples. | aggregated | 235 | | Abundance table with counts per taxa | abundance_table_{{ taxonomic_rank }}.tsv | Per-taxa counts TSV, including all samples. | aggregated | 236 | | Bracken report file | bracken/{{ alias }}.kraken2_bracken.report | TSV file with the abundance of each taxon. More info about [bracken report](https://github.com/jenniferlu717/Bracken#output-kraken-style-bracken-report). | per-sample | 237 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.report.txt | Lineage-aggregated counts. More info about [kraken2 report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#sample-report-output-format). | per-sample | 238 | | Kraken2 taxonomic assignment per read (Kraken2 pipeline) | kraken2/{{ alias }}.kraken2.assignments.tsv | TSV file with the taxonomic assignment per read. More info about [kraken2 assignments report](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#standard-kraken-output-format).
| per-sample | 239 | | Host BAM file | host_bam/{{ alias }}.bam | BAM file generated from mapping filtered input reads to the host reference. | per-sample | 240 | | BAM index file of host reads | host_bam/{{ alias }}.bai | BAM index file generated from mapping filtered input reads to the host reference. | per-sample | 241 | | BAM file (minimap2) | bams/{{ alias }}.reference.bam | BAM file generated from mapping filtered input reads to the reference. | per-sample | 242 | | BAM index file (minimap2) | bams/{{ alias }}.reference.bam.bai | Index file generated from mapping filtered input reads to the reference. | per-sample | 243 | | BAM flagstat (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.flagstat.tsv | Mapping results per reference | per-sample | 244 | | Minimap2 alignment statistics (minimap2) | bams/{{ alias }}.bamstats_results/bamstats.readstats.tsv.gz | Per read stats after aligning | per-sample | 245 | | Reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz | Reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 246 | | Index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.fai | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 247 | | GZI index of the reduced reference FASTA file | igv_reference/reduced_reference.fasta.gz.gzi | Index of the reference FASTA file containing only those sequences that have reads mapped against them. | aggregated | 248 | | JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reduced reference. | aggregated | 249 | | Taxonomic assignment per read. | reads_assignments/{{ alias }}.*.assignments.tsv | TSV file with the taxonomic assignment per read. | per-sample | 250 | | FASTQ of the selected taxids. 
| extracted/{{ alias }}.minimap2.extracted.fastq | FASTQ containing/excluding the reads of the selected taxids. | per-sample | 251 | | Unclassified FASTQ. | unclassified/{{ alias }}.unclassified.fq.gz | FASTQ containing the reads that have not been classified against the database. | per-sample | 252 | | Alignment statistics TSV | alignment_tables/{{ alias }}.alignment-stats.tsv | Coverage and taxonomy of each reference. | per-sample | 253 | 254 | 255 | 256 | 257 | ## Pipeline overview 258 | 259 | 260 | ### Workflow defaults and parameters 261 | The workflow sets default values for parameters optimised for the analysis of full-length 16S rRNA gene amplicons, including `min_len`, `max_len`, `min_ref_coverage`, and `min_percent_identity`. 262 | Descriptions of the parameters and their defaults can be found in the [input parameters section](#input-parameters). 263 | 264 | #### Analysing ITS amplicons 265 | For analysis of ITS amplicons users should adjust the following parameters: 266 | - `min_len` should be decreased to 300, as ITS amplicons may be shorter than the current `min_len` default value which will cause them to be excluded. 267 | - `database_set` should be changed to `ncbi_16s_18s_28s_ITS` or a [custom database](#faqs) containing the relevant ITS references. 268 | 269 | ### 1. Concatenate input files and generate per read stats 270 | 271 | [fastcat](https://github.com/epi2me-labs/fastcat) is used to concatenate input FASTQ files prior to downstream processing of the workflow. It will also output per-read stats including read lengths and average qualities. 272 | 273 | You may want to choose which reads are analysed by filtering them using the flags `max_len`, `min_len` and `min_read_qual`. 274 | 275 | ### 2. Remove host sequences (optional) 276 | 277 | We have included an optional filtering step to remove any host sequences that map (using [Minimap2](https://github.com/lh3/minimap2)) against a provided host reference (e.g. 
human), which can be a FASTA file or an MMI index. To use this option, provide the path to your host reference with the `exclude_host` parameter. The mapped reads are output in a BAM file and excluded from further analysis. 278 | 279 | ``` 280 | nextflow run epi2me-labs/wf-16s --fastq test_data/case04/reads.fastq.gz --exclude_host test_data/case04/host.fasta.gz 281 | ``` 282 | 283 | ### 3. Classify reads taxonomically 284 | 285 | There are two different approaches to taxonomic classification: 286 | 287 | #### 3.1 Using Minimap2 288 | 289 | [Minimap2](https://github.com/lh3/minimap2) provides better resolution but, depending on the reference database used, can take significantly more time. This is the default option. 290 | 291 | ``` 292 | nextflow run epi2me-labs/wf-16s --fastq test_data/case01 --classifier minimap2 293 | ``` 294 | 295 | The creation of alignment statistics plots can be enabled with the `minimap2_by_reference` flag. Using this option produces a table and scatter plot in the report showing sequencing depth and coverage of each reference. The report also contains a heatmap indicating the sequencing depth over relative genomic coordinates for the references with the highest coverage (references with a mean coverage of less than 1% of the one with the largest value are omitted). 296 | 297 | In addition, the user can output BAM files in a folder called `bams` by using the `keep_bam` option. If the user provides a custom database and uses the `igv` option, the workflow will also output the references with read mappings, as well as an IGV configuration file. This configuration file allows the user to view the alignments in the EPI2ME Desktop Application in the Viewer tab. Note that the number of references can be reduced with the `abundance_threshold` option, which selects only those references with more reads aligned than this value. Keep in mind that the alignment view is highly dependent on the reference selected.
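Putting these options together, BAM output, per-reference statistics and the IGV files could be requested with a command of the following shape (this example reuses the demo data from the install section; adjust the input path for your own data):

```
nextflow run epi2me-labs/wf-16s \
    --fastq wf-16s-demo/test_data \
    --classifier minimap2 \
    --keep_bam \
    --minimap2_by_reference \
    --igv
```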
298 | 299 | #### 3.2 Using Kraken2 300 | 301 | [Kraken2](https://github.com/DerrickWood/kraken2) provides the fastest method for the taxonomic classification of the reads. [Bracken](https://github.com/jenniferlu717/Bracken) is then used to estimate the abundance in the sample at genus level (or the selected taxonomic rank). 302 | 303 | ### 4. Output 304 | 305 | The main output of the wf-16s pipeline is the `wf-16s-report.html` file, which can be found in the output directory. It contains a summary of read statistics, the taxonomic composition of the sample and some diversity metrics. The results shown in the report can also be customised with several options. For example, you can use `abundance_threshold` to remove all taxa less prevalent than the threshold from the abundance table. When setting this parameter to a natural number, taxa with fewer absolute counts are removed. You can also pass a decimal between 0.0-1.0 to drop taxa of lower relative abundance. Furthermore, `n_taxa_barplot` controls the number of taxa displayed in the bar plot and groups the rest under the category ‘Other’. 306 | 307 | You can use the `include_read_assignments` flag to output a per-sample TSV file indicating how each input sequence was classified and the taxon assigned to each read. 308 | 309 | For more information about the remaining workflow outputs, please see [minimap2 Options](#minimap2-options). 310 | 311 | ### 5. Diversity indices 312 | 313 | Species diversity refers to the taxonomic composition of a specific microbial community. There are some useful concepts to take into account: 314 | * Richness: the number of unique taxonomic groups present in the community. 315 | * Taxonomic group abundance: the number of individuals of a particular taxonomic group present in the community. 316 | * Evenness: the equitability of the different taxonomic groups in terms of their abundances.
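To make richness and evenness concrete, here is a small Python sketch (the community counts are invented for illustration) that computes the Shannon diversity and Pielou evenness indices defined below for two communities sharing the same richness:

```python
import math

def shannon(counts):
    """Shannon diversity index H = -sum(p_i * ln(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou(counts):
    """Pielou evenness J = H / ln(S), where S is the richness."""
    richness = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(richness)

# Two invented communities with identical richness (4 taxa each)
# but very different evenness.
even = [25, 25, 25, 25]   # J = 1.0, maximum evenness
uneven = [97, 1, 1, 1]    # J ~ 0.12, one dominant taxon
print(round(pielou(even), 2), round(pielou(uneven), 2))
```

Both communities contain four taxa, so their richness is identical, yet the dominant taxon in the second community drives its evenness towards zero.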
317 | Two different communities can host the same number of different taxonomic groups (i.e. they have the same richness) but still have different evenness, for instance if one taxon is much more abundant in one community than in the other. 318 | 319 | There are three types of biodiversity measures described over a spatial scale [1](https://doi.org/10.2307/1218190), [2](https://doi.org/10.1016/B978-0-12-384719-5.00036-8): alpha-, beta-, and gamma-diversity. 320 | * Alpha-diversity refers to the richness that occurs within a community in a given area within a region. 321 | * Beta-diversity, defined as the variation in the identities of species among sites, provides a direct link between biodiversity at local scales (alpha diversity) and the broader regional species pool (gamma diversity). 322 | * Gamma-diversity is the total observed richness within an entire region. 323 | 324 | To provide a quick overview of the alpha-diversity of the microbial community, we provide some of the most common diversity metrics calculated for a specific taxonomic rank [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4224527/), which can be chosen by the user with the `taxonomic_rank` parameter ('D'=Domain, 'P'=Phylum, 'C'=Class, 'O'=Order, 'F'=Family, 'G'=Genus, 'S'=Species). By default, the rank is 'G' (genus-level). Some of the included alpha diversity metrics are: 325 | 326 | * Shannon Diversity Index (H): Shannon entropy approaches zero if a community is almost entirely made up of a single taxon. 327 | 328 | ```math 329 | H = -\sum_{i=1}^{S}p_i*ln(p_i) 330 | ``` 331 | 332 | * Simpson's Diversity Index (D): with this definition, values range from 0 (high diversity) to 1 (low diversity, i.e. a single dominant taxon). 333 | 334 | ```math 335 | D = \sum_{i=1}^{S}p_i^2 336 | ``` 337 | 338 | * Pielou Index (J): the values range from 0 (presence of a dominant species) to 1 (maximum evenness).
339 | 340 | ```math 341 | J = H/ln(S) 342 | ``` 343 | 344 | * Berger-Parker dominance index (BP): expresses the proportional importance of the most abundant type, i.e., the ratio of the number of individuals of the most abundant species to the total number of individuals of all species in the sample. 345 | 346 | ```math 347 | BP = n_i/N 348 | ``` 349 | where $`n_i`$ refers to the counts of the most abundant taxon and N is the total number of counts. 350 | 351 | 352 | * Fisher’s alpha: Fisher (see Fisher, 1943[4](https://doi.org/10.2307/1411)) noticed that only a few species tend to be abundant while most are represented by only a few individuals ('rare biosphere'). These differences in species abundance can be incorporated into species diversity measurements such as Fisher’s alpha, an index based upon the log-series distribution of the number of individuals across species. 353 | 354 | ```math 355 | S = \alpha * ln(1 + N/\alpha) 356 | ``` 357 | where S is the total number of taxa and N is the total number of individuals in the sample. The value of Fisher's $`\alpha`$ is calculated by iteration. 358 | 359 | These indices are calculated by default using the original abundance table (see McMurdie and Holmes[5](https://pubmed.ncbi.nlm.nih.gov/24699258/), 2014 and Willis[6](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02407/full), 2019). If you want to calculate them from a rarefied abundance table (i.e. one in which all samples have been subsampled to contain the same number of counts per sample, which is 95% of the minimum number of total counts), you can download the rarefied table from the report. 360 | 361 | The report also includes the rarefaction curve per sample, which displays the mean species richness for a subsample of reads (sample size).
Generally, this curve grows rapidly at first, as the most abundant species are sequenced and add new taxa to the community, and then flattens because 'rare' species are harder to sample, so further increases in the number of observed species become less likely. 362 | 363 | > Note: Within each rank, each named taxon is a unique unit. The counts are the number of reads assigned to that taxon. All `Unknown` sequences are considered a single taxon. 364 | 365 | 366 | 367 | 368 | ## Troubleshooting 369 | 370 | + If the workflow fails, please run it with the demo dataset to ensure the workflow itself is working. This will help us determine whether the issue is related to the environment, the input parameters or a bug. 371 | + See how to interpret some common Nextflow exit codes [here](https://labs.epi2me.io/trouble-shooting/). 372 | + When using the Minimap2 pipeline with a custom database, you must make sure that the `ref2taxid` file, the reference and the taxonomy database are consistent with each other. 373 | + If your device doesn't have the resources to use large Kraken2 databases, you can enable `kraken2_memory_mapping` to reduce the amount of memory required. 374 | + To enable the IGV viewer with a custom reference, the reference must be a FASTA file and not a minimap2 MMI format index. 375 | 376 | 377 | 378 | 379 | ## FAQs 380 | 381 | If your question is not answered here, please report any issues or suggestions on the [GitHub issues](https://github.com/epi2me-labs/wf-16s/issues) page or start a discussion on the [community](https://community.nanoporetech.com/). 382 | 383 | + *Which database is used by default?* - By default, the workflow uses the NCBI 16S + 18S rRNA database. It will be downloaded the first time the workflow is run and re-used in subsequent runs.
384 | 385 | + *Are more databases available?* - Other 16S databases (listed below) can be selected with the `database_set` parameter, but the workflow can also be used with a custom database if required (see [here](https://labs.epi2me.io/how-to-meta-offline/) for details). 386 | * 16S, 18S, ITS 387 | * ncbi_16s_18s and ncbi_16s_18s_28s_ITS: Archaeal, bacterial and fungal 16S/18S and ITS data. There are two databases available using the data from [NCBI](https://www.ncbi.nlm.nih.gov/refseq/targetedloci/). 388 | * SILVA_138_1: The [SILVA](https://www.arb-silva.de/) database (version 138) is also available. Note that SILVA uses its own set of taxids, which do not match the NCBI taxids. We provide the respective taxdump files, but if you prefer using the NCBI ones, you can create them from the SILVA files ([NCBI](https://www.arb-silva.de/no_cache/download/archive/current/Exports/taxonomy/ncbi/)). As the SILVA database uses genus level, the last taxonomic rank at which the analysis is carried out is genus (`taxonomic_rank G`). 389 | 390 | + *How can I use Kraken2 indexes?* - There are different databases available [here](https://benlangmead.github.io/aws-indexes/k2). 391 | 392 | + *How can I use custom databases?* - If you want to run the workflow using your own Kraken2 database, you'll need to provide the database and an associated taxonomy dump. For a custom Minimap2 reference database, you'll need to provide a reference FASTA (or MMI) and an associated ref2taxid file. For a guide on how to build and use custom databases, take a look at our [article on how to run wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 393 | 394 | + *How can I run the workflow with less memory?* - 395 | When running in Kraken2 mode, you can set the `kraken2_memory_mapping` parameter if the available memory is smaller than the size of the database.
396 | 397 | + *How can I run the workflow offline?* - To run wf-16s offline you can use the workflow to download the databases from the internet and prepare them for offline re-use later. If you want to use one of the databases supported out of the box by the workflow, you can run the workflow with your desired database and any input (for example, the test data). The database will be downloaded and prepared in a directory on your computer. Once the database has been prepared, it will be used automatically the next time you run the workflow without needing to be downloaded again. You can find advice on picking a suitable database in our [article on selecting databases for wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 398 | 399 | + *When and how are coverage and identity filters applied when using the minimap2 approach?* - With minimap2-based classification, coverage and identity filtering is applied by using the `min_ref_coverage` and `min_percent_identity` options respectively. All reads that mapped to a reference, but failed to pass these filters, are relabelled as unclassified. If the `include_read_assignments` option is used, tables in the output will show read classifications after this filtering step. However, the output BAM file always contains the raw minimap2 alignment results. To read more about both filters, see [minimap2 Options](#minimap2-options). 400 | 401 | 402 | 403 | 404 | 405 | ## Related blog posts 406 | 407 | + [How to build and use databases to run wf-metagenomics and wf-16s offline](https://labs.epi2me.io/how-to-meta-offline/). 408 | + [Selecting the correct databases in the wf-metagenomics](https://labs.epi2me.io/metagenomic-databases/). 409 | + [How to evaluate unclassified sequences](https://epi2me.nanoporetech.com/post-meta-analysis/) 410 | 411 | See the [EPI2ME website](https://labs.epi2me.io/) for lots of other resources and blog posts. 
412 | 413 | 414 | 415 | 416 | --------------------------------------------------------------------------------