├── .gitattributes
├── .gitignore
├── docs
│   ├── images
│   │   ├── preseq_plot.png
│   │   ├── saturation.png
│   │   ├── cutadapt_plot.png
│   │   ├── dupRadar_plot.png
│   │   ├── infer_experiment.png
│   │   ├── nfcore-rnaseq_logo.ai
│   │   ├── read_duplication.png
│   │   ├── junction_saturation.png
│   │   ├── nf-core-rnaseq_logo.png
│   │   ├── star_alignment_plot.png
│   │   ├── inner_distance_concept.png
│   │   ├── mqc_hcplot_hocmzpdjsq.png
│   │   ├── mqc_hcplot_ltqchiyxfz.png
│   │   ├── mqc_hcplot_wtnqrdhkuc.png
│   │   ├── rseqc_read_dups_plot.png
│   │   ├── preseq_complexity_curve.png
│   │   ├── featureCounts_biotype_plot.png
│   │   ├── rseqc_infer_experiment_plot.png
│   │   ├── rseqc_inner_distance_plot.png
│   │   ├── featureCounts_assignment_plot.png
│   │   ├── rseqc_read_distribution_plot.png
│   │   ├── rseqc_junction_saturation_plot.png
│   │   └── rseqc_junction_annotation_junctions_plot.png
│   ├── README.md
│   ├── output.md
│   └── usage.md
├── assets
│   ├── nf-core-rnaseq_logo.png
│   ├── biotypes_header.txt
│   ├── heatmap_header.txt
│   ├── mdsplot_header.txt
│   ├── rrna-db-defaults.txt
│   ├── multiqc_config.yaml
│   ├── sendmail_template.txt
│   ├── where_are_my_files.txt
│   ├── email_template.txt
│   └── email_template.html
├── .github
│   ├── markdownlint.yml
│   ├── workflows
│   │   ├── branch.yml
│   │   ├── ci.yml
│   │   └── linting.yml
│   ├── ISSUE_TEMPLATE
│   │   ├── feature_request.md
│   │   └── bug_report.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── CONTRIBUTING.md
├── Dockerfile
├── conf
│   ├── awsbatch.config
│   ├── test.config
│   ├── test_gz.config
│   ├── base.config
│   └── igenomes.config
├── LICENSE
├── environment.yml
├── bin
│   ├── markdown_to_html.r
│   ├── se.r
│   ├── filter_gtf_for_genes_in_genome.py
│   ├── mqc_features_stat.py
│   ├── tximport.r
│   ├── parse_gtf.py
│   ├── edgeR_heatmap_MDS.r
│   ├── scrape_software_versions.py
│   ├── gtf2bed
│   └── dupRadar.r
├── .travis.yml
├── CODE_OF_CONDUCT.md
├── README.md
├── nextflow.config
├── CHANGELOG.md
└── parameters.settings.json

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.config linguist-language=nextflow
2 |
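The `.gitattributes` rule above tells GitHub's Linguist to classify `*.config` files as Nextflow. Attribute rules like this can be sanity-checked with `git check-attr`; a minimal sketch in Python (scratch repo, illustrative paths, assumes `git` is on the PATH):

```python
import pathlib
import subprocess
import tempfile

# Create a scratch repository containing the same .gitattributes rule.
tmp = pathlib.Path(tempfile.mkdtemp())
subprocess.run(["git", "init", "-q", str(tmp)], check=True)
(tmp / ".gitattributes").write_text("*.config linguist-language=nextflow\n")
(tmp / "nextflow.config").touch()

# git check-attr reports the attribute value resolved for a given path.
out = subprocess.run(
    ["git", "-C", str(tmp), "check-attr", "linguist-language", "--", "nextflow.config"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # nextflow.config: linguist-language: nextflow
```

Any path matching `*.config` resolves to the same attribute value, which is what makes GitHub colour and count these files as Nextflow rather than plain text.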
-------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .nextflow* 2 | work/ 3 | data/ 4 | results/ 5 | .DS_Store 6 | tests/test_data 7 | *.pyc 8 | -------------------------------------------------------------------------------- /docs/images/preseq_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/preseq_plot.png -------------------------------------------------------------------------------- /docs/images/saturation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/saturation.png -------------------------------------------------------------------------------- /assets/nf-core-rnaseq_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/assets/nf-core-rnaseq_logo.png -------------------------------------------------------------------------------- /docs/images/cutadapt_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/cutadapt_plot.png -------------------------------------------------------------------------------- /docs/images/dupRadar_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/dupRadar_plot.png -------------------------------------------------------------------------------- /docs/images/infer_experiment.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/infer_experiment.png 
-------------------------------------------------------------------------------- /docs/images/nfcore-rnaseq_logo.ai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/nfcore-rnaseq_logo.ai -------------------------------------------------------------------------------- /docs/images/read_duplication.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/read_duplication.png -------------------------------------------------------------------------------- /docs/images/junction_saturation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/junction_saturation.png -------------------------------------------------------------------------------- /docs/images/nf-core-rnaseq_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/nf-core-rnaseq_logo.png -------------------------------------------------------------------------------- /docs/images/star_alignment_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/star_alignment_plot.png -------------------------------------------------------------------------------- /docs/images/inner_distance_concept.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/inner_distance_concept.png -------------------------------------------------------------------------------- /docs/images/mqc_hcplot_hocmzpdjsq.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/mqc_hcplot_hocmzpdjsq.png -------------------------------------------------------------------------------- /docs/images/mqc_hcplot_ltqchiyxfz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/mqc_hcplot_ltqchiyxfz.png -------------------------------------------------------------------------------- /docs/images/mqc_hcplot_wtnqrdhkuc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/mqc_hcplot_wtnqrdhkuc.png -------------------------------------------------------------------------------- /docs/images/rseqc_read_dups_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_read_dups_plot.png -------------------------------------------------------------------------------- /docs/images/preseq_complexity_curve.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/preseq_complexity_curve.png -------------------------------------------------------------------------------- /docs/images/featureCounts_biotype_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/featureCounts_biotype_plot.png -------------------------------------------------------------------------------- /docs/images/rseqc_infer_experiment_plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_infer_experiment_plot.png -------------------------------------------------------------------------------- 
/docs/images/rseqc_inner_distance_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_inner_distance_plot.png
--------------------------------------------------------------------------------
/docs/images/featureCounts_assignment_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/featureCounts_assignment_plot.png
--------------------------------------------------------------------------------
/docs/images/rseqc_read_distribution_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_read_distribution_plot.png
--------------------------------------------------------------------------------
/docs/images/rseqc_junction_saturation_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_junction_saturation_plot.png
--------------------------------------------------------------------------------
/docs/images/rseqc_junction_annotation_junctions_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/viklund/rnaseq/master/docs/images/rseqc_junction_annotation_junctions_plot.png
--------------------------------------------------------------------------------
/.github/markdownlint.yml:
--------------------------------------------------------------------------------
 1 | # Markdownlint configuration file
 2 | default: true
 3 | line-length: false
 4 | no-multiple-blanks: 0
 5 | blanks-around-headers: false
 6 | blanks-around-lists: false
 7 | header-increment: false
 8 | no-duplicate-header:
 9 |   siblings_only: true
10 |
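One YAML pitfall worth noting for configs like the markdownlint file above: YAML has no trailing commas, so a value accidentally written as `true,` (a JSON habit) is parsed as the *string* `"true,"` rather than a boolean, silently changing the config's meaning. A quick illustration with PyYAML (assumed installed; not part of this repository):

```python
import yaml  # PyYAML, assumed installed

# A bare scalar `true` resolves to a YAML boolean...
assert yaml.safe_load("default: true") == {"default": True}

# ...but with a stray trailing comma it is just a plain string.
assert yaml.safe_load("default: true,") == {"default": "true,"}
```

Because the broken value is still syntactically valid YAML, a plain syntax check will not flag it; the consumer simply sees an unexpected string.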
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM nfcore/base:1.7
2 | LABEL authors="phil.ewels@scilifelab.se" \
3 |     description="Docker image containing all requirements for the nfcore/rnaseq pipeline"
4 |
5 | COPY environment.yml /
6 | RUN conda env create -f /environment.yml && conda clean -a
7 | ENV PATH /opt/conda/envs/nf-core-rnaseq-1.4.2/bin:$PATH
8 |
--------------------------------------------------------------------------------
/assets/biotypes_header.txt:
--------------------------------------------------------------------------------
 1 | # id: 'biotype-counts'
 2 | # section_name: 'Biotype Counts'
 3 | # description: "shows reads overlapping genomic features of different biotypes,
 4 | #     counted by featureCounts."
 5 | # plot_type: 'bargraph'
 6 | # anchor: 'featurecounts_biotype'
 7 | # pconfig:
 8 | #     id: "featureCounts_biotype_plot"
 9 | #     title: "featureCounts: Biotypes"
10 | #     xlab: "# Reads"
11 | #     cpswitch_counts_label: "Number of Reads"
12 |
--------------------------------------------------------------------------------
/assets/heatmap_header.txt:
--------------------------------------------------------------------------------
1 | # id: 'sample-similarity'
2 | # section_name: 'edgeR: Sample Similarity'
3 | # description: "is generated from normalised gene counts through
4 | #     edgeR.
5 | #     Pearson's correlation between log2 normalised CPM values is then calculated and clustered."
 6 | # plot_type: 'heatmap'
 7 | # anchor: 'ngi_rnaseq-sample_similarity'
 8 | # pconfig:
 9 | #     title: "edgeR: Pearson's correlation"
10 | #     xlab: True
11 | #     reverseColors: True
12 |
--------------------------------------------------------------------------------
/assets/mdsplot_header.txt:
--------------------------------------------------------------------------------
 1 | # id: 'edgeR-sample-distances'
 2 | # section_name: 'MDS Plot'
 3 | # description: "shows relatedness between samples in a project.
 4 | #     These values are calculated using edgeR
 5 | #     in the edgeR_heatmap_MDS.r script."
 6 | # plot_type: 'scatter'
 7 | # anchor: 'ngi_rnaseq-mds_plot'
 8 | # pconfig:
 9 | #     xlab: 'Leading'
10 | #     title: 'MDS Plot'
11 | #     ylab: 'logFC'
12 |
--------------------------------------------------------------------------------
/docs/README.md:
--------------------------------------------------------------------------------
 1 | # nf-core/rnaseq: Documentation
 2 |
 3 | The nf-core/rnaseq documentation is split into the following files:
 4 |
 5 | 1. [Installation](https://nf-co.re/usage/installation)
 6 | 2. Pipeline configuration
 7 |     * [Local installation](https://nf-co.re/usage/local_installation)
 8 |     * [Adding your own system config](https://nf-co.re/usage/adding_own_config)
 9 |     * [Reference genomes](https://nf-co.re/usage/reference_genomes)
10 | 3. [Running the pipeline](usage.md)
11 | 4. [Output and how to interpret the results](output.md)
12 | 5. 
[Troubleshooting](https://nf-co.re/usage/troubleshooting)
13 |
--------------------------------------------------------------------------------
/.github/workflows/branch.yml:
--------------------------------------------------------------------------------
 1 | name: nf-core/rnaseq branch protection
 2 | # This workflow is triggered on PRs to master branch on the repository
 3 | on:
 4 |   pull_request:
 5 |     branches:
 6 |       - master
 7 |
 8 | jobs:
 9 |   test:
10 |     runs-on: ubuntu-latest
11 |     steps:
12 |       # PRs are only ok if coming from an nf-core dev or patch branch
13 |       - uses: actions/checkout@v1
14 |       - name: Check PRs
15 |         run: |
16 |           [[ $(git remote get-url origin) == *nf-core/rnaseq ]] && [[ ${GITHUB_BASE_REF} = "master" ]] && { [[ ${GITHUB_HEAD_REF} = "dev" ]] || [[ ${GITHUB_HEAD_REF} = "patch" ]]; }
17 |
--------------------------------------------------------------------------------
/conf/awsbatch.config:
--------------------------------------------------------------------------------
 1 | /*
 2 |  * -------------------------------------------------
 3 |  *  Nextflow config file for running on AWS batch
 4 |  * -------------------------------------------------
 5 |  * Base config needed for running with -profile awsbatch
 6 |  */
 7 | params {
 8 |   config_profile_name = 'AWSBATCH'
 9 |   config_profile_description = 'AWSBATCH Cloud Profile'
10 |   config_profile_contact = 'Alexander Peltzer (@apeltzer)'
11 |   config_profile_url = 'https://aws.amazon.com/de/batch/'
12 | }
13 |
14 | aws.region = params.awsregion
15 | process.executor = 'awsbatch'
16 | process.queue = params.awsqueue
17 | executor.awscli = '/home/ec2-user/miniconda/bin/aws'
18 | params.tracedir = './'
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | Hi there!
2 |
3 | Thanks for suggesting a new feature for the pipeline!
Please delete this text and anything that's not relevant from the template below: 4 | 5 | #### Is your feature request related to a problem? Please describe. 6 | A clear and concise description of what the problem is. 7 | Ex. I'm always frustrated when [...] 8 | 9 | #### Describe the solution you'd like 10 | A clear and concise description of what you want to happen. 11 | 12 | #### Describe alternatives you've considered 13 | A clear and concise description of any alternative solutions or features you've considered. 14 | 15 | #### Additional context 16 | Add any other context about the feature request here. 17 | -------------------------------------------------------------------------------- /assets/rrna-db-defaults.txt: -------------------------------------------------------------------------------- 1 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/rfam-5.8s-database-id98.fasta 2 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/rfam-5s-database-id98.fasta 3 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-arc-16s-id95.fasta 4 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-arc-23s-id98.fasta 5 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-bac-16s-id90.fasta 6 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-bac-23s-id98.fasta 7 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-euk-18s-id95.fasta 8 | https://raw.githubusercontent.com/biocore/sortmerna/master/rRNA_databases/silva-euk-28s-id98.fasta -------------------------------------------------------------------------------- /assets/multiqc_config.yaml: -------------------------------------------------------------------------------- 1 | extra_fn_clean_exts: 2 | - '_R1' 3 | - '_R2' 4 | - '.hisat' 5 | - '_subsamp' 6 | - '.sorted' 7 | 8 | report_comment: > 9 | This report has been generated by the 
nf-core/rnaseq
10 |   analysis pipeline. For information about how to interpret these results, please see the
11 |   documentation.
12 |
13 | top_modules:
14 |   - 'edgeR-sample-distances'
15 |   - 'sample-similarity'
16 |   - 'DupRadar'
17 |   - 'biotype-counts'
18 |
19 | report_section_order:
20 |   software_versions:
21 |     order: -1000
22 |   nf-core-rnaseq-summary:
23 |     order: -1100
24 |
25 | table_columns_visible:
26 |   FastQC:
27 |     percent_duplicates: False
28 |
29 | export_plots: true
30 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
 1 | Hi there!
 2 |
 3 | Thanks for telling us about a problem with the pipeline. Please delete this text and anything that's not relevant from the template below:
 4 |
 5 | #### Describe the bug
 6 | A clear and concise description of what the bug is.
 7 |
 8 | #### Steps to reproduce
 9 | Steps to reproduce the behaviour:
10 | 1. Command line: `nextflow run ...`
11 | 2. See error: _Please provide your error message_
12 |
13 | #### Expected behaviour
14 | A clear and concise description of what you expected to happen.
15 |
16 | #### System:
17 | - Hardware: [e.g. HPC, Desktop, Cloud...]
18 | - Executor: [e.g. slurm, local, awsbatch...]
19 | - OS: [e.g. CentOS Linux, macOS, Linux Mint...]
20 | - Version: [e.g. 7, 10.13.6, 18.3...]
21 |
22 | #### Nextflow Installation:
23 | - Version: [e.g. 0.31.0]
24 |
25 | #### Container engine:
26 | - Engine: [e.g. Conda, Docker or Singularity]
27 | - Version: [e.g. 1.0.0]
28 | - Image tag: [e.g. nfcore/rnaseq:1.0.0]
29 |
30 | #### Additional context
31 | Add any other context about the problem here.
32 |
--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | Many thanks for contributing to nf-core/rnaseq!
2 | 3 | To ensure that your build passes, please make sure your pull request is to the `dev` branch rather than to `master`. Thank you! 4 | 5 | Please fill in the appropriate checklist below (delete whatever is not relevant). These are the most common things requested on pull requests (PRs). 6 | 7 | ## PR checklist 8 | - [ ] PR is to `dev` rather than `master` 9 | - [ ] This comment contains a description of changes (with reason) 10 | - [ ] If you've fixed a bug or added code that should be tested, add tests! 11 | - [ ] If necessary, also make a PR on the [nf-core/rnaseq branch on the nf-core/test-datasets repo]( https://github.com/nf-core/test-datasets/pull/new/nf-core/rnaseq) 12 | - [ ] Ensure the test suite passes (`nextflow run . -profile test,docker`). 13 | - [ ] Make sure your code lints (`nf-core lint .`). 14 | - [ ] Documentation in `docs` is updated 15 | - [ ] `CHANGELOG.md` is updated 16 | - [ ] `README.md` is updated 17 | 18 | **Learn more about contributing:** https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md 19 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) Phil Ewels, Rickard Hammarén 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
 1 | # You can use this file to create a conda environment for this pipeline:
 2 | #   conda env create -f environment.yml
 3 | name: nf-core-rnaseq-1.4.2
 4 | channels:
 5 |   - conda-forge
 6 |   - bioconda
 7 |   - defaults
 8 | dependencies:
 9 |   ## conda-forge packages, sorted alphabetically (ignoring any channel prefix)
10 |   - matplotlib=3.0.3 # Current 3.1.0 build incompatible with multiqc=1.7
11 |   - r-base=3.6.1
12 |   - conda-forge::r-data.table=1.12.4
13 |   - conda-forge::r-gplots=3.0.1.1
14 |   - conda-forge::r-markdown=1.1
15 |
16 |   ## bioconda packages, see above
17 |   - bioconductor-dupradar=1.14.0
18 |   - bioconductor-edger=3.26.5
19 |   - bioconductor-tximeta=1.2.2
20 |   - bioconductor-summarizedexperiment=1.14.0
21 |   - deeptools=3.3.1
22 |   - fastqc=0.11.8
23 |   - gffread=0.11.4
24 |   - hisat2=2.1.0
25 |   - multiqc=1.7
26 |   - picard=2.21.1
27 |   - preseq=2.0.3
28 |   - qualimap=2.2.2c
29 |   - rseqc=3.0.1
30 |   - salmon=0.14.2
31 |   - samtools=1.9
32 |   - sortmerna=2.1b # for metatranscriptomics
33 |   - star=2.6.1d # Don't upgrade me - 2.7X indices incompatible with iGenomes.
34 | - stringtie=2.0 35 | - subread=1.6.4 36 | - trim-galore=0.6.4 37 | -------------------------------------------------------------------------------- /.github/workflows/ci.yml: -------------------------------------------------------------------------------- 1 | name: nf-core/rnaseq CI 2 | # This workflow is triggered on pushes and PRs to the repository. 3 | on: [push, pull_request] 4 | 5 | jobs: 6 | test: 7 | runs-on: ubuntu-latest 8 | strategy: 9 | matrix: 10 | nxf_ver: ['19.04.0', ''] 11 | aligner: ["--aligner 'hisat2'", "--aligner 'star'", "--pseudo_aligner 'salmon'"] 12 | options: ['--skipQC', '--remove_rRNA', '--saveUnaligned', '--skipTrimming', '--star_index false'] 13 | steps: 14 | - uses: actions/checkout@v1 15 | - name: Install Nextflow 16 | run: | 17 | export NXF_VER=${{ matrix.nxf_ver }} 18 | wget -qO- get.nextflow.io | bash 19 | sudo mv nextflow /usr/local/bin/ 20 | - name: Download image 21 | run: | 22 | docker pull nfcore/rnaseq:dev && docker tag nfcore/rnaseq:dev nfcore/rnaseq:1.4.2 23 | - name: Basic workflow tests 24 | run: | 25 | nextflow run ${GITHUB_WORKSPACE} -profile test,docker ${{ matrix.aligner }} ${{ matrix.options }} 26 | - name: Basic workflow, gzipped input 27 | run: | 28 | nextflow run ${GITHUB_WORKSPACE} -profile test_gz,docker ${{ matrix.aligner }} ${{ matrix.options }} 29 | -------------------------------------------------------------------------------- /assets/sendmail_template.txt: -------------------------------------------------------------------------------- 1 | To: $email 2 | Subject: $subject 3 | Mime-Version: 1.0 4 | Content-Type: multipart/related;boundary="nfcoremimeboundary" 5 | 6 | --nfcoremimeboundary 7 | Content-Type: text/html; charset=utf-8 8 | 9 | $email_html 10 | 11 | --nfcoremimeboundary 12 | Content-Type: image/png;name="nf-core-rnaseq_logo.png" 13 | Content-Transfer-Encoding: base64 14 | Content-ID: 15 | Content-Disposition: inline; filename="nf-core-rnaseq_logo.png" 16 | 17 | <% out << new 
File("$baseDir/assets/nf-core-rnaseq_logo.png"). 18 | bytes. 19 | encodeBase64(). 20 | toString(). 21 | tokenize( '\n' )*. 22 | toList()*. 23 | collate( 76 )*. 24 | collect { it.join() }. 25 | flatten(). 26 | join( '\n' ) %> 27 | 28 | <% 29 | if (mqcFile){ 30 | def mqcFileObj = new File("$mqcFile") 31 | if (mqcFileObj.length() < mqcMaxSize){ 32 | out << """ 33 | --nfcoremimeboundary 34 | Content-Type: text/html; name=\"multiqc_report\" 35 | Content-Transfer-Encoding: base64 36 | Content-ID: 37 | Content-Disposition: attachment; filename=\"${mqcFileObj.getName()}\" 38 | 39 | ${mqcFileObj. 40 | bytes. 41 | encodeBase64(). 42 | toString(). 43 | tokenize( '\n' )*. 44 | toList()*. 45 | collate( 76 )*. 46 | collect { it.join() }. 47 | flatten(). 48 | join( '\n' )} 49 | """ 50 | }} 51 | %> 52 | 53 | --nfcoremimeboundary-- 54 | -------------------------------------------------------------------------------- /bin/markdown_to_html.r: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # Command line argument processing 4 | args = commandArgs(trailingOnly=TRUE) 5 | if (length(args) < 2) { 6 | stop("Usage: markdown_to_html.r ", call.=FALSE) 7 | } 8 | markdown_fn <- args[1] 9 | output_fn <- args[2] 10 | 11 | # Load / install packages 12 | if (!require("markdown")) { 13 | install.packages("markdown", dependencies=TRUE, repos='http://cloud.r-project.org/') 14 | library("markdown") 15 | } 16 | 17 | base_css_fn <- getOption("markdown.HTML.stylesheet") 18 | base_css <- readChar(base_css_fn, file.info(base_css_fn)$size) 19 | custom_css <- paste(base_css, " 20 | body { 21 | padding: 3em; 22 | margin-right: 350px; 23 | max-width: 100%; 24 | } 25 | #toc { 26 | position: fixed; 27 | right: 20px; 28 | width: 300px; 29 | padding-top: 20px; 30 | overflow: scroll; 31 | height: calc(100% - 3em - 20px); 32 | } 33 | #toc_header { 34 | font-size: 1.8em; 35 | font-weight: bold; 36 | } 37 | #toc > ul { 38 | padding-left: 0; 39 | 
list-style-type: none; 40 | } 41 | #toc > ul ul { padding-left: 20px; } 42 | #toc > ul > li > a { display: none; } 43 | img { max-width: 800px; } 44 | ") 45 | 46 | markdownToHTML( 47 | file = markdown_fn, 48 | output = output_fn, 49 | stylesheet = custom_css, 50 | options = c('toc', 'base64_images', 'highlight_code') 51 | ) 52 | -------------------------------------------------------------------------------- /conf/test.config: -------------------------------------------------------------------------------- 1 | /* 2 | * ------------------------------------------------- 3 | * Nextflow config file for running tests 4 | * ------------------------------------------------- 5 | * Defines bundled input files and everything required 6 | * to run a fast and simple test. Use as follows: 7 | * nextflow run nf-core/rnaseq -profile test 8 | */ 9 | 10 | params { 11 | config_profile_name = 'Test profile' 12 | config_profile_description = 'Minimal test dataset to check pipeline function' 13 | // Limit resources so that this can run CI 14 | max_cpus = 2 15 | max_memory = 6.GB 16 | max_time = 48.h 17 | 18 | // Input data 19 | singleEnd = true 20 | readPaths = [ 21 | ['SRR4238351', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238351_subsamp.fastq.gz']], 22 | ['SRR4238355', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238355_subsamp.fastq.gz']], 23 | ['SRR4238359', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238359_subsamp.fastq.gz']], 24 | ['SRR4238379', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238379_subsamp.fastq.gz']], 25 | ] 26 | // Genome references 27 | fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genome.fa' 28 | gtf = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gtf' 29 | gff = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/genes.gff' 30 | transcript_fasta = 
'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta' 31 | } 32 | -------------------------------------------------------------------------------- /.github/workflows/linting.yml: -------------------------------------------------------------------------------- 1 | name: nf-core/rnaseq linting 2 | # This workflow is triggered on pushes and PRs to the repository. 3 | on: [push, pull_request] 4 | 5 | jobs: 6 | Markdown: 7 | runs-on: ubuntu-latest 8 | steps: 9 | - uses: actions/checkout@v1 10 | - uses: actions/setup-node@v1 11 | with: 12 | node-version: '10' 13 | - name: Install markdownlint 14 | run: | 15 | npm install -g markdownlint-cli 16 | - name: Run Markdownlint 17 | run: | 18 | markdownlint ${GITHUB_WORKSPACE} -c ${GITHUB_WORKSPACE}/.github/markdownlint.yml 19 | YAML: 20 | runs-on: ubuntu-latest 21 | steps: 22 | - uses: actions/checkout@v1 23 | - uses: actions/setup-node@v1 24 | with: 25 | node-version: '10' 26 | - name: Install yamllint 27 | run: | 28 | npm install -g yaml-lint 29 | - name: Run yamllint 30 | run: | 31 | yamllint $(find ${GITHUB_WORKSPACE} -type f -name "*.yml") 32 | nf-core: 33 | runs-on: ubuntu-latest 34 | steps: 35 | - uses: actions/checkout@v1 36 | - name: Install Nextflow 37 | run: | 38 | wget -qO- get.nextflow.io | bash 39 | sudo mv nextflow /usr/local/bin/ 40 | - uses: actions/setup-python@v1 41 | with: 42 | python-version: '3.6' 43 | architecture: 'x64' 44 | - name: Install pip 45 | run: | 46 | sudo apt install python3-pip 47 | pip install --upgrade pip 48 | - name: Install nf-core tools 49 | run: | 50 | pip install nf-core 51 | - name: Run nf-core lint 52 | run: | 53 | nf-core lint ${GITHUB_WORKSPACE} -------------------------------------------------------------------------------- /assets/where_are_my_files.txt: -------------------------------------------------------------------------------- 1 | ===================== 2 | Where are my files? 
3 | ===================== 4 | 5 | By default, the nf-core/rnaseq pipeline does not save large intermediate files to the 6 | results directory. This is to try to conserve disk space. 7 | 8 | These files can be found in the pipeline `work` directory if needed. 9 | Alternatively, re-run the pipeline using `-resume` in addition to one of 10 | the below command-line options and they will be copied into the results directory: 11 | 12 | `--saveAlignedIntermediates` 13 | The final BAM files created after the Picard MarkDuplicates step are always saved 14 | and can be found in the `markDuplicates/` folder. 15 | Specify this flag to also copy out BAM files from STAR / HISAT2 alignment and sorting steps. 16 | 17 | `--saveTrimmed` 18 | Specify to save trimmed FastQ files to the results directory. 19 | 20 | `--saveReference` 21 | Save any downloaded or generated reference genome files to your results folder. 22 | These can then be used for future pipeline runs, reducing processing times. 23 | 24 | ----------------------------------- 25 | Setting defaults in a config file 26 | ----------------------------------- 27 | If you would always like these files to be saved without having to specify this on 28 | the command line, you can save the following to your personal configuration file 29 | (eg. 
`~/.nextflow/config`):
30 |
31 | params.saveReference = true
32 | params.saveTrimmed = true
33 | params.saveAlignedIntermediates = true
34 |
35 | For more help, see the following documentation:
36 |
37 | https://github.com/nf-core/rnaseq/blob/master/docs/usage.md
38 | https://www.nextflow.io/docs/latest/getstarted.html
39 | https://www.nextflow.io/docs/latest/config.html
40 |
--------------------------------------------------------------------------------
/bin/se.r:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 |
 3 | args = commandArgs(trailingOnly=TRUE)
 4 | if (length(args) < 3) {
 5 |     stop("Usage: tximeta.r <coldata> <counts> <tpm>", call.=FALSE)
 6 | }
 7 |
 8 | coldata = args[1]
 9 | counts_fn = args[2]
10 | tpm_fn = args[3]
11 |
12 | tx2gene = "tx2gene.csv"
13 | info = file.info(tx2gene)
14 | if (info$size == 0){
15 |     tx2gene = NULL
16 | }else{
17 |     rowdata = read.csv(tx2gene, header = FALSE)
18 |     colnames(rowdata) = c("tx", "gene_id", "gene_name")
19 |     tx2gene = rowdata[,1:2]
20 | }
21 |
22 | counts = read.csv(counts_fn, row.names=1)
23 | tpm = read.csv(tpm_fn, row.names=1)
24 |
25 | if (length(intersect(rownames(counts), rowdata[["tx"]])) > length(intersect(rownames(counts), rowdata[["gene_id"]]))){
26 |     by_what = "tx"
27 | } else {
28 |     by_what = "gene_id"
29 |     rowdata = unique(rowdata[,2:3])
30 | }
31 |
32 | if (file.exists(coldata)){
33 |     coldata = read.csv(coldata)
34 |     coldata = coldata[match(colnames(counts), coldata[,1]),]
35 |     coldata = cbind(files = colnames(counts), coldata)
36 | }else{
37 |     message("ColData not available ", coldata)
38 |     coldata = data.frame(files = colnames(counts), names = colnames(counts))
39 | }
40 | library(SummarizedExperiment)
41 |
42 | rownames(coldata) = coldata[["names"]]
43 | extra = setdiff(rownames(counts), as.character(rowdata[[by_what]]))
44 | if (length(extra) > 0){
45 |     rowdata = rbind(rowdata,
46 |                     data.frame(tx=extra,
47 |                                gene_id=extra,
48 |                                gene_name=extra))
49 | }
50 |
51 | rowdata =
rowdata[match(rownames(counts), as.character(rowdata[[by_what]])),] 52 | rownames(rowdata) = rowdata[[by_what]] 53 | se = SummarizedExperiment(assays = list(counts = counts, 54 | abundance = tpm), 55 | colData = DataFrame(coldata), 56 | rowData = rowdata) 57 | 58 | saveRDS(se, file = paste0(tools::file_path_sans_ext(counts_fn), ".rds")) 59 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | sudo: required 2 | language: python 3 | jdk: openjdk8 4 | services: docker 5 | python: '3.6' 6 | cache: pip 7 | matrix: 8 | fast_finish: true 9 | 10 | before_install: 11 | # PRs to master are only ok if coming from dev branch 12 | - '[ $TRAVIS_PULL_REQUEST = "false" ] || [ $TRAVIS_BRANCH != "master" ] || ([ $TRAVIS_PULL_REQUEST_SLUG = $TRAVIS_REPO_SLUG ] && ([ $TRAVIS_PULL_REQUEST_BRANCH = "dev" ] || [ $TRAVIS_PULL_REQUEST_BRANCH = "patch" ]))' 13 | # Pull the docker image first so the test doesn't wait for this 14 | - docker pull nfcore/rnaseq:dev 15 | # Fake the tag locally so that the pipeline runs properly 16 | # Looks weird when this is :dev to :dev, but makes sense when testing code for a release (:dev to :1.0.1) 17 | - docker tag nfcore/rnaseq:dev nfcore/rnaseq:1.4.2 18 | 19 | install: 20 | # Install Nextflow 21 | - mkdir /tmp/nextflow && cd /tmp/nextflow 22 | - wget -qO- get.nextflow.io | bash 23 | - sudo ln -s /tmp/nextflow/nextflow /usr/local/bin/nextflow 24 | # Install nf-core/tools 25 | - pip install --upgrade pip 26 | - pip install nf-core 27 | # Reset 28 | - mkdir ${TRAVIS_BUILD_DIR}/tests && cd ${TRAVIS_BUILD_DIR}/tests 29 | # Install markdownlint-cli 30 | - sudo apt-get install npm && npm install -g markdownlint-cli 31 | 32 | env: 33 | - NXF_VER='19.04.0' # Specify a minimum NF version that should be tested and work 34 | - NXF_VER='' # Plus: get the latest NF version and check that it works 35 | 36 | script: 37 | # Lint the pipeline 
code 38 | - nf-core lint ${TRAVIS_BUILD_DIR} 39 | # Lint the documentation 40 | - markdownlint ${TRAVIS_BUILD_DIR} -c ${TRAVIS_BUILD_DIR}/.github/markdownlint.yml 41 | # Run, build reference genome with STAR 42 | - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker 43 | # Run, build reference genome with HISAT2 44 | - nextflow run ${TRAVIS_BUILD_DIR} -profile test,docker --aligner hisat2 45 | # Mini Test for gzipped stuff with STAR 46 | - nextflow run ${TRAVIS_BUILD_DIR} -profile test_gz,docker --fasta false --pseudo_aligner 'salmon' --skipAlignment 47 | -------------------------------------------------------------------------------- /conf/test_gz.config: -------------------------------------------------------------------------------- 1 | /* 2 | * ------------------------------------------------- 3 | * Nextflow config file for running tests 4 | * ------------------------------------------------- 5 | * Defines bundled input files and everything required 6 | * to run a fast and simple test. 
Use as follows: 7 | * nextflow run nf-core/rnaseq -profile test_gz 8 | */ 9 | 10 | params { 11 | config_profile_name = 'Test profile - gzipped inputs' 12 | config_profile_description = 'Minimal test dataset to check pipeline function with gzipped input files' 13 | // Limit resources so that this can run on Travis 14 | max_cpus = 2 15 | max_memory = 6.GB 16 | max_time = 48.h 17 | // Input data 18 | singleEnd = true 19 | readPaths = [ 20 | ['SRR4238351', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238351_subsamp.fastq.gz']], 21 | ['SRR4238355', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238355_subsamp.fastq.gz']], 22 | ['SRR4238359', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238359_subsamp.fastq.gz']], 23 | ['SRR4238379', ['https://github.com/nf-core/test-datasets/raw/rnaseq/testdata/SRR4238379_subsamp.fastq.gz']], 24 | ] 25 | // Genome references 26 | fasta = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/genome.fa.gz' 27 | gtf = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/genes.gtf.gz' 28 | gff = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/genes.gff.gz' 29 | transcript_fasta = 'https://github.com/nf-core/test-datasets/raw/rnaseq/reference/transcriptome.fasta.gz' 30 | hisat2_index = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/hisat2.tar.gz' 31 | star_index = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/star.tar.gz' 32 | salmon_index = 'https://github.com/czbiohub/test-datasets/raw/olgabot/subset-chrom-I-gzip/reference/salmon_index.tar.gz' 33 | compressedReference = true 34 | } 35 | -------------------------------------------------------------------------------- /assets/email_template.txt: -------------------------------------------------------------------------------- 1 | 
---------------------------------------------------- 2 | ,--./,-. 3 | ___ __ __ __ ___ /,-._.--~\\ 4 | |\\ | |__ __ / ` / \\ |__) |__ } { 5 | | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-, 6 | `._,._,' 7 | nf-core/rnaseq v${version} 8 | ---------------------------------------------------- 9 | 10 | Run Name: $runName 11 | 12 | <% if (!success){ 13 | out << """#################################################### 14 | ## nf-core/rnaseq execution completed unsuccessfully! ## 15 | #################################################### 16 | The exit status of the task that caused the workflow execution to fail was: $exitStatus. 17 | The full error message was: 18 | 19 | ${errorReport} 20 | """ 21 | } else if(skipped_poor_alignment.size() > 0) { 22 | out << """################################################## 23 | ## nf-core/rnaseq execution completed with warnings ## 24 | ################################################## 25 | The pipeline finished successfully, but the following samples were skipped 26 | due to very low alignment (less than 5%): 27 | 28 | - ${skipped_poor_alignment.join("\n - ")} 29 | """ 30 | } else { 31 | out << "## nf-core/rnaseq execution completed successfully! ##" 32 | } 33 | %> 34 | 35 | The workflow was completed at $dateComplete (duration: $duration) 36 | 37 | The command used to launch the workflow was as follows: 38 | 39 | $commandLine 40 | 41 | Pipeline Configuration: 42 | ----------------------- 43 | <% out << summary.collect{ k,v -> " - $k: $v" }.join("\n") %> 44 | 45 | -- 46 | nf-core/rnaseq 47 | https://github.com/nf-core/rnaseq 48 | -------------------------------------------------------------------------------- /conf/base.config: -------------------------------------------------------------------------------- 1 | /* 2 | * ------------------------------------------------- 3 | * nf-core/rnaseq Nextflow base config file 4 | * ------------------------------------------------- 5 | * A 'blank slate' config file, appropriate for general 6 | * use on most high performance compute environments. 7 | * Assumes that all software is installed and available 8 | * on the PATH. Runs in `local` mode - all jobs will be 9 | * run on the machine you are logged in to. 10 | */ 11 | 12 | process { 13 | 14 | cpus = { check_max( 2, 'cpus' ) } 15 | memory = { check_max( 8.GB * task.attempt, 'memory' ) } 16 | time = { check_max( 4.h * task.attempt, 'time' ) } 17 | 18 | errorStrategy = { task.exitStatus in [143,137,104,134,139] ?
'retry' : 'terminate' } 19 | maxRetries = 1 20 | maxErrors = '-1' 21 | 22 | // Process-specific resource requirements 23 | withLabel: low_memory { 24 | memory = { check_max( 16.GB * task.attempt, 'memory' ) } 25 | } 26 | withLabel: mid_memory { 27 | cpus = { check_max (4, 'cpus')} 28 | memory = { check_max( 28.GB * task.attempt, 'memory' ) } 29 | time = { check_max( 8.h * task.attempt, 'time' ) } 30 | } 31 | withLabel: high_memory { 32 | cpus = { check_max (10, 'cpus')} 33 | memory = { check_max( 70.GB * task.attempt, 'memory' ) } 34 | time = { check_max( 8.h * task.attempt, 'time' ) } 35 | } 36 | 37 | withName: makeHISATindex { 38 | cpus = { check_max( 10, 'cpus' ) } 39 | memory = { check_max( 200.GB * task.attempt, 'memory' ) } 40 | time = { check_max( 5.h * task.attempt, 'time' ) } 41 | } 42 | withName: trim_galore { 43 | time = { check_max( 8.h * task.attempt, 'time' ) } 44 | } 45 | withName: sortmerna { 46 | cpus = { check_max( 16 * task.attempt, 'cpus' ) } 47 | time = { check_max( 24.h * task.attempt, 'time' ) } 48 | maxRetries = 2 49 | } 50 | withName: markDuplicates { 51 | // Actually the -Xmx value should be kept lower, 52 | // and is set through the markdup_java_options 53 | cpus = { check_max( 8, 'cpus' ) } 54 | memory = { check_max( 8.GB * task.attempt, 'memory' ) } 55 | } 56 | withLabel: salmon { 57 | cpus = { check_max( 8, 'cpus' ) } 58 | memory = { check_max( 16.GB * task.attempt, 'memory' ) } 59 | } 60 | withName: 'get_software_versions' { 61 | memory = { check_max( 2.GB * task.attempt, 'memory' ) } 62 | cache = false 63 | } 64 | withName: 'multiqc' { 65 | memory = { check_max( 2.GB * task.attempt, 'memory' ) } 66 | cache = false 67 | } 68 | } 69 | 70 | params { 71 | // Defaults only, expecting to be overwritten 72 | max_memory = 128.GB 73 | max_cpus = 16 74 | max_time = 240.h 75 | igenomes_base = 's3://ngi-igenomes/igenomes/' 76 | } 77 | -------------------------------------------------------------------------------- 
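The `base.config` above defers all sizing to the `check_max` helper, capped by the `params.max_*` defaults at the bottom of the file. Site-specific tuning is normally done by layering an extra config on top with `-c` rather than editing the base file. A minimal sketch (the file name `custom.config` and every value in it are illustrative assumptions, not settings shipped with this repository):

```groovy
// custom.config -- hypothetical user-side overrides, applied with:
//   nextflow run nf-core/rnaseq -profile docker -c custom.config ...
params {
    // Raise the ceilings that check_max() enforces on this cluster
    max_cpus   = 32
    max_memory = 256.GB
    max_time   = 72.h
}

process {
    // Give a single named process more memory than the base config default;
    // check_max() still clips the request at params.max_memory on retry.
    withName: markDuplicates {
        memory = { check_max( 16.GB * task.attempt, 'memory' ) }
    }
}
```

Because Nextflow merges config files in order, anything set here overrides the same setting in `conf/base.config` without touching the pipeline itself.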
/bin/filter_gtf_for_genes_in_genome.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function 3 | import logging 4 | from itertools import groupby 5 | import argparse 6 | 7 | # Create a logger 8 | logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s') 9 | logger = logging.getLogger(__file__) 10 | logger.setLevel(logging.INFO) 11 | 12 | def is_header(line): 13 | return line[0] == '>' 14 | 15 | 16 | def extract_fasta_seq_names(fasta_name): 17 | """ 18 | modified from Brent Pedersen 19 | Correct Way To Parse A Fasta File In Python 20 | given a fasta file. yield tuples of header, sequence 21 | from https://www.biostars.org/p/710/ 22 | """ 23 | # first open the file outside 24 | fh = open(fasta_name) 25 | 26 | # ditch the boolean (x[0]) and just keep the header or sequence since 27 | # we know they alternate. 28 | faiter = (x[1] for x in groupby(fh, is_header)) 29 | 30 | for i, header in enumerate(faiter): 31 | line = next(header) 32 | if is_header(line): 33 | # drop the ">" 34 | headerStr = line[1:].strip().split()[0] 35 | yield headerStr 36 | 37 | 38 | def extract_genes_in_genome(fasta, gtf_in, gtf_out): 39 | seq_names_in_genome = set(extract_fasta_seq_names(fasta)) 40 | logger.info("Extracted chromosome sequence names from : %s" % fasta) 41 | logger.info("All chromosome names: " + ", ".join(sorted(x for x in seq_names_in_genome))) 42 | seq_names_in_gtf = set([]) 43 | 44 | n_total_lines = 0 45 | n_lines_in_genome = 0 46 | with open(gtf_out, 'w') as f: 47 | with open(gtf_in) as g: 48 | 49 | for line in g.readlines(): 50 | n_total_lines += 1 51 | seq_name_gtf = line.split("\t")[0] 52 | seq_names_in_gtf.add(seq_name_gtf) 53 | if seq_name_gtf in seq_names_in_genome: 54 | n_lines_in_genome += 1 55 | f.write(line) 56 | logger.info("Extracted %d / %d lines from %s matching sequences in %s" % 57 | (n_lines_in_genome, n_total_lines, gtf_in, fasta)) 58 | 
logger.info("All sequence IDs from GTF: " + ", ".join(sorted(seq_names_in_gtf))) 59 | 60 | logger.info("Wrote matching lines to %s" % gtf_out) 61 | 62 | 63 | if __name__ == "__main__": 64 | parser = argparse.ArgumentParser(description="""Filter GTF only for features in the genome""") 65 | parser.add_argument("--gtf", type=str, help="GTF file") 66 | parser.add_argument("--fasta", type=str, help="Genome fasta file") 67 | parser.add_argument("-o", "--output", dest='output', 68 | default='genes_in_genome.gtf', 69 | type=str, help="GTF features on fasta genome sequences") 70 | 71 | args = parser.parse_args() 72 | extract_genes_in_genome(args.fasta, args.gtf, args.output) 73 | -------------------------------------------------------------------------------- /bin/mqc_features_stat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import argparse 4 | import logging 5 | import os 6 | 7 | # Create a logger 8 | logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s') 9 | logger = logging.getLogger(__file__) 10 | logger.setLevel(logging.INFO) 11 | 12 | mqc_main = """#id: 'biotype-gs' 13 | #plot_type: 'generalstats' 14 | #pconfig:""" 15 | 16 | mqc_pconf="""# percent_{ft}: 17 | # title: '% {ft}' 18 | # namespace: 'Biotype Counts' 19 | # description: '% reads overlapping {ft} features' 20 | # max: 100 21 | # min: 0 22 | # scale: 'RdYlGn-rev' 23 | # format: '{{:.2f}}%'""" 24 | 25 | def mqc_feature_stat(bfile, features, outfile, sname=None): 26 | 27 | # If sample name not given use file name 28 | if not sname: 29 | sname = os.path.splitext(os.path.basename(bfile))[0] 30 | 31 | # Try to parse and read biocount file 32 | fcounts = {} 33 | try: 34 | with open(bfile, 'r') as bfl: 35 | for ln in bfl: 36 | if ln.startswith('#'): 37 | continue 38 | ft, cn = ln.strip().split('\t') 39 | fcounts[ft] = float(cn) 40 | except Exception: 41 | logger.error("Trouble reading the biocount file {}".format(bfile))
42 | return 43 | 44 | total_count = sum(fcounts.values()) 45 | if total_count == 0: 46 | logger.error("No biocounts found, exiting") 47 | return 48 | 49 | # Calculate percentage for each requested feature 50 | fpercent = {f: (fcounts[f]/total_count)*100 if f in fcounts else 0 for f in features} 51 | if all(f not in fcounts for f in features): 52 | logger.error("None of the given features '{}' were found in the biocount file {}".format(", ".join(features), bfile)) 53 | return 54 | 55 | # Prepare the output strings 56 | out_head, out_value, out_mqc = ("Sample", "'{}'".format(sname), mqc_main) 57 | for ft, pt in fpercent.items(): 58 | out_head = "{}\tpercent_{}".format(out_head, ft) 59 | out_value = "{}\t{}".format(out_value, pt) 60 | out_mqc = "{}\n{}".format(out_mqc, mqc_pconf.format(ft=ft)) 61 | 62 | # Write the output to a file 63 | with open(outfile, 'w') as ofl: 64 | out_final = "\n".join([out_mqc, out_head, out_value]).strip() 65 | ofl.write(out_final + "\n") 66 | 67 | if __name__ == "__main__": 68 | parser = argparse.ArgumentParser(description="""Calculate features percentage for biotype counts""") 69 | parser.add_argument("biocount", type=str, help="File with all biocounts") 70 | parser.add_argument("-f", "--features", dest='features', required=True, nargs='+', help="Features to count") 71 | parser.add_argument("-s", "--sample", dest='sample', type=str, help="Sample Name") 72 | parser.add_argument("-o", "--output", dest='output', default='biocount_percent.tsv', type=str, help="Output file name") 73 | args = parser.parse_args() 74 | mqc_feature_stat(args.biocount, args.features, args.output, args.sample) 75 | -------------------------------------------------------------------------------- /bin/tximport.r: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | args = commandArgs(trailingOnly=TRUE) 4 | if (length(args) < 3) { 5 | stop("Usage: tximport.r <coldata> <salmon_dir> <sample_name>", call.=FALSE) 6 | } 7 | 8 | path = args[2] 9 | coldata = args[1] 10 | 11 |
sample_name = args[3] 12 | 13 | prefix = paste(c(sample_name, "salmon"), collapse="_") 14 | 15 | tx2gene = "tx2gene.csv" 16 | info = file.info(tx2gene) 17 | if (info$size == 0){ 18 | tx2gene = NULL 19 | }else{ 20 | rowdata = read.csv(tx2gene, header = FALSE) 21 | colnames(rowdata) = c("tx", "gene_id", "gene_name") 22 | tx2gene = rowdata[,1:2] 23 | } 24 | 25 | fns = list.files(path, pattern = "quant.sf", recursive = T, full.names = T) 26 | names = basename(dirname(fns)) 27 | names(fns) = names 28 | 29 | if (file.exists(coldata)){ 30 | coldata = read.csv(coldata) 31 | coldata = coldata[match(names, coldata[,1]),] 32 | coldata = cbind(files = fns, coldata) 33 | }else{ 34 | message("ColData not available ", coldata) 35 | coldata = data.frame(files = fns, names = names) 36 | } 37 | 38 | library(SummarizedExperiment) 39 | library(tximport) 40 | 41 | txi = tximport(fns, type = "salmon", txOut = TRUE) 42 | rownames(coldata) = coldata[["names"]] 43 | extra = setdiff(rownames(txi[[1]]), as.character(rowdata[["tx"]])) 44 | if (length(extra) > 0){ 45 | rowdata = rbind(rowdata, 46 | data.frame(tx=extra, 47 | gene_id=extra, 48 | gene_name=extra)) 49 | } 50 | rowdata = rowdata[match(rownames(txi[[1]]), as.character(rowdata[["tx"]])),] 51 | rownames(rowdata) = rowdata[["tx"]] 52 | se = SummarizedExperiment(assays = list(counts = txi[["counts"]], 53 | abundance = txi[["abundance"]], 54 | length = txi[["length"]]), 55 | colData = DataFrame(coldata), 56 | rowData = rowdata) 57 | if (!is.null(tx2gene)){ 58 | gi = summarizeToGene(txi, tx2gene = tx2gene) 59 | growdata = unique(rowdata[,2:3]) 60 | growdata = growdata[match(rownames(gi[[1]]), growdata[["gene_id"]]),] 61 | rownames(growdata) = growdata[["gene_id"]] 62 | gse = SummarizedExperiment(assays = list(counts = gi[["counts"]], 63 | abundance = gi[["abundance"]], 64 | length = gi[["length"]]), 65 | colData = DataFrame(coldata), 66 | rowData = growdata) 67 | } 68 | 69 | if(exists("gse")){ 70 | write.csv(assays(gse)[["abundance"]],
paste(c(prefix, "gene_tpm.csv"), collapse="_"), quote=FALSE) 71 | write.csv(assays(gse)[["counts"]], paste(c(prefix, "gene_counts.csv"), collapse="_"), quote=FALSE) 72 | } 73 | 74 | write.csv(assays(se)[["abundance"]], paste(c(prefix, "transcript_tpm.csv"), collapse="_"), quote=FALSE) 75 | write.csv(assays(se)[["counts"]], paste(c(prefix, "transcript_counts.csv"), collapse="_"), quote=FALSE) 76 | 77 | # Print sessioninfo to standard out 78 | citation("tximport") 79 | sessionInfo() 80 | -------------------------------------------------------------------------------- /.github/CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # nf-core/rnaseq: Contributing Guidelines 2 | 3 | Hi there! Many thanks for taking an interest in improving nf-core/rnaseq. 4 | 5 | We try to manage the required tasks for nf-core/rnaseq using GitHub issues; you probably came to this page when creating one. Please use the pre-filled template to save time. 6 | 7 | However, don't be put off by this template - other more general issues and suggestions are welcome! Contributions to the code are even more welcome ;) 8 | 9 | > If you need help using or modifying nf-core/rnaseq then the best place to ask is on the pipeline channel on [Slack](https://nf-co.re/join/slack/). 10 | 11 | ## Contribution workflow 12 | 13 | If you'd like to write some code for nf-core/rnaseq, the standard workflow 14 | is as follows: 15 | 16 | 1. Check that there isn't already an issue about your idea in the 17 | [nf-core/rnaseq issues](https://github.com/nf-core/rnaseq/issues) to avoid 18 | duplicating work. 19 | * If there isn't one already, please create one so that others know you're working on this 20 | 2. Fork the [nf-core/rnaseq repository](https://github.com/nf-core/rnaseq) to your GitHub account 21 | 3. Make the necessary changes / additions within your forked repository 22 | 4.
Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged. 23 | 24 | If you're not used to this workflow with git, you can start with some [basic docs from GitHub](https://help.github.com/articles/fork-a-repo/) or even their [excellent interactive tutorial](https://try.github.io/). 25 | 26 | ## Tests 27 | 28 | When you create a pull request with changes, [Travis CI](https://travis-ci.org/) will run automatic tests. 29 | Typically, pull requests are only fully reviewed when these tests are passing, though of course we can help out before then. 30 | 31 | There are typically two types of tests that run: 32 | 33 | ### Lint Tests 34 | 35 | nf-core has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to. 36 | To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint <pipeline-directory>` command. 37 | 38 | If any failures or warnings are encountered, please follow the listed URL for more documentation. 39 | 40 | ### Pipeline Tests 41 | 42 | Each nf-core pipeline should be set up with a minimal set of test data. 43 | Travis CI then runs the pipeline on this data to ensure that it exits successfully. 44 | If there are any failures then the automated tests fail. 45 | These tests are run both with the latest available version of Nextflow and also the minimum required version that is stated in the pipeline code. 46 | 47 | ## Getting help 48 | 49 | For further information/help, please consult the [nf-core/rnaseq documentation](https://github.com/nf-core/rnaseq#documentation) and don't hesitate to get in touch on the [nf-core/rnaseq pipeline channel](https://nfcore.slack.com/channels/rnaseq) on [Slack](https://nf-co.re/join/slack/).
50 | -------------------------------------------------------------------------------- /bin/parse_gtf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function 3 | from collections import OrderedDict, defaultdict, Counter 4 | import logging 5 | import argparse 6 | import glob 7 | import os 8 | 9 | # Create a logger 10 | logging.basicConfig(format='%(name)s - %(asctime)s %(levelname)s: %(message)s') 11 | logger = logging.getLogger(__file__) 12 | logger.setLevel(logging.INFO) 13 | 14 | 15 | def read_top_transcript(salmon): 16 | txs = set() 17 | fn = glob.glob(os.path.join(salmon, "*", "quant.sf"))[0] 18 | with open(fn) as inh: 19 | for line in inh: 20 | if line.startswith("Name"): 21 | continue 22 | txs.add(line.split()[0]) 23 | if len(txs) > 100: 24 | break 25 | logger.info("Transcripts found in quantification: %s" % txs) 26 | return txs 27 | 28 | 29 | def tx2gene(gtf, salmon, gene_id, extra, out): 30 | txs = read_top_transcript(salmon) 31 | votes = Counter() 32 | gene_dict = defaultdict(list) 33 | with open(gtf) as inh: 34 | for line in inh: 35 | if line.startswith("#"): 36 | continue 37 | cols = line.split("\t") 38 | attr_dict = OrderedDict() 39 | for gff_item in cols[8].split(";"): 40 | item_pair = gff_item.strip().split(" ") 41 | if len(item_pair) > 1: 42 | value = item_pair[1].strip().replace("\"", "") 43 | if value in txs: 44 | votes[item_pair[0].strip()] += 1 45 | 46 | attr_dict[item_pair[0].strip()] = value 47 | gene_dict[attr_dict[gene_id]].append(attr_dict) 48 | 49 | if not votes: 50 | logger.warning("No attribute in GTF matching transcripts") 51 | return None 52 | 53 | txid = votes.most_common(1)[0][0] 54 | logger.info("Attribute found to be transcript: %s" % txid) 55 | seen = set() 56 | with open(out, 'w') as outh: 57 | for gene in gene_dict: 58 | for row in gene_dict[gene]: 59 | if txid not in row: 60 | continue 61 | if (gene, row[txid]) not in seen: 62 |
seen.add((gene, row[txid])) 63 | if extra not in row: 64 | extra_id = gene 65 | else: 66 | extra_id = row[extra] 67 | print("%s,%s,%s" % (row[txid], gene, extra_id), file=outh) 68 | 69 | 70 | if __name__ == "__main__": 71 | parser = argparse.ArgumentParser(description="""Get tx to gene names for tximport""") 72 | parser.add_argument("--gtf", type=str, help="GTF file") 73 | parser.add_argument("--salmon", type=str, help="output of salmon") 74 | parser.add_argument("--id", type=str, help="gene id in the gtf file") 75 | parser.add_argument("--extra", type=str, help="extra id in the gtf file") 76 | parser.add_argument("-o", "--output", dest='output', default='tx2gene.csv', type=str, help="file with output") 77 | 78 | args = parser.parse_args() 79 | tx2gene(args.gtf, args.salmon, args.id, args.extra, args.output) 80 | -------------------------------------------------------------------------------- /bin/edgeR_heatmap_MDS.r: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # Command line argument processing 4 | args <- commandArgs(trailingOnly=TRUE) 5 | 6 | if (length(args) < 3) { 7 | stop("Usage: edgeR_heatmap_MDS.r <counts_file_1> <counts_file_2> <counts_file_3> (more bam files optional)", call.=FALSE) 8 | } 9 | 10 | # Load / install required packages 11 | if (!require("limma")){ 12 | source("http://bioconductor.org/biocLite.R") 13 | biocLite("limma", suppressUpdates=TRUE) 14 | library("limma") 15 | } 16 | if (!require("edgeR")){ 17 | source("http://bioconductor.org/biocLite.R") 18 | biocLite("edgeR", suppressUpdates=TRUE) 19 | library("edgeR") 20 | } 21 | if (!require("data.table")){ 22 | install.packages("data.table", dependencies=TRUE, repos='http://cloud.r-project.org/') 23 | library("data.table") 24 | } 25 | if (!require("gplots")) { 26 | install.packages("gplots", dependencies=TRUE, repos='http://cloud.r-project.org/') 27 | library("gplots") 28 | } 29 | 30 | # Load count column from all files into a list of data frames 31 | # Use data.table's
fread as it is much faster than read.table 32 | # Row names are GeneIDs 33 | temp <- lapply(lapply(args, fread, skip="Geneid", header=TRUE), function(x){return(as.data.frame(x)[,c(1, ncol(x))])}) 34 | 35 | # Merge into a single data frame 36 | merge.all <- function(x, y) { 37 | merge(x, y, all=TRUE, by="Geneid") 38 | } 39 | data <- data.frame(Reduce(merge.all, temp)) 40 | 41 | # Clean sample name headers 42 | colnames(data) <- gsub("Aligned.sortedByCoord.out.bam", "", colnames(data)) 43 | 44 | # Set GeneID as row name 45 | rownames(data) <- data[,1] 46 | data[,1] <- NULL 47 | 48 | # Convert data frame to edgeR DGE object 49 | dataDGE <- DGEList( counts=data.matrix(data) ) 50 | 51 | # Normalise counts 52 | dataNorm <- calcNormFactors(dataDGE) 53 | 54 | # Make MDS plot 55 | pdf('edgeR_MDS_plot.pdf') 56 | MDSdata <- plotMDS(dataNorm) 57 | dev.off() 58 | 59 | # Print distance matrix to file 60 | write.csv(MDSdata$distance.matrix, 'edgeR_MDS_distance_matrix.csv', quote=FALSE) 61 | 62 | # Print plot x,y co-ordinates to file 63 | MDSxy = MDSdata$cmdscale.out 64 | colnames(MDSxy) = c(paste(MDSdata$axislabel, '1'), paste(MDSdata$axislabel, '2')) 65 | write.csv(MDSxy, 'edgeR_MDS_Aplot_coordinates_mqc.csv', quote=FALSE) 66 | 67 | # Get the log counts per million values 68 | logcpm <- cpm(dataNorm, prior.count=2, log=TRUE) 69 | 70 | # Calculate the Pearson's correlation between samples 71 | # Plot a heatmap of correlations 72 | pdf('log2CPM_sample_correlation_heatmap.pdf') 73 | hmap <- heatmap.2(as.matrix(cor(logcpm, method="pearson")), 74 | key.title="Pearson's Correlation", trace="none", 75 | dendrogram="row", margin=c(9, 9) 76 | ) 77 | dev.off() 78 | 79 | # Write correlation values to file 80 | write.csv(hmap$carpet, 'log2CPM_sample_correlation_mqc.csv', quote=FALSE) 81 | 82 | # Plot the heatmap dendrogram 83 | pdf('log2CPM_sample_distances_dendrogram.pdf') 84 | hmap <- heatmap.2(as.matrix(dist(t(logcpm)))) 85 | plot(hmap$rowDendrogram,
main="Sample Pearson's Correlation Clustering") 86 | dev.off() 87 | 88 | file.create("corr.done") 89 | 90 | # Printing sessioninfo to standard out 91 | print("Sample correlation info:") 92 | sessionInfo() 93 | -------------------------------------------------------------------------------- /assets/email_template.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | nf-core/rnaseq Pipeline Report 10 | 11 | 12 |
13 | <div>
14 | 
15 | <h1>nf-core/rnaseq v${version}</h1>
16 | <h2>Run Name: $runName</h2>
17 | 
18 | <% if (!success){
19 |     out << """
20 |     <div>
21 |         <h4>nf-core/rnaseq execution completed unsuccessfully!</h4>
22 |         <p>The exit status of the task that caused the workflow execution to fail was: <code>$exitStatus</code>.</p>
23 |         <p>The full error message was:</p>
24 |         <pre>${errorReport}</pre>
25 |     </div>
26 |     """
27 | } else if(skipped_poor_alignment.size() > 0) {
28 |     out << """
29 |     <div>
30 |         <h4>nf-core/rnaseq execution completed with warnings!</h4>
31 |         <p>The pipeline finished successfully, but the following samples were skipped due to very low alignment (&lt; 5%):</p>
32 |         <ul>
33 |             <li>${skipped_poor_alignment.join('</li><li>')}</li>
34 |         </ul>
35 |     </div>
36 |     """
37 | } else {
38 |     out << """
39 |     <div>
40 |         nf-core/rnaseq execution completed successfully!
41 |     </div>
42 |     """
43 | }
44 | %>
45 | 
46 | <p>The workflow was completed at $dateComplete (duration: $duration)</p>
47 | 
48 | <p>The command used to launch the workflow was as follows:</p>
49 | <pre>$commandLine</pre>
50 | 
51 | <h3>Pipeline Configuration:</h3>
52 | <table>
53 |     <% out << summary.collect{ k,v -> "<tr><th>$k</th><td>$v</td></tr>" }.join("\n") %>
54 | </table>
55 | 
56 | <p>nf-core/rnaseq</p>
57 | <p><a href="https://github.com/nf-core/rnaseq">https://github.com/nf-core/rnaseq</a></p>
58 | 
59 | </div>
60 | 
61 | 
62 | 
63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation. 6 | 7 | ## Our Standards 8 | 9 | Examples of behavior that contributes to creating a positive environment include: 10 | 11 | * Using welcoming and inclusive language 12 | * Being respectful of differing viewpoints and experiences 13 | * Gracefully accepting constructive criticism 14 | * Focusing on what is best for the community 15 | * Showing empathy towards other community members 16 | 17 | Examples of unacceptable behavior by participants include: 18 | 19 | * The use of sexualized language or imagery and unwelcome sexual attention or advances 20 | * Trolling, insulting/derogatory comments, and personal or political attacks 21 | * Public or private harassment 22 | * Publishing others' private information, such as a physical or electronic address, without explicit permission 23 | * Other conduct which could reasonably be considered inappropriate in a professional setting 24 | 25 | ## Our Responsibilities 26 | 27 | Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. 
28 | 29 | Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. 30 | 31 | ## Scope 32 | 33 | This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. 34 | 35 | ## Enforcement 36 | 37 | Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team on [Slack](https://nf-co.re/join/slack/). The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. 38 | 39 | Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. 
40 | 41 | ## Attribution 42 | 43 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version] 44 | 45 | [homepage]: http://contributor-covenant.org 46 | [version]: http://contributor-covenant.org/version/1/4/ 47 | -------------------------------------------------------------------------------- /bin/scrape_software_versions.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from __future__ import print_function 3 | from collections import OrderedDict 4 | import re 5 | 6 | regexes = { 7 | 'nf-core/rnaseq': ['v_ngi_rnaseq.txt', r"(\S+)"], 8 | 'Nextflow': ['v_nextflow.txt', r"(\S+)"], 9 | 'FastQC': ['v_fastqc.txt', r"FastQC v(\S+)"], 10 | 'Cutadapt': ['v_cutadapt.txt', r"(\S+)"], 11 | 'Trim Galore!': ['v_trim_galore.txt', r"version (\S+)"], 12 | 'SortMeRNA': ['v_sortmerna.txt', r"SortMeRNA version (\S+),"], 13 | 'STAR': ['v_star.txt', r"(\S+)"], 14 | 'HISAT2': ['v_hisat2.txt', r"version (\S+)"], 15 | 'Picard MarkDuplicates': ['v_markduplicates.txt', r"([\d\.]+)-SNAPSHOT"], 16 | 'Samtools': ['v_samtools.txt', r"samtools (\S+)"], 17 | 'featureCounts': ['v_featurecounts.txt', r"featureCounts v(\S+)"], 18 | 'Salmon': ['v_salmon.txt', r"salmon (\S+)"], 19 | 'deepTools': ['v_deeptools.txt', r"bamCoverage (\S+)"], 20 | 'StringTie': ['v_stringtie.txt', r"(\S+)"], 21 | 'Preseq': ['v_preseq.txt', r"Version: (\S+)"], 22 | 'RSeQC': ['v_rseqc.txt', r"read_duplication.py ([\d\.]+)"], 23 | 'Qualimap': ['v_qualimap.txt', r"QualiMap v(\S+)"], 24 | 'dupRadar': ['v_dupRadar.txt', r"(\S+)"], 25 | 'edgeR': ['v_edgeR.txt', r"(\S+)"], 26 | 'MultiQC': ['v_multiqc.txt', r"multiqc, version (\S+)"], 27 | } 28 | results = OrderedDict() 29 | results['nf-core/rnaseq'] = 'N/A' 30 | results['Nextflow'] = 'N/A' 31 | results['FastQC'] = 'N/A' 32 | results['Cutadapt'] = 'N/A' 33 | results['Trim Galore!'] = 'N/A' 34 | results['SortMeRNA'] = 'N/A' 
35 | results['STAR'] = False 36 | results['HISAT2'] = False 37 | results['Picard MarkDuplicates'] = 'N/A' 38 | results['Samtools'] = 'N/A' 39 | results['featureCounts'] = 'N/A' 40 | results['Salmon'] = 'N/A' 41 | results['StringTie'] = 'N/A' 42 | results['Preseq'] = 'N/A' 43 | results['deepTools'] = 'N/A' 44 | results['RSeQC'] = 'N/A' 45 | results['dupRadar'] = 'N/A' 46 | results['edgeR'] = 'N/A' 47 | results['Qualimap'] = 'N/A' 48 | results['MultiQC'] = 'N/A' 49 | 50 | # Search each file using its regex 51 | for k, v in regexes.items(): 52 | try: 53 | with open(v[0]) as x: 54 | versions = x.read() 55 | match = re.search(v[1], versions) 56 | if match: 57 | results[k] = "v{}".format(match.group(1)) 58 | except IOError: 59 | results[k] = False 60 | 61 | # Strip out the aligner that was not used (STAR or HISAT2, whichever is still False) 62 | for k in list(results):  # iterate over a copy so entries can be deleted without a RuntimeError 63 | if not results[k]: 64 | del results[k] 65 | 66 | # Dump to YAML 67 | print (''' 68 | id: 'software_versions' 69 | section_name: 'nf-core/rnaseq Software Versions' 70 | section_href: 'https://github.com/nf-core/rnaseq' 71 | plot_type: 'html' 72 | description: 'are collected at run time from the software output.' 73 | data: | 74 |     <dl class="dl-horizontal">
75 | ''') 76 | for k,v in results.items(): 77 |     print("        <dt>{}</dt><dd><samp>{}</samp></dd>".format(k,v)) 78 | print ("    </dl>
") 79 | 80 | # Write out regexes as csv file: 81 | with open('software_versions.csv', 'w') as f: 82 | for k,v in results.items(): 83 | f.write("{}\t{}\n".format(k,v)) 84 | -------------------------------------------------------------------------------- /bin/gtf2bed: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl 2 | 3 | # Copyright (c) 2011 Erik Aronesty (erik@q32.com) 4 | # 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy 6 | # of this software and associated documentation files (the "Software"), to deal 7 | # in the Software without restriction, including without limitation the rights 8 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | # copies of the Software, and to permit persons to whom the Software is 10 | # furnished to do so, subject to the following conditions: 11 | # 12 | # The above copyright notice and this permission notice shall be included in 13 | # all copies or substantial portions of the Software. 14 | # 15 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | # THE SOFTWARE. 22 | # 23 | # ALSO, IT WOULD BE NICE IF YOU LET ME KNOW YOU USED IT. 24 | 25 | use Getopt::Long; 26 | 27 | my $extended; 28 | GetOptions("x"=>\$extended); 29 | 30 | $in = shift @ARGV; 31 | 32 | my $in_cmd =($in =~ /\.gz$/ ? "gunzip -c $in|" : $in =~ /\.zip$/ ? 
"unzip -p $in|" : "$in") || die "Can't open $in: $!\n"; 33 | open IN, $in_cmd; 34 | 35 | while (<IN>) { 36 | $gff = 2 if /^##gff-version 2/; 37 | $gff = 3 if /^##gff-version 3/; 38 | next if /^#/ && $gff; 39 | 40 | s/\s+$//; 41 | # 0-chr 1-src 2-feat 3-beg 4-end 5-scor 6-dir 7-fram 8-attr 42 | my @f = split /\t/; 43 | if ($gff) { 44 | # most ver 2's stick gene names in the id field 45 | ($id) = $f[8]=~ /\bID="([^"]+)"/; 46 | # most ver 3's stick unquoted names in the name field 47 | ($id) = $f[8]=~ /\bName=([^";]+)/ if !$id && $gff == 3; 48 | } else { 49 | ($id) = $f[8]=~ /transcript_id "([^"]+)"/; 50 | } 51 | 52 | next unless $id && $f[0]; 53 | 54 | if ($f[2] eq 'exon') { 55 | die "no position at exon on line $." if ! $f[3]; 56 | # gff3 puts :\d in exons sometimes 57 | $id =~ s/:\d+$// if $gff == 3; 58 | push @{$exons{$id}}, \@f; 59 | # save lowest start 60 | $trans{$id} = \@f if !$trans{$id}; 61 | } elsif ($f[2] eq 'start_codon') { 62 | #optional, output codon start/stop as "thick" region in bed 63 | $sc{$id}->[0] = $f[3]; 64 | } elsif ($f[2] eq 'stop_codon') { 65 | $sc{$id}->[1] = $f[4]; 66 | } elsif ($f[2] eq 'miRNA' ) { 67 | $trans{$id} = \@f if !$trans{$id}; 68 | push @{$exons{$id}}, \@f; 69 | } 70 | } 71 | 72 | for $id ( 73 | # sort by chr then pos 74 | sort { 75 | $trans{$a}->[0] eq $trans{$b}->[0] ?
76 | $trans{$a}->[3] <=> $trans{$b}->[3] : 77 | $trans{$a}->[0] cmp $trans{$b}->[0] 78 | } (keys(%trans)) ) { 79 | my ($chr, undef, undef, undef, undef, undef, $dir, undef, $attr, undef, $cds, $cde) = @{$trans{$id}}; 80 | my ($cds, $cde); 81 | ($cds, $cde) = @{$sc{$id}} if $sc{$id}; 82 | 83 | # sort by pos 84 | my @ex = sort { 85 | $a->[3] <=> $b->[3] 86 | } @{$exons{$id}}; 87 | 88 | my $beg = $ex[0][3]; 89 | my $end = $ex[-1][4]; 90 | 91 | if ($dir eq '-') { 92 | # swap 93 | $tmp=$cds; 94 | $cds=$cde; 95 | $cde=$tmp; 96 | $cds -= 2 if $cds; 97 | $cde += 2 if $cde; 98 | } 99 | 100 | # not specified, just use exons 101 | $cds = $beg if !$cds; 102 | $cde = $end if !$cde; 103 | 104 | # adjust start for bed 105 | --$beg; --$cds; 106 | 107 | my $exn = @ex; # exon count 108 | my $exst = join ",", map {$_->[3]-$beg-1} @ex; # exon start 109 | my $exsz = join ",", map {$_->[4]-$_->[3]+1} @ex; # exon size 110 | 111 | my $gene_id; 112 | my $extend = ""; 113 | if ($extended) { 114 | ($gene_id) = $attr =~ /gene_name "([^"]+)"/; 115 | ($gene_id) = $attr =~ /gene_id "([^"]+)"/ unless $gene_id; 116 | $extend="\t$gene_id"; 117 | } 118 | # added an extra comma to make it look exactly like ucsc's beds 119 | print "$chr\t$beg\t$end\t$id\t0\t$dir\t$cds\t$cde\t0\t$exn\t$exsz,\t$exst,$extend\n"; 120 | } 121 | 122 | 123 | close IN; 124 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ![nf-core/rnaseq](docs/images/nf-core-rnaseq_logo.png) 2 | 3 | [![Build Status](https://travis-ci.org/nf-core/rnaseq.svg?branch=master)](https://travis-ci.org/nf-core/rnaseq) 4 | [![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A519.04.0-brightgreen.svg)](https://www.nextflow.io/) 5 | [![DOI](https://zenodo.org/badge/127293091.svg)](https://zenodo.org/badge/latestdoi/127293091) 6 | 7 | [![install with 
bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg)](http://bioconda.github.io/) 8 | [![Docker](https://img.shields.io/docker/automated/nfcore/rnaseq.svg)](https://hub.docker.com/r/nfcore/rnaseq/) 9 | 10 | ### Introduction 11 | 12 | **nf-core/rnaseq** is a bioinformatics analysis pipeline used for RNA sequencing data. 13 | 14 | The workflow processes raw data from 15 | FastQ inputs ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), 16 | [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)), 17 | aligns the reads 18 | ([STAR](https://github.com/alexdobin/STAR) or 19 | [HiSAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)), 20 | generates counts relative to genes 21 | ([featureCounts](http://bioinf.wehi.edu.au/featureCounts/), 22 | [StringTie](https://ccb.jhu.edu/software/stringtie/)) or transcripts 23 | ([Salmon](https://combine-lab.github.io/salmon/), 24 | [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html)) and performs extensive quality-control on the results 25 | ([RSeQC](http://rseqc.sourceforge.net/), 26 | [Qualimap](http://qualimap.bioinfo.cipf.es/), 27 | [dupRadar](https://bioconductor.org/packages/release/bioc/html/dupRadar.html), 28 | [Preseq](http://smithlabresearch.org/software/preseq/), 29 | [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html), 30 | [MultiQC](http://multiqc.info/)). See the [output documentation](docs/output.md) for more details of the results. 31 | 32 | The pipeline is built using [Nextflow](https://www.nextflow.io), a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with docker containers making installation trivial and results highly reproducible. 33 | 34 | ## Quick Start 35 | 36 | i. Install [`nextflow`](https://nf-co.re/usage/installation) 37 | 38 | ii. 
Install one of [`docker`](https://docs.docker.com/engine/installation/), [`singularity`](https://www.sylabs.io/guides/3.0/user-guide/) or [`conda`](https://conda.io/miniconda.html) 39 | 40 | iii. Download the pipeline and test it on a minimal dataset with a single command 41 | 42 | ```bash 43 | nextflow run nf-core/rnaseq -profile test,<docker/singularity/conda/institute> 44 | ``` 45 | 46 | iv. Start running your own analysis! 47 | 48 | ```bash 49 | nextflow run nf-core/rnaseq -profile <docker/singularity/conda/institute> --reads '*_R{1,2}.fastq.gz' --genome GRCh37 50 | ``` 51 | 52 | See [usage docs](docs/usage.md) for all of the available options when running the pipeline. 53 | 54 | ### Documentation 55 | 56 | The nf-core/rnaseq pipeline comes with documentation about the pipeline, found in the `docs/` directory: 57 | 58 | 1. [Installation](https://nf-co.re/usage/installation) 59 | 2. Pipeline configuration 60 | * [Local installation](https://nf-co.re/usage/local_installation) 61 | * [Adding your own system config](https://nf-co.re/usage/adding_own_config) 62 | * [Reference genomes](https://nf-co.re/usage/reference_genomes) 63 | 3. [Running the pipeline](docs/usage.md) 64 | 4. [Output and how to interpret the results](docs/output.md) 65 | 5. [Troubleshooting](https://nf-co.re/usage/troubleshooting) 66 | 67 | ### Credits 68 | 69 | These scripts were originally written for use at the [National Genomics Infrastructure](https://portal.scilifelab.se/genomics/), part of [SciLifeLab](http://www.scilifelab.se/) in Stockholm, Sweden, by Phil Ewels ([@ewels](https://github.com/ewels)) and Rickard Hammarén ([@Hammarn](https://github.com/Hammarn)).
70 | 71 | Many thanks to others who have helped out along the way too, including (but not limited to): 72 | [@Galithil](https://github.com/Galithil), 73 | [@pditommaso](https://github.com/pditommaso), 74 | [@orzechoj](https://github.com/orzechoj), 75 | [@apeltzer](https://github.com/apeltzer), 76 | [@colindaven](https://github.com/colindaven), 77 | [@lpantano](https://github.com/lpantano), 78 | [@olgabot](https://github.com/olgabot), 79 | [@jburos](https://github.com/jburos), 80 | [@drpatelh](https://github.com/drpatelh). 81 | 82 | ## Contributions and Support 83 | 84 | If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md). 85 | 86 | For further information or help, don't hesitate to get in touch on [Slack](https://nfcore.slack.com/channels/rnaseq) (you can join with [this invite](https://nf-co.re/join/slack)). 87 | 88 | ## Citation 89 | 90 | If you use nf-core/rnaseq for your analysis, please cite it using the following doi: [10.5281/zenodo.1400710](https://doi.org/10.5281/zenodo.1400710) 91 | 92 | You can cite the `nf-core` pre-print as follows: 93 | 94 | > Ewels PA, Peltzer A, Fillinger S, Alneberg JA, Patel H, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. **nf-core: Community curated bioinformatics pipelines**. *bioRxiv*. 2019. p. 610741. [doi: 10.1101/610741](https://www.biorxiv.org/content/10.1101/610741v1). 95 | -------------------------------------------------------------------------------- /nextflow.config: -------------------------------------------------------------------------------- 1 | /* 2 | * ------------------------------------------------- 3 | * nf-core/rnaseq Nextflow config file 4 | * ------------------------------------------------- 5 | * Default config options for all environments.
6 | */ 7 | 8 | // Global default params, used in configs 9 | params { 10 | 11 | // Pipeline Options 12 | // Workflow flags 13 | genome = false 14 | reads = "data/*{1,2}.fastq.gz" 15 | singleEnd = false 16 | 17 | // References 18 | genome = false 19 | salmon_index = false 20 | transcript_fasta = false 21 | splicesites = false 22 | saveReference = false 23 | gencode = false 24 | compressedReference = false 25 | 26 | // Strandedness 27 | forwardStranded = false 28 | reverseStranded = false 29 | unStranded = false 30 | 31 | // Trimming 32 | skipTrimming = false 33 | clip_r1 = 0 34 | clip_r2 = 0 35 | three_prime_clip_r1 = 0 36 | three_prime_clip_r2 = 0 37 | trim_nextseq = 0 38 | pico = false 39 | saveTrimmed = false 40 | 41 | // Ribosomal RNA removal 42 | removeRiboRNA = false 43 | save_nonrRNA_reads = false 44 | rRNA_database_manifest = false 45 | 46 | // Alignment 47 | aligner = 'star' 48 | pseudo_aligner = false 49 | stringTieIgnoreGTF = false 50 | seq_center = false 51 | saveAlignedIntermediates = false 52 | skipAlignment = false 53 | saveUnaligned = false 54 | 55 | // Read Counting 56 | fc_extra_attributes = 'gene_name' 57 | fc_group_features = 'gene_id' 58 | fc_count_type = 'exon' 59 | fc_group_features_type = 'gene_biotype' 60 | sampleLevel = false 61 | skipBiotypeQC = false 62 | 63 | // QC 64 | skipQC = false 65 | skipFastQC = false 66 | skipPreseq = false 67 | skipDupRadar = false 68 | skipQualimap = false 69 | skipRseQC = false 70 | skipEdgeR = false 71 | skipMultiQC = false 72 | 73 | // Defaults 74 | project = false 75 | markdup_java_options = '"-Xms4000m -Xmx7g"' //Established values for markDuplicate memory consumption, see issue PR #689 (in Sarek) for details 76 | hisat_build_memory = 200 // Required amount of memory in GB to build HISAT2 index with splice sites 77 | readPaths = null 78 | star_memory = false // Cluster specific param required for hebbe 79 | rRNA_database_manifest = "$baseDir/assets/rrna-db-defaults.txt" 80 | 81 | // Boilerplate options 82 
| clusterOptions = false 83 | outdir = './results' 84 | name = false 85 | multiqc_config = "$baseDir/assets/multiqc_config.yaml" 86 | email = false 87 | email_on_fail = false 88 | max_multiqc_email_size = 25.MB 89 | plaintext_email = false 90 | monochrome_logs = false 91 | help = false 92 | igenomes_base = "./iGenomes" 93 | tracedir = "${params.outdir}/pipeline_info" 94 | awsqueue = false 95 | awsregion = 'eu-west-1' 96 | igenomesIgnore = false 97 | custom_config_version = 'master' 98 | custom_config_base = "https://raw.githubusercontent.com/nf-core/configs/${params.custom_config_version}" 99 | hostnames = false 100 | config_profile_description = false 101 | config_profile_contact = false 102 | config_profile_url = false 103 | } 104 | 105 | // Container slug. Stable releases should specify release tag! 106 | // Developmental code should specify :dev 107 | process.container = 'nfcore/rnaseq:1.4.2' 108 | 109 | // Load base.config by default for all pipelines 110 | includeConfig 'conf/base.config' 111 | 112 | // Load nf-core custom profiles from different Institutions 113 | try { 114 | includeConfig "${params.custom_config_base}/nfcore_custom.config" 115 | } catch (Exception e) { 116 | System.err.println("WARNING: Could not load nf-core/config profiles: ${params.custom_config_base}/nfcore_custom.config") 117 | } 118 | 119 | profiles { 120 | awsbatch { includeConfig 'conf/awsbatch.config' } 121 | conda { process.conda = "$baseDir/environment.yml" } 122 | debug { process.beforeScript = 'echo $HOSTNAME' } 123 | docker { docker.enabled = true } 124 | singularity { singularity.enabled = true 125 | singularity.autoMounts = true } 126 | test { includeConfig 'conf/test.config' } 127 | test_gz { includeConfig 'conf/test_gz.config' } 128 | } 129 | 130 | // Avoid this error: 131 | // WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap. 
132 | // Testing this in nf-core after discussion here https://github.com/nf-core/tools/pull/351, once this is established and works well, nextflow might implement this behavior as new default. 133 | docker.runOptions = '-u \$(id -u):\$(id -g)' 134 | 135 | // Load igenomes.config if required 136 | if (!params.igenomesIgnore) { 137 | includeConfig 'conf/igenomes.config' 138 | } 139 | 140 | // Capture exit codes from upstream processes when piping 141 | process.shell = ['/bin/bash', '-euo', 'pipefail'] 142 | 143 | timeline { 144 | enabled = true 145 | file = "${params.tracedir}/execution_timeline.html" 146 | } 147 | report { 148 | enabled = true 149 | file = "${params.tracedir}/execution_report.html" 150 | } 151 | trace { 152 | enabled = true 153 | file = "${params.tracedir}/execution_trace.txt" 154 | } 155 | dag { 156 | enabled = true 157 | file = "${params.tracedir}/pipeline_dag.svg" 158 | } 159 | 160 | manifest { 161 | name = 'nf-core/rnaseq' 162 | author = 'Phil Ewels, Rickard Hammarén' 163 | homePage = 'https://github.com/nf-core/rnaseq' 164 | description = 'Nextflow RNA-Seq analysis pipeline, part of the nf-core community.' 165 | mainScript = 'main.nf' 166 | nextflowVersion = '>=19.04.0' 167 | version = '1.4.2' 168 | } 169 | 170 | // Function to ensure that resource requirements don't go beyond 171 | // a maximum limit 172 | def check_max(obj, type) { 173 | if (type == 'memory') { 174 | try { 175 | if (obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1) 176 | return params.max_memory as nextflow.util.MemoryUnit 177 | else 178 | return obj 179 | } catch (all) { 180 | println " ### ERROR ### Max memory '${params.max_memory}' is not valid! 
Using default value: $obj" 181 | return obj 182 | } 183 | } else if (type == 'time') { 184 | try { 185 | if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1) 186 | return params.max_time as nextflow.util.Duration 187 | else 188 | return obj 189 | } catch (all) { 190 | println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj" 191 | return obj 192 | } 193 | } else if (type == 'cpus') { 194 | try { 195 | return Math.min( obj, params.max_cpus as int ) 196 | } catch (all) { 197 | println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj" 198 | return obj 199 | } 200 | } 201 | } 202 | -------------------------------------------------------------------------------- /bin/dupRadar.r: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # Command line argument processing 4 | args = commandArgs(trailingOnly=TRUE) 5 | if (length(args) < 5) { 6 | stop("Usage: dupRadar.r <input.bam> <annotation.gtf> <strandDirection:0(unstranded)/1(forward)/2(reverse)> <paired/single> <nbThreads> <R-package-location (optional)>", call.=FALSE) 7 | } 8 | input_bam <- args[1] 9 | annotation_gtf <- args[2] 10 | stranded <- as.numeric(args[3]) 11 | paired_end <- if(args[4]=='paired') TRUE else FALSE 12 | threads <- as.numeric(args[5]) 13 | 14 | bamRegex <- "(.+)\\.bam$" 15 | 16 | if(!(grepl(bamRegex, input_bam) && file.exists(input_bam) && (!file.info(input_bam)$isdir))) stop("First argument '<input.bam>' must be an existing file (not a directory) with '.bam' extension...") 17 | if(!(file.exists(annotation_gtf) && (!file.info(annotation_gtf)$isdir))) stop("Second argument '<annotation.gtf>' must be an existing file (and not a directory)...") 18 | if(is.na(stranded) || (!(stranded %in% (0:2)))) stop("Third argument must be a numeric value in 0(unstranded)/1(forward)/2(reverse)...") 19 | if(is.na(threads) || (threads<=0)) stop("Fifth argument must be a strictly positive numeric value...") 20 | 21 | # Remove bam file extension to generate basename 22 | input_bam_basename <- gsub(bamRegex, "\\1", input_bam) 23 |
input_bam_basename <- gsub("_subsamp.*", "", input_bam_basename) 24 | input_bam_basename <- gsub("\\.sorted.*", "", input_bam_basename) 25 | input_bam_basename <- gsub("Aligned.*", "", input_bam_basename) 26 | 27 | # Debug messages (stderr) 28 | message("Input bam (Arg 1): ", input_bam) 29 | message("Input gtf (Arg 2): ", annotation_gtf) 30 | message("Strandness (Arg 3): ", c("unstranded", "forward", "reverse")[stranded+1]) 31 | message("paired/single (Arg 4): ", ifelse(paired_end, 'paired', 'single')) 32 | message("Nb threads (Arg 5): ", threads) 33 | message("R package loc. (Arg 6): ", ifelse(length(args) > 5, args[6], "Not specified")) 34 | message("Output basename : ", input_bam_basename) 35 | 36 | 37 | # Load / install packages 38 | if (length(args) > 5) { .libPaths( c( args[6], .libPaths() ) ) } 39 | if (!require("dupRadar")){ 40 | source("http://bioconductor.org/biocLite.R") 41 | biocLite("dupRadar", suppressUpdates=TRUE) 42 | library("dupRadar") 43 | } 44 | if (!require("parallel")) { 45 | install.packages("parallel", dependencies=TRUE, repos='http://cloud.r-project.org/') 46 | library("parallel") 47 | } 48 | 49 | # Duplicate stats 50 | dm <- analyzeDuprates(input_bam, annotation_gtf, stranded, paired_end, threads) 51 | write.table(dm, file=paste(input_bam_basename, "_dupMatrix.txt", sep=""), quote=F, row.name=F, sep="\t") 52 | 53 | # 2D density scatter plot 54 | pdf(paste0(input_bam_basename, "_duprateExpDens.pdf")) 55 | duprateExpDensPlot(DupMat=dm) 56 | title("Density scatter plot") 57 | mtext(input_bam_basename, side=3) 58 | dev.off() 59 | fit <- duprateExpFit(DupMat=dm) 60 | cat( 61 | paste("- dupRadar Int (duprate at low read counts):", fit$intercept), 62 | paste("- dupRadar Sl (progression of the duplication rate):", fit$slope), 63 | fill=TRUE, labels=input_bam_basename, 64 | file=paste0(input_bam_basename, "_intercept_slope.txt"), append=FALSE 65 | ) 66 | 67 | # Create a multiqc file dupInt 68 | sample_name <-
gsub("Aligned.sortedByCoord.out.markDups", "", input_bam_basename) 69 | line="#id: DupInt 70 | #plot_type: 'generalstats' 71 | #pconfig: 72 | # dupRadar_intercept: 73 | # title: 'dupInt' 74 | # namespace: 'DupRadar' 75 | # description: 'Intercept value from DupRadar' 76 | # max: 100 77 | # min: 0 78 | # scale: 'RdYlGn-rev' 79 | # format: '{:.2f}%' 80 | Sample dupRadar_intercept" 81 | 82 | write(line,file=paste0(input_bam_basename, "_dup_intercept_mqc.txt"),append=TRUE) 83 | write(paste(sample_name, fit$intercept),file=paste0(input_bam_basename, "_dup_intercept_mqc.txt"),append=TRUE) 84 | 85 | # Get numbers from dupRadar GLM 86 | curve_x <- sort(log10(dm$RPK)) 87 | curve_y = 100*predict(fit$glm, data.frame(x=curve_x), type="response") 88 | # Remove all of the infinite values 89 | infs = which(curve_x %in% c(-Inf,Inf)) 90 | curve_x = curve_x[-infs] 91 | curve_y = curve_y[-infs] 92 | # Reduce number of data points 93 | curve_x <- curve_x[seq(1, length(curve_x), 10)] 94 | curve_y <- curve_y[seq(1, length(curve_y), 10)] 95 | # Convert x values back to real counts 96 | curve_x = 10^curve_x 97 | # Write to file 98 | line="#id: DupRadar 99 | #section_name: 'DupRadar' 100 | #section_href: 'bioconductor.org/packages/release/bioc/html/dupRadar.html' 101 | #description: \"provides duplication rate quality control for RNA-Seq datasets. Highly expressed genes can be expected to have a lot of duplicate reads, but high numbers of duplicates at low read counts can indicate low library complexity with technical duplication. 102 | # This plot shows the general linear models - a summary of the gene duplication distributions. 
\" 103 | #pconfig: 104 | # title: 'DupRadar General Linear Model' 105 | # xLog: True 106 | # xlab: 'expression (reads/kbp)' 107 | # ylab: '% duplicate reads' 108 | # ymax: 100 109 | # ymin: 0 110 | # tt_label: '{point.x:.1f} reads/kbp: {point.y:,.2f}% duplicates' 111 | # xPlotLines: 112 | # - color: 'green' 113 | # dashStyle: 'LongDash' 114 | # label: 115 | # style: {color: 'green'} 116 | # text: '0.5 RPKM' 117 | # verticalAlign: 'bottom' 118 | # y: -65 119 | # value: 0.5 120 | # width: 1 121 | # - color: 'red' 122 | # dashStyle: 'LongDash' 123 | # label: 124 | # style: {color: 'red'} 125 | # text: '1 read/bp' 126 | # verticalAlign: 'bottom' 127 | # y: -65 128 | # value: 1000 129 | # width: 1" 130 | 131 | write(line,file=paste0(input_bam_basename, "_duprateExpDensCurve_mqc.txt"),append=TRUE) 132 | write.table( 133 | cbind(curve_x, curve_y), 134 | file=paste0(input_bam_basename, "_duprateExpDensCurve_mqc.txt"), 135 | quote=FALSE, row.names=FALSE, col.names=FALSE, append=TRUE, 136 | ) 137 | 138 | # Distribution of expression box plot 139 | pdf(paste0(input_bam_basename, "_duprateExpBoxplot.pdf")) 140 | duprateExpBoxplot(DupMat=dm) 141 | title("Percent Duplication by Expression") 142 | mtext(input_bam_basename, side=3) 143 | dev.off() 144 | 145 | # Distribution of RPK values per gene 146 | pdf(paste0(input_bam_basename, "_expressionHist.pdf")) 147 | expressionHist(DupMat=dm) 148 | title("Distribution of RPK values per gene") 149 | mtext(input_bam_basename, side=3) 150 | dev.off() 151 | 152 | # Print sessioninfo to standard out 153 | print(input_bam_basename) 154 | citation("dupRadar") 155 | sessionInfo() 156 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # nf-core/rnaseq: Changelog 2 | 3 | ## Version 1.4.2 4 | 5 | * Minor version release for keeping Git History in sync 6 | * No changes with respect to 1.4.1 on pipeline level 7 | 8 | 
## Version 1.4.1 9 | 10 | Major novel changes include: 11 | 12 | * Update `igenomes.config` with NCBI `GRCh38` and most recent UCSC genomes 13 | * Set `autoMounts = true` by default for `singularity` profile 14 | 15 | ### Pipeline enhancements & fixes 16 | 17 | * Fixed parameter warnings [#316](https://github.com/nf-core/rnaseq/issues/316) and [318](https://github.com/nf-core/rnaseq/issues/318) 18 | * Fixed [#307](https://github.com/nf-core/rnaseq/issues/307) - Confusing Info Printout about GFF and GTF 19 | 20 | ## Version 1.4 21 | 22 | Major novel changes include: 23 | 24 | * Support for Salmon as an alternative method to STAR and HISAT2 25 | * Several improvements in `featureCounts` handling of types other than `exon`. It is possible now to handle nuclearRNAseq data. Nuclear RNA has un-spliced RNA, and the whole transcript, including the introns, needs to be counted, e.g. by specifying `--fc_count_type transcript`. 26 | * Support for [outputting unaligned data](https://github.com/nf-core/rnaseq/issues/277) to results folders. 
27 | * Added options to skip several steps 28 | 29 | * Skip trimming using `--skipTrimming` 30 | * Skip BiotypeQC using `--skipBiotypeQC` 31 | * Skip Alignment using `--skipAlignment` to only use pseudo-alignment using Salmon 32 | 33 | ### Documentation updates 34 | 35 | * Adjust wording of skipped samples [in pipeline output](https://github.com/nf-core/rnaseq/issues/290) 36 | * Fixed link to guidelines [#203](https://github.com/nf-core/rnaseq/issues/203) 37 | * Add `Citation` and `Quick Start` section to `README.md` 38 | * Add in documentation of the `--gff` parameter 39 | 40 | ### Reporting Updates 41 | 42 | * Generate MultiQC plots in the results directory [#200](https://github.com/nf-core/rnaseq/issues/200) 43 | * Get MultiQC to save plots as [standalone files](https://github.com/nf-core/rnaseq/issues/183) 44 | * Get MultiQC to write out the software versions in a `.csv` file [#185](https://github.com/nf-core/rnaseq/issues/185) 45 | * Use `file` instead of `new File` to create `pipeline_report.{html,txt}` files, and properly create subfolders 46 | 47 | ### Pipeline enhancements & fixes 48 | 49 | * Restore `SummarizedExperiment` object creation in the `salmon_merge` process, avoiding memory usage that increases with sample size.
50 | * Fix sample names in featureCounts and dupRadar to remove suffixes added in other processes 51 | * Removed `genebody_coverage` process [#195](https://github.com/nf-core/rnaseq/issues/195) 52 | * Implemented Pearson's correlation instead of Euclidean distance [#146](https://github.com/nf-core/rnaseq/issues/146) 53 | * Add `--stringTieIgnoreGTF` parameter [#206](https://github.com/nf-core/rnaseq/issues/206) 54 | * Removed unused `stringtie` channels for `MultiQC` 55 | * Integrate changes in `nf-core/tools v1.6` template which resolved [#90](https://github.com/nf-core/rnaseq/issues/90) 56 | * Moved process `convertGFFtoGTF` before `makeSTARindex` [#215](https://github.com/nf-core/rnaseq/issues/215) 57 | * Change all boolean parameters from `snake_case` to `camelCase` and vice versa for value parameters 58 | * Add SM ReadGroup info for QualiMap compatibility [#238](https://github.com/nf-core/rnaseq/issues/238) 59 | * Obtain edgeR + dupRadar version information [#198](https://github.com/nf-core/rnaseq/issues/198) and [#112](https://github.com/nf-core/rnaseq/issues/112) 60 | * Add `--gencode` option for compatibility of Salmon and featureCounts biotypes with GENCODE gene annotations 61 | * Added functionality to accept compressed reference data in the pipeline 62 | * Check that gtf features are on chromosomes that exist in the genome fasta file [#274](https://github.com/nf-core/rnaseq/pull/274) 63 | * Maintain all gff features upon gtf conversion (keeps `gene_biotype` or `gene_type` to make `featureCounts` happy) 64 | * Add SortMeRNA as an optional step to allow rRNA removal [#280](https://github.com/nf-core/rnaseq/issues/280) 65 | * Minimal adjustment of memory and CPU constraints for clusters with locked memory / CPU relation 66 | * Cleaned up usage, `parameters.settings.json` and the `nextflow.config` 67 | 68 | ### Dependency Updates 69 | 70 | * Dependency list is now sorted appropriately 71 | * Force matplotlib=3.0.3 72 | 73 | #### Updated Packages 74 | 75 |
* Picard 2.20.0 -> 2.21.1 76 | * bioconductor-dupradar 1.12.1 -> 1.14.0 77 | * bioconductor-edger 3.24.3 -> 3.26.5 78 | * gffread 0.9.12 -> 0.11.4 79 | * trim-galore 0.6.1 -> 0.6.4 80 | 81 | * rseqc 3.0.0 -> 3.0.1 82 | * R-Base 3.5 -> 3.6.1 83 | 84 | #### Added / Removed Packages 85 | 86 | * Dropped CSVtk in favor of Unix's simple `cut` and `paste` utilities 87 | * Added Salmon 0.14.2 88 | * Added TXIMeta 1.2.2 89 | * Added SummarizedExperiment 1.14.0 90 | * Added SortMeRNA 2.1b 91 | * Add tximport and summarizedexperiment dependency [#171](https://github.com/nf-core/rnaseq/issues/171) 92 | * Add Qualimap dependency [#202](https://github.com/nf-core/rnaseq/issues/202) 93 | 94 | ## [Version 1.3](https://github.com/nf-core/rnaseq/releases/tag/1.3) - 2019-03-26 95 | 96 | ### Pipeline Updates 97 | 98 | * Added configurable options to specify group attributes for featureCounts [#144](https://github.com/nf-core/rnaseq/issues/144) 99 | * Added support for RSeQC 3.0 [#148](https://github.com/nf-core/rnaseq/issues/148) 100 | * Added a `parameters.settings.json` file for use with the new `nf-core launch` helper tool.
101 | * Centralized all configuration profiles using [nf-core/configs](https://github.com/nf-core/configs) 102 | * Fixed all centralized configs [for offline usage](https://github.com/nf-core/rnaseq/issues/163) 103 | * Hide %dup in [multiqc report](https://github.com/nf-core/rnaseq/issues/150) 104 | * Add option for Trimming NextSeq data properly ([@jburos work](https://github.com/jburos)) 105 | 106 | ### Bug fixes 107 | 108 | * Fixing HISAT2 Index Building for large reference genomes [#153](https://github.com/nf-core/rnaseq/issues/153) 109 | * Fixing HISAT2 BAM sorting using more memory than available on the system 110 | * Fixing MarkDuplicates memory consumption issues following [#179](https://github.com/nf-core/rnaseq/pull/179) 111 | * Use `file` instead of `new File` to create the `pipeline_report.{html,txt}` files to avoid creating local directories when outputting to AWS S3 folders 112 | 113 | ### Dependency Updates 114 | 115 | * RSeQC 2.6.4 -> 3.0.0 116 | * Picard 2.18.15 -> 2.20.0 117 | * r-data.table 1.11.4 -> 1.12.2 118 | * bioconductor-edger 3.24.1 -> 3.24.3 119 | * r-markdown 0.8 -> 0.9 120 | * csvtk 0.15.0 -> 0.17.0 121 | * stringtie 1.3.4 -> 1.3.6 122 | * subread 1.6.2 -> 1.6.4 123 | * gffread 0.9.9 -> 0.9.12 124 | * multiqc 1.6 -> 1.7 125 | * deeptools 3.2.0 -> 3.2.1 126 | * trim-galore 0.5.0 -> 0.6.1 127 | * qualimap 2.2.2b 128 | * matplotlib 3.0.3 129 | * r-base 3.5.1 130 | 131 | ## [Version 1.2](https://github.com/nf-core/rnaseq/releases/tag/1.2) - 2018-12-12 132 | 133 | ### Pipeline updates 134 | 135 | * Removed some outdated documentation about non-existent features 136 | * Config refactoring and code cleaning 137 | * Added a `--fcExtraAttributes` option to specify more than ENSEMBL gene names in `featureCounts` 138 | * Remove legacy rseqc `strandRule` config code. 
[#119](https://github.com/nf-core/rnaseq/issues/119) 139 | * Added STRINGTIE ballgown output to results folder [#125](https://github.com/nf-core/rnaseq/issues/125) 140 | * HISAT2 index build now requests `200GB` memory, enough to use the exons / splice junction option for building. 141 | * Added documentation about the `--hisatBuildMemory` option. 142 | * BAM indices are stored and re-used between processes [#71](https://github.com/nf-core/rnaseq/issues/71) 143 | 144 | ### Bug Fixes 145 | 146 | * Fixed conda bug which caused problems with environment resolution due to changes in bioconda [#113](https://github.com/nf-core/rnaseq/issues/113) 147 | * Fixed wrong gffread command line [#117](https://github.com/nf-core/rnaseq/issues/117) 148 | * Added `cpus = 1` to `workflow summary process` [#130](https://github.com/nf-core/rnaseq/issues/130) 149 | 150 | ## [Version 1.1](https://github.com/nf-core/rnaseq/releases/tag/1.1) - 2018-10-05 151 | 152 | ### Pipeline updates 153 | 154 | * Wrote docs and made minor tweaks to the `--skip_qc` and associated options 155 | * Removed the deprecated `uppmax-modules` config profile 156 | * Updated the `hebbe` config profile to use the new `withName` syntax too 157 | * Use new `workflow.manifest` variables in the pipeline script 158 | * Updated minimum nextflow version to `0.32.0` 159 | 160 | ### Bug Fixes 161 | 162 | * [#77](https://github.com/nf-core/rnaseq/issues/77): Added back `executor = 'local'` for the `workflow_summary_mqc` 163 | * [#95](https://github.com/nf-core/rnaseq/issues/95): Check if task.memory is false instead of null 164 | * [#97](https://github.com/nf-core/rnaseq/issues/97): Resolved edge-case where numeric sample IDs are parsed as numbers causing some samples to be incorrectly overwritten.
165 | 166 | ## [Version 1.0](https://github.com/nf-core/rnaseq/releases/tag/1.0) - 2018-08-20 167 | 168 | This release marks the point where the pipeline was moved from [SciLifeLab/NGI-RNAseq](https://github.com/SciLifeLab/NGI-RNAseq) 169 | over to the new [nf-core](http://nf-co.re/) community, at [nf-core/rnaseq](https://github.com/nf-core/rnaseq). 170 | 171 | View the previous changelog at [SciLifeLab/NGI-RNAseq/CHANGELOG.md](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/CHANGELOG.md) 172 | 173 | In addition to porting to the new nf-core community, the pipeline has had a number of major changes in this version. 174 | There have been 157 commits by 16 different contributors covering 70 different files in the pipeline: 7,357 additions and 8,236 deletions! 175 | 176 | In summary, the main changes are: 177 | 178 | * Rebranding and renaming throughout the pipeline to nf-core 179 | * Updating many parts of the pipeline config and style to meet nf-core standards 180 | * Support for GFF files in addition to GTF files 181 | * Just use `--gff` instead of `--gtf` when specifying a file path 182 | * New command line options to skip various quality control steps 183 | * More safety checks when launching a pipeline 184 | * Several new sanity checks - for example, that the specified reference genome exists 185 | * Improved performance with memory usage (especially STAR and Picard) 186 | * New BigWig file outputs for plotting coverage across the genome 187 | * Refactored gene body coverage calculation, now much faster and using much less memory 188 | * Bugfixes in the MultiQC process to avoid edge cases where it wouldn't run 189 | * MultiQC report now automatically attached to the email sent when the pipeline completes 190 | * New testing method, with data on GitHub 191 | * Now run pipeline with `-profile test` instead of using bash scripts 192 | * Rewritten continuous integration tests with Travis CI 193 | * New explicit support for Singularity containers 194 | * 
Improved MultiQC support for DupRadar and featureCounts 195 | * Now works for all users instead of just NGI Stockholm 196 | * New configuration for use on AWS batch 197 | * Updated config syntax to support latest versions of Nextflow 198 | * Built-in support for a number of new local HPC systems 199 | * CCGA, GIS, UCT HEX, updates to UPPMAX, CFC, BINAC, Hebbe, c3se 200 | * Slightly improved documentation (more updates to come) 201 | * Updated software packages 202 | 203 | ...and many more minor tweaks. 204 | 205 | Thanks to everyone who has worked on this release! 206 | -------------------------------------------------------------------------------- /docs/output.md: -------------------------------------------------------------------------------- 1 | # nf-core/rnaseq: Output 2 | 3 | This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. 4 | 5 | ## Pipeline overview 6 | The pipeline is built using [Nextflow](https://www.nextflow.io/) 7 | and processes data using the following steps: 8 | 9 | * [FastQC](#fastqc) - read quality control 10 | * [TrimGalore](#trimgalore) - adapter trimming 11 | * [SortMeRNA](#sortmerna) - ribosomal RNA removal 12 | * [STAR](#star) - alignment 13 | * [RSeQC](#rseqc) - RNA quality control metrics 14 | * [BAM stat](#bam-stat) 15 | * [Infer experiment](#infer-experiment) 16 | * [Junction saturation](#junction-saturation) 17 | * [RPKM saturation](#rpkm-saturation) 18 | * [Read duplication](#read-duplication) 19 | * [Inner distance](#inner-distance) 20 | * [Read distribution](#read-distribution) 21 | * [Junction annotation](#junction-annotation) 22 | * [Qualimap](#qualimap) - RNA quality control metrics 23 | * [dupRadar](#dupradar) - technical / biological read duplication 24 | * [Preseq](#preseq) - library complexity 25 | * [featureCounts](#featurecounts) - gene counts, biotype counts, rRNA estimation. 
26 | * [Salmon](#salmon) - gene counts, transcript counts. 27 | * [tximport](#tximport) - gene counts, transcript counts, SummarizedExperiment object. 28 | * [StringTie](#stringtie) - FPKMs for genes and transcripts 29 | * [Sample_correlation](#Sample_correlation) - create MDS plot and sample pairwise distance heatmap / dendrogram 30 | * [MultiQC](#multiqc) - aggregate report, describing results of the whole pipeline 31 | 32 | ## FastQC 33 | [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your reads. It provides information about the quality score distribution across your reads and the per-base sequence content (%T/A/G/C). You also get information about adapter contamination and other overrepresented sequences. 34 | 35 | For further reading and documentation see the [FastQC help](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). 36 | 37 | > **NB:** The FastQC plots displayed in the MultiQC report show _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the `trim_galore` directory. 38 | 39 | **Output directory: `results/fastqc`** 40 | 41 | * `sample_fastqc.html` 42 | * FastQC report, containing quality metrics for your untrimmed raw fastq files 43 | * `zips/sample_fastqc.zip` 44 | * zip file containing the FastQC report, tab-delimited data file and plot images 45 | 46 | ## TrimGalore 47 | The nf-core/rnaseq pipeline uses [TrimGalore](http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) for removal of adapter contamination and trimming of low-quality regions. TrimGalore uses [Cutadapt](https://github.com/marcelm/cutadapt) for adapter trimming and runs FastQC after it finishes. 48 | 49 | MultiQC reports the percentage of bases removed by TrimGalore in the _General Statistics_ table, along with a line plot showing where reads were trimmed.
50 | 51 | **Output directory: `results/trim_galore`** 52 | 53 | Contains FastQ files with quality and adapter trimmed reads for each sample, along with a log file describing the trimming. 54 | 55 | * `sample_val_1.fq.gz`, `sample_val_2.fq.gz` 56 | * Trimmed FastQ data, reads 1 and 2. 57 | * NB: Only saved if `--saveTrimmed` has been specified. 58 | * `logs/sample_val_1.fq.gz_trimming_report.txt` 59 | * Trimming report (describes the parameters that were used) 60 | * `FastQC/sample_val_1_fastqc.zip` 61 | * FastQC report for trimmed reads 62 | 63 | Single-end data will have slightly different file names and only one FastQ file per sample. 64 | 65 | ## SortMeRNA 66 | 67 | When `--removeRiboRNA` is specified, the nf-core/rnaseq pipeline uses [SortMeRNA](https://github.com/biocore/sortmerna) for removal of rRNA. SortMeRNA requires reference sequences, which by default come from the [SILVA database](https://www.arb-silva.de/). 68 | 69 | **Output directory: `results/SortMeRNA`** 70 | 71 | Contains FastQ files with rRNA-depleted reads for each sample, along with a log file describing the filtering. 72 | 73 | * `reads/sample-fw.fq.gz`, `reads/sample-rv.fq.gz` 74 | * Trimmed and rRNA-depleted FastQ data, forward and reverse reads. 75 | * NB: Only saved if `--save_nonrRNA_reads` has been specified. 76 | * `logs/sample_rRNA_report.txt` 77 | * Reports how many reads were removed due to matches to the reference database(s). 78 | 79 | Single-end data will have slightly different file names (`reads/sample.fq.gz`) and only one FastQ file per sample. 80 | 81 | ## STAR 82 | STAR is a read aligner designed for RNA sequencing. STAR stands for Spliced Transcripts Alignment to a Reference; it produces results comparable to TopHat (the aligner previously used by NGI for RNA alignments) but is much faster. 83 | 84 | The STAR section of the MultiQC report shows a bar plot with alignment rates: good samples should have most reads as _Uniquely mapped_ and few _Unmapped_ reads.
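These headline numbers come from STAR's `Log.final.out` report and can also be pulled out with a short script. The sketch below is purely illustrative: the `name |<tab>value` layout is assumed from a typical STAR report, and the example values are made up, not output from this pipeline.

```python
# Minimal sketch: extract alignment rates from a STAR Log.final.out report.
# The "name |<tab>value" layout and the example values below are assumed
# from a typical STAR run, not taken from this pipeline's output.

def parse_star_log(text):
    """Return a dict of metric name -> value for lines shaped 'name | value'."""
    metrics = {}
    for line in text.splitlines():
        if "|" in line:
            name, _, value = line.partition("|")
            metrics[name.strip()] = value.strip()
    return metrics

example_log = (
    "                   Uniquely mapped reads number |\t28000000\n"
    "                        Uniquely mapped reads % |\t92.46%\n"
    "             % of reads mapped to multiple loci |\t4.10%\n"
)

metrics = parse_star_log(example_log)
unique_pct = float(metrics["Uniquely mapped reads %"].rstrip("%"))
print(unique_pct)  # 92.46
```

This is the same information MultiQC aggregates across samples for the bar plot shown below.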
85 | 86 | ![STAR](images/star_alignment_plot.png) 87 | 88 | **Output directory: `results/STAR`** 89 | 90 | * `Sample_Aligned.sortedByCoord.out.bam` 91 | * The aligned BAM file 92 | * `Sample_Log.final.out` 93 | * The STAR alignment report, containing a summary of the mapping results 94 | * `Sample_Log.out` and `Sample_Log.progress.out` 95 | * STAR log files, containing a lot of detailed information about the run. Typically only useful for debugging purposes. 96 | * `Sample_SJ.out.tab` 97 | * Filtered splice junctions detected in the mapping 98 | * `unaligned/...` 99 | * Contains the reads that could not be mapped to the chosen reference genome. This is only available when the user specifically asks for `--saveUnaligned` output. 100 | 101 | ## RSeQC 102 | 103 | RSeQC is a package of scripts designed to evaluate the quality of RNA-seq data. You can find out more about the package at the [RSeQC website](http://rseqc.sourceforge.net/). 104 | 105 | This pipeline runs several, but not all, RSeQC scripts. All of these results are summarised within the MultiQC report and described below. 106 | 107 | **Output directory: `results/rseqc`** 108 | 109 | These are all quality metrics files and contain the raw data used for the plots in the MultiQC report. In general, the `.r` files are R scripts for generating the figures, the `.txt` are summary files, the `.xls` are data tables and the `.pdf` files are summary figures. 110 | 111 | ### BAM stat 112 | **Output: `Sample_bam_stat.txt`** 113 | 114 | This script gives numerous statistics about the aligned BAM files produced by STAR.
A typical output looks as follows: 115 | 116 | ```txt 117 | #Output (all numbers are read count) 118 | #================================================== 119 | Total records: 41465027 120 | QC failed: 0 121 | Optical/PCR duplicate: 0 122 | Non Primary Hits 8720455 123 | Unmapped reads: 0 124 | 125 | mapq < mapq_cut (non-unique): 3127757 126 | mapq >= mapq_cut (unique): 29616815 127 | Read-1: 14841738 128 | Read-2: 14775077 129 | Reads map to '+': 14805391 130 | Reads map to '-': 14811424 131 | Non-splice reads: 25455360 132 | Splice reads: 4161455 133 | Reads mapped in proper pairs: 21856264 134 | Proper-paired reads map to different chrom: 7648 135 | ``` 136 | 137 | MultiQC plots each of these statistics in a dot plot. Each sample in the project is a dot - hover to see the sample highlighted across all fields. 138 | 139 | RSeQC documentation: [bam_stat.py](http://rseqc.sourceforge.net/#bam-stat-py) 140 | 141 | ### Infer experiment 142 | **Output: `Sample_infer_experiment.txt`** 143 | 144 | This script predicts the mode of library preparation (sense-stranded or antisense-stranded) according to how aligned reads overlay gene features in the reference genome. 
145 | Example output from an unstranded (~50% sense/antisense) library of paired-end data: 146 | 147 | **From MultiQC report:** 148 | ![infer_experiment](images/rseqc_infer_experiment_plot.png) 149 | 150 | **From the `infer_experiment.txt` file:** 151 | 152 | ```txt 153 | This is PairEnd Data 154 | Fraction of reads failed to determine: 0.0409 155 | Fraction of reads explained by "1++,1--,2+-,2-+": 0.4839 156 | Fraction of reads explained by "1+-,1-+,2++,2--": 0.4752 157 | ``` 158 | 159 | RSeQC documentation: [infer_experiment.py](http://rseqc.sourceforge.net/#infer-experiment-py) 160 | 161 | 162 | ### Junction saturation 163 | **Output:** 164 | * `Sample_rseqc.junctionSaturation_plot.pdf` 165 | * `Sample_rseqc.junctionSaturation_plot.r` 166 | 167 | This script shows the number of splice sites detected in the data at various levels of subsampling. A sample that reaches a plateau before getting to 100% data indicates that all junctions in the library have been detected, and that further sequencing will not yield more observations. A good sample should approach such a plateau of _Known junctions_; very deep sequencing is typically required to saturate all _Novel Junctions_ in a sample. 168 | 169 | None of the lines in this example have plateaued and thus these samples could reveal more alternative splicing information if they were sequenced deeper. 170 | 171 | ![Junction saturation](images/rseqc_junction_saturation_plot.png) 172 | 173 | RSeQC documentation: [junction_saturation.py](http://rseqc.sourceforge.net/#junction-saturation-py) 174 | 175 | ### RPKM saturation 176 | **Output:** 177 | 178 | * `Sample_RPKM_saturation.eRPKM.xls` 179 | * `Sample_RPKM_saturation.rawCount.xls` 180 | * `Sample_RPKM_saturation.saturation.pdf` 181 | * `Sample_RPKM_saturation.saturation.r` 182 | 183 | This tool resamples a subset of the total RNA reads and calculates the RPKM value for each subset. We use the default subsets of every 5% of the total reads.
184 | A percent relative error is then calculated based on the subsamples; this is the y-axis in the graph. A typical PDF figure looks as follows: 185 | 186 | ![RPKM saturation](images/saturation.png) 187 | 188 | A complex library will have low resampling error in well expressed genes. 189 | 190 | This data is not currently reported in the MultiQC report. 191 | 192 | RSeQC documentation: [RPKM_saturation.py](http://rseqc.sourceforge.net/#rpkm-saturation-py) 193 | 194 | 195 | ### Read duplication 196 | **Output:** 197 | 198 | * `Sample_read_duplication.DupRate_plot.pdf` 199 | * `Sample_read_duplication.DupRate_plot.r` 200 | * `Sample_read_duplication.pos.DupRate.xls` 201 | * `Sample_read_duplication.seq.DupRate.xls` 202 | 203 | This plot shows the number of reads (y-axis) with a given number of exact duplicates (x-axis). Most reads in an RNA-seq library should have a low number of exact duplicates. Samples which have many reads with many duplicates (a large area under the curve) may be suffering excessive technical duplication. 204 | 205 | ![Read duplication](images/rseqc_read_dups_plot.png) 206 | 207 | RSeQC documentation: [read_duplication.py](http://rseqc.sourceforge.net/#read-duplication-py) 208 | 209 | ### Inner distance 210 | **Output:** 211 | 212 | * `Sample_rseqc.inner_distance.txt` 213 | * `Sample_rseqc.inner_distance_freq.txt` 214 | * `Sample_rseqc.inner_distance_plot.r` 215 | 216 | The inner distance script tries to calculate the inner distance between two paired RNA reads. It is the distance between the end of read 1 to the start of read 2, 217 | and it is sometimes confused with the insert size (see [this blog post](http://thegenomefactory.blogspot.com.au/2013/08/paired-end-read-confusion-library.html) for disambiguation): 218 | ![inner distance concept](images/inner_distance_concept.png) 219 | > _Credit: modified from RSeQC documentation._ 220 | 221 | Note that values can be negative if the reads overlap. 
A typical set of samples may look like this: 222 | ![Inner distance](images/rseqc_inner_distance_plot.png) 223 | 224 | This plot will not be generated for single-end data. Very short inner distances are often seen in old or degraded samples (_eg._ FFPE). 225 | 226 | RSeQC documentation: [inner_distance.py](http://rseqc.sourceforge.net/#inner-distance-py) 227 | 228 | ### Read distribution 229 | **Output: `Sample_read_distribution.txt`** 230 | 231 | This tool calculates how mapped reads are distributed over genomic features. A good result for a standard RNA seq experiments is generally to have as many exonic reads as possible (`CDS_Exons`). A large amount of intronic reads could be indicative of DNA contamination in your sample or some other problem. 232 | 233 | ![Read distribution](images/rseqc_read_distribution_plot.png) 234 | 235 | RSeQC documentation: [read_distribution.py](http://rseqc.sourceforge.net/#read-distribution-py) 236 | 237 | 238 | ### Junction annotation 239 | **Output:** 240 | 241 | * `Sample_junction_annotation_log.txt` 242 | * `Sample_rseqc.junction.xls` 243 | * `Sample_rseqc.junction_plot.r` 244 | * `Sample_rseqc.splice_events.pdf` 245 | * `Sample_rseqc.splice_junction.pdf` 246 | 247 | Junction annotation compares detected splice junctions to a reference gene model. An RNA read can be spliced 2 or more times, each time is called a splicing event. 248 | 249 | ![Junction annotation](images/rseqc_junction_annotation_junctions_plot.png) 250 | 251 | RSeQC documentation: [junction_annotation.py](http://rseqc.sourceforge.net/#junction-annotation-py) 252 | 253 | ## Qualimap 254 | [Qualimap](http://qualimap.bioinfo.cipf.es/) is a standalone package written in java. It calculates read alignment assignment, transcript coverage, read genomic origin, junction analysis and 3'-5' bias. 
255 | 256 | **Output directory: `results/qualimap`** 257 | 258 | * `rnaseq_qc_results.txt` 259 | * `qualimapReport.html` 260 | * `css` 261 | * `raw_data_qualimapReport` 262 | * `images_qualimapReport` 263 | 264 | Qualimap RNAseq documentation: [Qualimap docs](http://qualimap.bioinfo.cipf.es/doc_html/analysis.html#rna-seq-qc). 265 | 266 | ## dupRadar 267 | [dupRadar](https://www.bioconductor.org/packages/release/bioc/html/dupRadar.html) is a Bioconductor library for R. It plots the duplication rate against expression (RPKM) for every gene. A good sample with little technical duplication will only show high numbers of duplicates for highly expressed genes. Samples with technical duplication will have high duplication for all genes, irrespective of transcription level. 268 | 269 | ![dupRadar](images/dupRadar_plot.png) 270 | > _Credit: [dupRadar documentation](https://www.bioconductor.org/packages/devel/bioc/vignettes/dupRadar/inst/doc/dupRadar.html)_ 271 | 272 | **Output directory: `results/dupRadar`** 273 | 274 | * `Sample_markDups.bam_duprateExpDens.pdf` 275 | * `Sample_markDups.bam_duprateExpBoxplot.pdf` 276 | * `Sample_markDups.bam_expressionHist.pdf` 277 | * `Sample_markDups.bam_dupMatrix.txt` 278 | * `Sample_markDups.bam_duprateExpDensCurve.txt` 279 | * `Sample_markDups.bam_intercept_slope.txt` 280 | 281 | DupRadar documentation: [dupRadar docs](https://www.bioconductor.org/packages/devel/bioc/vignettes/dupRadar/inst/doc/dupRadar.html) 282 | 283 | ## Preseq 284 | [Preseq](http://smithlabresearch.org/software/preseq/) estimates the complexity of a library, showing how many additional unique reads are sequenced for increasing the total read count. A shallow curve indicates that the library has reached complexity saturation and further sequencing would likely not add further unique reads. The dashed line shows a perfectly complex library where total reads = unique reads. 285 | 286 | Note that these are predictive numbers only, not absolute. 
The MultiQC plot can sometimes give extreme sequencing depth on the X axis - click and drag from the left side of the plot to zoom in on more realistic numbers. 287 | 288 | ![preseq](images/preseq_plot.png) 289 | 290 | **Output directory: `results/preseq`** 291 | 292 | * `sample_ccurve.txt` 293 | * This file contains plot values for the complexity curve, plotted in the MultiQC report. 294 | 295 | ## featureCounts 296 | [featureCounts](http://bioinf.wehi.edu.au/featureCounts/) from the subread package summarises the read distribution over genomic features such as genes, exons, promoters, gene bodies, genomic bins and chromosomal locations. 297 | RNA reads should mostly overlap genes, and so be assigned to them. 298 | 299 | ![featureCounts](images/featureCounts_assignment_plot.png) 300 | 301 | We also use featureCounts to count overlaps with different classes of features. This gives a good idea of where aligned reads are ending up and can show potential problems such as rRNA contamination. 302 | ![biotypes](images/featureCounts_biotype_plot.png) 303 | 304 | **Output directory: `results/featureCounts`** 305 | 306 | * `Sample.bam_biotype_counts.txt` 307 | * Read counts for the different gene biotypes that featureCounts distinguishes. 308 | * `Sample.featureCounts.txt` 309 | * Read counts for each gene provided in the reference `gtf` file 310 | * `Sample.featureCounts.txt.summary` 311 | * Summary file, containing statistics about the counts 312 | 313 | ## Salmon 314 | [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) from [Ocean Genomics](https://oceangenomics.com/) quasi-maps and quantifies expression relative to the transcriptome. 315 | 316 | **Output directory: `results/salmon`** 317 | 318 | * `Sample/quant.sf` 319 | * Read counts for the different transcripts.
320 | * `Sample/quant.genes.sf` 321 | * Read counts for each gene provided in the reference `gtf` file 322 | * `Sample/logs` 323 | * Summary file with information about the process 324 | * `unaligned/` 325 | * Contains a list of unmapped reads that can be used to generate a FastQ of unmapped reads for downstream analysis. 326 | 327 | ## tximport 328 | [tximport](https://bioconductor.org/packages/release/bioc/html/tximport.html) imports transcript-level abundance, estimated counts and transcript lengths, and summarizes them into matrices for use with downstream gene-level analysis packages. The average transcript length, weighted by sample-specific transcript abundance estimates, is provided as a matrix which can be used as an offset for differential expression analysis of gene-level counts. 329 | 330 | **Output directory: `results/salmon`** 331 | 332 | * `salmon_merged_transcript_tpm.csv` 333 | * TPM counts for the different transcripts. 334 | * `salmon_merged_gene_tpm.csv` 335 | * TPM counts for the different genes. 336 | * `salmon_merged_transcript_counts.csv` 337 | * Estimated counts for the different transcripts. 338 | * `salmon_merged_gene_counts.csv` 339 | * Estimated counts for the different genes. 340 | * `tx2gene.csv` 341 | * CSV file with transcript IDs, gene IDs (`params.fc_group_features`) and extra names (`params.fc_extra_attributes`) in each column. 342 | * `se.rds` 343 | * RDS object to be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) with the TPM (`abundance`), estimated counts (`counts`) and transcript length (`length`) in the assays slot for transcripts. 344 | * `gse.rds` 345 | * RDS object to be loaded in R that contains a [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) with the TPM (`abundance`), estimated counts (`counts`) and transcript length (`length`) in the assays slot for genes.
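At its core, the transcript-to-gene summarization described above amounts to grouping transcript rows by the `tx2gene` mapping and summing per gene; the real package additionally derives the average-transcript-length offset matrix. A stdlib-only sketch with made-up IDs and counts:

```python
# Sketch of tximport's transcript -> gene count summarization: group estimated
# transcript counts by a tx2gene mapping and sum per gene. The IDs and counts
# below are made up; the real package also computes per-gene length offsets.
from collections import defaultdict

tx2gene = {"ENST01": "ENSG01", "ENST02": "ENSG01", "ENST03": "ENSG02"}
tx_counts = {"ENST01": 10.0, "ENST02": 5.0, "ENST03": 7.0}

gene_counts = defaultdict(float)
for tx, count in tx_counts.items():
    gene_counts[tx2gene[tx]] += count

print(dict(gene_counts))  # {'ENSG01': 15.0, 'ENSG02': 7.0}
```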
346 | 347 | 348 | ### Index files 349 | 350 | **Output directory: `results/reference_genome/salmon_index`** 351 | 352 | * `duplicate_clusters.tsv` 353 | * Stores which transcripts are duplicates of one another 354 | * `hash.bin` 355 | * `header.json` 356 | * Information about k-mer size, uniquely identifying hashes for the reference 357 | * `indexing.log` 358 | * Time log for creating transcriptome index 359 | * `quasi_index.log` 360 | * Step-by-step log for making transcriptome index 361 | * `refInfo.json` 362 | * Information about the file used for the reference 363 | * `rsd.bin` 364 | * `sa.bin` 365 | * `txpInfo.bin` 366 | * `versionInfo.json` 367 | * Salmon and indexing versions used to make the index 368 | 369 | ### Quantification output 370 | 371 | **Output directory: `results/salmon`** 372 | 373 | * `aux_info/` 374 | * Auxiliary info e.g. versions and number of mapped reads 375 | * `cmd_info.json` 376 | * Information about the Salmon quantification command, version, and options 377 | * `lib_format_counts.json` 378 | * Number of fragments assigned, unassigned and incompatible 379 | * `libParams/` 380 | * Contains the file `flenDist.txt` for the fragment length distribution 381 | * `logs/` 382 | * Contains the file `salmon_quant.log` giving a record of Salmon's quantification 383 | * `quant.sf` 384 | * *Transcript*-level quantification of the sample, including gene length, effective length, TPM, and number of reads 385 | * `quant.genes.sf` 386 | * *Gene*-level quantification of the sample, including gene length, effective length, TPM, and number of reads 387 | * `Sample.transcript.tpm.txt` 388 | * Subset of `quant.sf`, only containing the transcript id and TPM values 389 | * `Sample.gene.tpm.txt` 390 | * Subset of `quant.genes.sf`, only containing the gene id and TPM values 391 | 392 | ## StringTie 393 | [StringTie](https://ccb.jhu.edu/software/stringtie/) assembles RNA-Seq alignments into potential transcripts.
It assembles and quantitates full-length transcripts representing multiple splice variants for each gene locus. 394 | 395 | StringTie outputs FPKM metrics for genes and transcripts as well as the transcript features that it generates. 396 | 397 | **Output directory: `results/stringtie`** 398 | 399 | * `_Aligned.sortedByCoord.out.bam.gene_abund.txt` 400 | * Gene abundances, FPKM values 401 | * `_Aligned.sortedByCoord.out.bam_transcripts.gtf` 402 | * This `.gtf` file contains all of the assembled transcripts from StringTie 403 | * `_Aligned.sortedByCoord.out.bam.cov_refs.gtf` 404 | * This `.gtf` file contains the transcripts that are fully covered by reads. 405 | 406 | ## Sample Correlation 407 | [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) is a Bioconductor package for R used for RNA-seq data analysis. The script included in the pipeline uses edgeR to normalise read counts and create a heatmap showing Pearson's correlation and a dendrogram showing pairwise Euclidean distances between the samples in the experiment. It also creates a 2D MDS scatter plot showing sample grouping. These help to show sample similarity and can reveal batch effects and sample groupings.
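The idea behind the log2 CPM correlation can be sketched without edgeR: scale each sample's counts to counts-per-million, log-transform, then correlate samples. This toy version uses plain library-size scaling (edgeR's real normalisation also applies TMM factors, omitted here) and made-up counts:

```python
# Rough sketch of the sample-correlation step: counts -> log2 CPM (plain
# library-size scaling; edgeR's real normalisation also applies TMM factors),
# then Pearson correlation between two samples. All counts are made up.
import math

def log2_cpm(counts, prior=0.5):
    """Counts-per-million on log2 scale, with a small prior to avoid log(0)."""
    total = sum(counts)
    return [math.log2((c + prior) / total * 1e6) for c in counts]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

sample_a = log2_cpm([100, 200, 50, 400])
sample_b = log2_cpm([110, 190, 55, 380])
r = pearson(sample_a, sample_b)
print(round(r, 3))  # close to 1.0 for these two similar samples
```

In the pipeline these pairwise correlations fill the heatmap below, while Euclidean distances on the same log2 CPM values drive the dendrogram.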
408 | 409 | **Heatmap:** 410 | 411 | ![heatmap](images/mqc_hcplot_hocmzpdjsq.png) 412 | 413 | **MDS plot:** 414 | 415 | ![mds_plot](images/mqc_hcplot_ltqchiyxfz.png) 416 | 417 | **Output directory: `results/sample_correlation`** 418 | 419 | * `edgeR_MDS_plot.pdf` 420 | * MDS scatter plot showing sample similarity 421 | * `edgeR_MDS_distance_matrix.csv` 422 | * Distance matrix containing raw data from MDS analysis 423 | * `edgeR_MDS_Aplot_coordinates_mqc.csv` 424 | * Scatter plot coordinates from MDS plot, used for MultiQC report 425 | * `log2CPM_sample_distances_dendrogram.pdf` 426 | * Dendrogram showing the Euclidean distance between your samples 427 | * `log2CPM_sample_correlation_heatmap.pdf` 428 | * Heatmap showing the Pearson's correlation between your samples 429 | * `log2CPM_sample_correlation_mqc.csv` 430 | * Raw data from Pearson's correlation heatmap, used for MultiQC report 431 | 432 | ## MultiQC 433 | [MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available within the report data directory. 434 | 435 | The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
436 | 437 | **Output directory: `results/multiqc`** 438 | 439 | * `Project_multiqc_report.html` 440 | * MultiQC report - a standalone HTML file that can be viewed in your web browser 441 | * `Project_multiqc_data/` 442 | * Directory containing parsed statistics from the different tools used in the pipeline 443 | 444 | For more information about how to use MultiQC reports, see [http://multiqc.info](http://multiqc.info) 445 | -------------------------------------------------------------------------------- /parameters.settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "parameters": [ 3 | { 4 | "name": "reads", 5 | "label": "Input files", 6 | "usage": "Specify the location of your input FastQ files.", 7 | "group": "Main options", 8 | "default_value": "'data/*{1,2}.fastq.gz'", 9 | "render": "textfield", 10 | "pattern": ".*\\*.*", 11 | "type": "string" 12 | }, 13 | { 14 | "name": "singleEnd", 15 | "label": "Single-end sequencing input", 16 | "usage": "Use single-end sequencing inputs instead of paired-end.", 17 | "group": "Main options", 18 | "render": "check-box", 19 | "default_value": false, 20 | "type": "boolean" 21 | }, 22 | { 23 | "name": "genome", 24 | "label": "Alignment reference iGenomes key", 25 | "usage": "Ref. 
genome key for iGenomes", 26 | "group": "Alignment", 27 | "render": "drop-down", 28 | "type": "string", 29 | "choices": [ 30 | "", 31 | "GRCh37", 32 | "GRCm38", 33 | "TAIR10", 34 | "EB2", 35 | "UMD3.1", 36 | "WBcel235", 37 | "CanFam3.1", 38 | "GRCz10", 39 | "BDGP6", 40 | "EquCab2", 41 | "EB1", 42 | "Galgal4", 43 | "Gm01", 44 | "Mmul_1", 45 | "IRGSP-1.0", 46 | "CHIMP2.1.4", 47 | "Rnor_6.0", 48 | "R64-1-1", 49 | "EF2", 50 | "Sbi1", 51 | "Sscrofa10.2", 52 | "AGPv3" 53 | ], 54 | "default_value": "" 55 | }, 56 | { 57 | "name": "star_index", 58 | "label": "STAR index", 59 | "usage": "Path to STAR index", 60 | "group": "Alignment", 61 | "render": "file", 62 | "type": "string", 63 | "pattern": ".*", 64 | "default_value": "" 65 | }, 66 | { 67 | "name": "hisat2_index", 68 | "label": "HISAT2 index", 69 | "usage": "Path to HiSAT2 index", 70 | "group": "Alignment", 71 | "render": "file", 72 | "type": "string", 73 | "pattern": ".*", 74 | "default_value": "" 75 | }, 76 | { 77 | "name": "salmon_index", 78 | "label": "Salmon index", 79 | "usage": "Path to Salmon index", 80 | "group": "Alignment", 81 | "render": "file", 82 | "type": "string", 83 | "pattern": ".*", 84 | "default_value": "" 85 | }, 86 | { 87 | "name": "fasta", 88 | "label": "FASTA", 89 | "usage": "Path to Fasta reference", 90 | "group": "Alignment", 91 | "render": "file", 92 | "type": "string", 93 | "pattern": ".*", 94 | "default_value": "" 95 | }, 96 | { 97 | "name": "transcript_fasta", 98 | "label": "FASTA", 99 | "usage": "Path to transcript fasta file", 100 | "group": "Alignment", 101 | "render": "file", 102 | "type": "string", 103 | "pattern": ".*", 104 | "default_value": "" 105 | }, 106 | { 107 | "name": "splicesites", 108 | "label": "HISAT2 splice sites file", 109 | "usage": "Optional splice-sites file for building a HISAT2 alignment index", 110 | "group": "Alignment", 111 | "render": "file", 112 | "type": "string", 113 | "pattern": ".*", 114 | "default_value": "" 115 | }, 116 | { 117 | "name": "gtf", 118 | 
"label": "GTF", 119 |         "usage": "Path to GTF file", 120 |         "group": "Alignment", 121 |         "render": "file", 122 |         "type": "string", 123 |         "pattern": ".*", 124 |         "default_value": "" 125 |     }, 126 |     { 127 |         "name": "gff", 128 |         "label": "GFF", 129 |         "usage": "Path to GFF3 file", 130 |         "group": "Alignment", 131 |         "render": "file", 132 |         "type": "string", 133 |         "pattern": ".*", 134 |         "default_value": "" 135 |     }, 136 |     { 137 |         "name": "bed12", 138 |         "label": "BED12", 139 |         "usage": "Path to BED12 file", 140 |         "group": "Alignment", 141 |         "render": "file", 142 |         "type": "string", 143 |         "pattern": ".*", 144 |         "default_value": "" 145 |     }, 146 |     { 147 |         "name": "saveReference", 148 |         "label": "Save reference genome index", 149 |         "usage": "Save the generated reference files to the results directory.", 150 |         "group": "Pipeline defaults", 151 |         "render": "check-box", 152 |         "default_value": false, 153 |         "type": "boolean" 154 |     }, 155 |     { 156 |         "name": "forwardStranded", 157 |         "label": "Forward stranded", 158 |         "usage": "Samples were prepared using a forward-stranded library type.", 159 |         "group": "Main options", 160 |         "render": "check-box", 161 |         "default_value": false, 162 |         "type": "boolean" 163 |     }, 164 |     { 165 |         "name": "reverseStranded", 166 |         "label": "Reverse stranded", 167 |         "usage": "Samples were prepared using a reverse-stranded library type.", 168 |         "group": "Main options", 169 |         "render": "check-box", 170 |         "default_value": false, 171 |         "type": "boolean" 172 |     }, 173 |     { 174 |         "name": "unStranded", 175 |         "label": "Unstranded", 176 |         "usage": "Force the library strandedness to be unstranded", 177 |         "render": "none", 178 |         "default_value": false, 179 |         "type": "boolean", 180 |         "group": "Advanced" 181 |     }, 182 |     { 183 |         "name": "clip_r1", 184 |         "label": "Read Clipping: 5' R1", 185 |         "usage": "Instructs Trim Galore to remove bp from the 5' end of read 1 (or single-end reads).", 186 |         "group": "Read trimming", 187 |         "render": "textfield", 188 |         "pattern": "\\d*", 189 |         "type": "integer", 
190 |         "default_value": 0 191 |     }, 192 |     { 193 |         "name": "clip_r2", 194 |         "label": "Read Clipping: 5' R2", 195 |         "usage": "Instructs Trim Galore to remove bp from the 5' end of read 2 (paired-end reads only).", 196 |         "group": "Read trimming", 197 |         "render": "textfield", 198 |         "pattern": "\\d*", 199 |         "type": "integer", 200 |         "default_value": 0 201 |     }, 202 |     { 203 |         "name": "three_prime_clip_r1", 204 |         "label": "Read Clipping: 3' R1", 205 |         "usage": "Instructs Trim Galore to remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed.", 206 |         "group": "Read trimming", 207 |         "render": "textfield", 208 |         "pattern": "\\d*", 209 |         "type": "integer", 210 |         "default_value": 0 211 |     }, 212 |     { 213 |         "name": "three_prime_clip_r2", 214 |         "label": "Read Clipping: 3' R2", 215 |         "usage": "Instructs Trim Galore to remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.", 216 |         "group": "Read trimming", 217 |         "render": "textfield", 218 |         "pattern": "\\d*", 219 |         "type": "integer", 220 |         "default_value": 0 221 |     }, 222 |     { 223 |         "name": "trim_nextseq", 224 |         "label": "NextSeq Trimming", 225 |         "usage": "This enables the option --nextseq-trim=3'CUTOFF within Cutadapt in Trim Galore, which will set a quality cutoff (that is normally given with -q instead), but qualities of G bases are ignored. 
This type of trimming is common on the NextSeq and NovaSeq platforms, where basecalls without any signal are called as high-quality G bases.", 226 |         "group": "Read trimming", 227 |         "render": "textfield", 228 |         "pattern": "\\d*", 229 |         "type": "integer", 230 |         "default_value": 0 231 |     }, 232 |     { 233 |         "name": "pico", 234 |         "label": "Library type: Pico", 235 |         "usage": "Set trimming and strandedness settings for the SMARTer Stranded Total RNA-Seq Kit - Pico Input kit.", 236 |         "group": "Main options", 237 |         "render": "check-box", 238 |         "default_value": false, 239 |         "type": "boolean" 240 |     }, 241 |     { 242 |         "name": "saveTrimmed", 243 |         "label": "Save Trimmed FastQ files", 244 |         "usage": "Save the trimmed FastQ files to the results directory.", 245 |         "group": "Pipeline defaults", 246 |         "render": "check-box", 247 |         "default_value": false, 248 |         "type": "boolean" 249 |     }, 250 |     { 251 |         "name": "aligner", 252 |         "label": "Alignment tool", 253 |         "usage": "Choose whether to align reads with STAR or HISAT2", 254 |         "type": "string", 255 |         "render": "radio-button", 256 |         "choices": [ 257 |             "star", 258 |             "hisat2" 259 |         ], 260 |         "default_value": "star", 261 |         "group": "Alignment" 262 |     }, 263 |     { 264 |         "name": "removeRiboRNA", 265 |         "label": "Remove ribosomal RNA", 266 |         "usage": "Choose whether to remove rRNA or not", 267 |         "type": "boolean", 268 |         "render": "check-box", 269 |         "default_value": false, 270 |         "group": "rRNA Removal Settings" 271 |     }, 272 |     { 273 |         "name": "saveNonRiboRNAReads", 274 |         "label": "Save non-ribosomal RNA reads as FastQ to results", 275 |         "usage": "By default, the pipeline doesn't save non-rRNA FastQ files.", 276 |         "default_value": false, 277 |         "type": "boolean", 278 |         "render": "check-box", 279 |         "group": "rRNA Removal Settings" 280 |     }, 281 |     { 282 |         "name": "rRNA_database_manifest", 283 |         "label": "Specify path to rRNA database manifest file", 284 |         "usage": "By default, the pipeline uses a predefined SILVA list. 
Users can specify their own if necessary.", 285 |         "pattern": ".*", 286 |         "type": "string", 287 |         "default_value": "", 288 |         "group": "rRNA Removal Settings" 289 |     }, 290 |     { 291 |         "name": "pseudo_aligner", 292 |         "label": "Pseudo alignment tool", 293 |         "usage": "Choose whether to pseudo-align reads with Salmon", 294 |         "type": "string", 295 |         "render": "radio-button", 296 |         "choices": [ "salmon" ], 297 |         "default_value": "", 298 |         "group": "Alignment" 299 |     }, 300 |     { 301 |         "name": "stringTieIgnoreGTF", 302 |         "label": "Alignment options", 303 |         "usage": "Perform reference-guided de novo assembly of transcripts using StringTie, i.e. don't restrict to transcripts in the GTF file.", 304 |         "group": "Alignment", 305 |         "render": "check-box", 306 |         "default_value": false, 307 |         "type": "boolean" 308 |     }, 309 |     { 310 |         "name": "seq_center", 311 |         "label": "Sequencing center", 312 |         "usage": "Add the sequencing center to the @RG line of the output BAM header", 313 |         "group": "Advanced", 314 |         "render": "textfield", 315 |         "pattern": ".*", 316 |         "type": "string", 317 |         "default_value": "" 318 |     }, 319 |     { 320 |         "name": "saveAlignedIntermediates", 321 |         "label": "Save Aligned Intermediate BAM files", 322 |         "usage": "Save intermediate BAM files to the results directory.", 323 |         "group": "Pipeline defaults", 324 |         "render": "check-box", 325 |         "default_value": false, 326 |         "type": "boolean" 327 |     }, 328 |     { 329 |         "name": "fc_group_features", 330 |         "label": "FeatureCounts Group Features", 331 |         "usage": "By default, the pipeline uses `gene_id` as the gene identifier group. 
Specifying `--fc_group_features` uses a different category present in your provided GTF file.", 332 |         "default_value": "gene_id", 333 |         "render": "textfield", 334 |         "pattern": ".*", 335 |         "type": "string", 336 |         "group": "FeatureCount settings" 337 |     }, 338 |     { 339 |         "name": "fc_group_features_type", 340 |         "label": "FeatureCounts Group Features Biotype", 341 |         "usage": "GTF attribute name that gives the biotype of a feature.", 342 |         "group": "FeatureCount settings", 343 |         "default_value": "gene_biotype", 344 |         "render": "textfield", 345 |         "pattern": ".*", 346 |         "type": "string" 347 |     }, 348 |     { 349 |         "name": "fc_extra_attributes", 350 |         "label": "FeatureCounts Extra Gene Names", 351 |         "usage": "By default, the pipeline uses `gene_name` as an additional gene identifier apart from Ensembl identifiers. The value of `--fc_extra_attributes` is passed to featureCounts as the `--extraAttributes` parameter", 352 |         "render": "textfield", 353 |         "pattern": ".*", 354 |         "type": "string", 355 |         "default_value": "", 356 |         "group": "FeatureCount settings" 357 |     }, 358 |     { 359 |         "name": "skipQC", 360 |         "label": "Skip all QC steps, apart from MultiQC", 361 |         "render": "check-box", 362 |         "default_value": false, 363 |         "type": "boolean", 364 |         "group": "Skip pipeline steps" 365 |     }, 366 |     { 367 |         "name": "skipFastQC", 368 |         "label": "Skip FastQC", 369 |         "render": "check-box", 370 |         "default_value": false, 371 |         "type": "boolean", 372 |         "group": "Skip pipeline steps" 373 |     }, 374 |     { 375 |         "name": "skipPreseq", 376 |         "label": "Skip Preseq analysis", 377 |         "render": "check-box", 378 |         "default_value": false, 379 |         "type": "boolean", 380 |         "group": "Skip pipeline steps" 381 |     }, 382 |     { 383 |         "name": "skipDupRadar", 384 |         "label": "Skip DupRadar QC", 385 |         "render": "check-box", 386 |         "default_value": false, 387 |         "type": "boolean", 388 |         "group": "Skip pipeline steps" 389 |     }, 390 |     { 391 |         "name": "skipQualimap", 392 |         "label": "Skip Qualimap step", 393 |         "render": "check-box", 394 |         "default_value": 
false, 395 |         "type": "boolean", 396 |         "group": "Skip pipeline steps" 397 |     }, 398 |     { 399 |         "name": "skipRseQC", 400 |         "label": "Skip RSeQC steps, apart from gene body coverage", 401 |         "render": "check-box", 402 |         "default_value": false, 403 |         "type": "boolean", 404 |         "group": "Skip pipeline steps" 405 |     }, 406 |     { 407 |         "name": "skipEdgeR", 408 |         "label": "Skip edgeR QC analysis", 409 |         "render": "check-box", 410 |         "default_value": false, 411 |         "type": "boolean", 412 |         "group": "Skip pipeline steps" 413 |     }, 414 |     { 415 |         "name": "skipMultiQC", 416 |         "label": "Skip MultiQC", 417 |         "render": "check-box", 418 |         "default_value": false, 419 |         "type": "boolean", 420 |         "group": "Skip pipeline steps" 421 |     }, 422 |     { 423 |         "name": "sampleLevel", 424 |         "label": "sampleLevel", 425 |         "usage": "Turn off project-level analysis (edgeR MDS plot and heatmap).", 426 |         "group": "Pipeline defaults", 427 |         "render": "check-box", 428 |         "default_value": false, 429 |         "type": "boolean" 430 |     }, 431 |     { 432 |         "name": "outdir", 433 |         "label": "Output directory", 434 |         "usage": "Set where to save the results from the pipeline", 435 |         "group": "Main options", 436 |         "default_value": "./results", 437 |         "render": "textfield", 438 |         "pattern": ".*", 439 |         "type": "string" 440 |     }, 441 |     { 442 |         "name": "email", 443 |         "label": "Your email address", 444 |         "usage": "Your email address, required to receive a completion notification.", 445 |         "group": "Pipeline defaults", 446 |         "render": "textfield", 447 |         "pattern": "^$|(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$)", 448 |         "type": "string", 449 |         "default_value": "" 450 |     }, 451 |     { 452 |         "name": "max_multiqc_email_size", 453 |         "label": "Maximum MultiQC email file size", 454 |         "usage": "Threshold size for the MultiQC report to be attached in the notification email. 
If the file generated by the pipeline exceeds the threshold, it will not be attached.", 455 |         "group": "Pipeline defaults", 456 |         "default_value": "25.MB", 457 |         "render": "textfield", 458 |         "pattern": "\\d+\\.[KMGT]?B", 459 |         "type": "string" 460 |     }, 461 |     { 462 |         "name": "name", 463 |         "label": "Custom run name", 464 |         "usage": "Helper variable. Do not set, use -name instead.", 465 |         "group": "Advanced", 466 |         "render": "none", 467 |         "pattern": ".*", 468 |         "type": "string", 469 |         "default_value": "" 470 |     }, 471 |     { 472 |         "name": "awsregion", 473 |         "label": "AWS Region", 474 |         "usage": "The AWS region to run your job in.", 475 |         "group": "AWS cloud usage", 476 |         "default_value": "eu-west-1", 477 |         "render": "textfield", 478 |         "pattern": ".*", 479 |         "type": "string" 480 |     }, 481 |     { 482 |         "name": "awsqueue", 483 |         "label": "AWS job queue", 484 |         "usage": "The JobQueue that you intend to use on AWS Batch.", 485 |         "group": "AWS cloud usage", 486 |         "render": "textfield", 487 |         "pattern": ".*", 488 |         "type": "string", 489 |         "default_value": "" 490 |     }, 491 |     { 492 |         "name": "hisat_build_memory", 493 |         "label": "HISAT2 indexing: required memory for splice sites in GB", 494 |         "usage": "HISAT2 needs a very large amount of memory to build an index with splice sites. 
If the available memory is below this threshold, the index build will proceed without splicing information.", 495 |         "group": "Advanced", 496 |         "default_value": 200, 497 |         "render": "textfield", 498 |         "type": "integer" 499 |     }, 500 |     { 501 |         "name": "star_memory", 502 |         "label": "STAR memory", 503 |         "usage": "Instead of using the default amount available, force STAR to use a given amount of memory", 504 |         "group": "Advanced", 505 |         "render": "textfield", 506 |         "pattern": "^$|\\d+\\.[KMGT]?B", 507 |         "type": "string", 508 |         "default_value": "" 509 |     }, 510 |     { 511 |         "name": "multiqc_config", 512 |         "label": "MultiQC Config", 513 |         "usage": "Path to a config file for MultiQC", 514 |         "group": "Advanced", 515 |         "default_value": "/Users/ewels/GitHub/nf-core/rnaseq/assets/multiqc_config.yaml", 516 |         "render": "file", 517 |         "pattern": ".*\\.yaml", 518 |         "type": "string" 519 |     }, 520 |     { 521 |         "name": "project", 522 |         "label": "Cluster project", 523 |         "usage": "For use on HPC systems where a project ID is required for job submission", 524 |         "group": "Cluster job submission", 525 |         "render": "textfield", 526 |         "pattern": ".*", 527 |         "type": "string", 528 |         "default_value": "" 529 |     }, 530 |     { 531 |         "name": "igenomes_base", 532 |         "label": "iGenomes base path", 533 |         "usage": "Base path for iGenomes reference files", 534 |         "group": "Alignment", 535 |         "default_value": "s3://ngi-igenomes/igenomes/", 536 |         "render": "textfield", 537 |         "pattern": ".*", 538 |         "type": "string" 539 |     }, 540 |     { 541 |         "name": "container", 542 |         "label": "Software container", 543 |         "usage": "Docker Hub address for the pipeline container", 544 |         "default_value": "nfcore/rnaseq:latest", 545 |         "render": "textfield", 546 |         "pattern": ".*", 547 |         "type": "string", 548 |         "group": "Pipeline defaults" 549 |     }, 550 |     { 551 |         "name": "plaintext_email", 552 |         "label": "Plain text email", 553 |         "usage": "Set to receive plain-text emails instead of HTML-formatted ones.", 554 |         "group": "Pipeline defaults", 555 |         "render": 
"check-box", 556 | "default_value": false, 557 | "type": "boolean" 558 | }, 559 | { 560 | "name": "help", 561 | "label": "Help", 562 | "usage": "Specify to show the pipeline help text.", 563 | "group": "Pipeline defaults", 564 | "render": "none", 565 | "default_value": false, 566 | "type": "boolean" 567 | }, 568 | { 569 | "name": "max_cpus", 570 | "label": "Maximum available CPUs", 571 | "usage": "Use to set a top-limit for the default CPUs requirement for each process.", 572 | "group": "Pipeline defaults", 573 | "default_value": 16, 574 | "render": "textfield", 575 | "type": "integer" 576 | }, 577 | { 578 | "name": "max_time", 579 | "label": "Maximum available time", 580 | "usage": "Use to set a top-limit for the default time requirement for each process.", 581 | "group": "Pipeline defaults", 582 | "default_value": "10d", 583 | "render": "textfield", 584 | "pattern": "\\d+[smhd]", 585 | "type": "string" 586 | }, 587 | { 588 | "name": "max_memory", 589 | "label": "Maximum available memory", 590 | "usage": "Use to set a top-limit for the default memory requirement for each process.", 591 | "group": "Pipeline defaults", 592 | "default_value": "128.GB", 593 | "render": "textfield", 594 | "pattern": "\\d+\\.[KMGT]?B", 595 | "type": "string" 596 | }, 597 | { 598 | "name": "tracedir", 599 | "label": "Trace directory", 600 | "usage": "Set to where the pipeline trace should be saved. 
Set to a local path when running on AWS with results stored on S3.", 601 |         "group": "AWS cloud usage", 602 |         "default_value": "./results/pipeline_info", 603 |         "render": "textfield", 604 |         "pattern": ".*", 605 |         "type": "string" 606 |     }, 607 |     { 608 |         "name": "readPaths", 609 |         "label": "Read Paths", 610 |         "usage": "For use with Nextflow config files only", 611 |         "group": "Advanced", 612 |         "render": "none", 613 |         "pattern": ".*", 614 |         "type": "string", 615 |         "default_value": "" 616 |     } 617 | ] 618 | } 619 | 
--------------------------------------------------------------------------------
/conf/igenomes.config:
--------------------------------------------------------------------------------
  1 | /* 2 |  * ------------------------------------------------- 3 |  *  Nextflow config file for iGenomes paths 4 |  * ------------------------------------------------- 5 |  * Defines reference genomes, using iGenomes paths 6 |  * Can be used by any config that customises the base 7 |  * path using $params.igenomes_base / --igenomes_base 8 |  */ 9 | 10 | params { 11 |   // Illumina iGenomes reference file paths 12 |   genomes { 13 |     'GRCh37' { 14 |       fasta       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa" 15 |       bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa" 16 |       bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/" 17 |       star        = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/" 18 |       bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/" 19 |       gtf         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf" 20 |       bed12       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed" 21 |       mito_name   = "MT" 22 |       macs_gsize  = "2.7e9" 23 |       blacklist   = "${baseDir}/assets/blacklists/GRCh37-blacklist.bed" 24 |     } 25 |     'GRCh38' { 26 |       fasta       = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa" 
27 | bwa = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa" 28 | bowtie2 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/" 29 | star = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/" 30 | bismark = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Sequence/BismarkIndex/" 31 | gtf = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf" 32 | bed12 = "${params.igenomes_base}/Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed" 33 | mito_name = "chrM" 34 | macs_gsize = "2.7e9" 35 | blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed" 36 | } 37 | 'GRCm38' { 38 | fasta = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa" 39 | bwa = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BWAIndex/genome.fa" 40 | bowtie2 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/Bowtie2Index/" 41 | star = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/STARIndex/" 42 | bismark = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Sequence/BismarkIndex/" 43 | gtf = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf" 44 | bed12 = "${params.igenomes_base}/Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.bed" 45 | mito_name = "MT" 46 | macs_gsize = "1.87e9" 47 | blacklist = "${baseDir}/assets/blacklists/GRCm38-blacklist.bed" 48 | } 49 | 'TAIR10' { 50 | fasta = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome.fa" 51 | bwa = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BWAIndex/genome.fa" 52 | bowtie2 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/Bowtie2Index/" 53 | star = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/STARIndex/" 54 | bismark = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/BismarkIndex/" 55 | gtf = 
"${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.gtf" 56 | bed12 = "${params.igenomes_base}/Arabidopsis_thaliana/Ensembl/TAIR10/Annotation/Genes/genes.bed" 57 | mito_name = "Mt" 58 | } 59 | 'EB2' { 60 | fasta = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/WholeGenomeFasta/genome.fa" 61 | bwa = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BWAIndex/genome.fa" 62 | bowtie2 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/Bowtie2Index/" 63 | star = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/STARIndex/" 64 | bismark = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Sequence/BismarkIndex/" 65 | gtf = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.gtf" 66 | bed12 = "${params.igenomes_base}/Bacillus_subtilis_168/Ensembl/EB2/Annotation/Genes/genes.bed" 67 | } 68 | 'UMD3.1' { 69 | fasta = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/WholeGenomeFasta/genome.fa" 70 | bwa = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BWAIndex/genome.fa" 71 | bowtie2 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/Bowtie2Index/" 72 | star = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/STARIndex/" 73 | bismark = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Sequence/BismarkIndex/" 74 | gtf = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.gtf" 75 | bed12 = "${params.igenomes_base}/Bos_taurus/Ensembl/UMD3.1/Annotation/Genes/genes.bed" 76 | mito_name = "MT" 77 | } 78 | 'WBcel235' { 79 | fasta = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa" 80 | bwa = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BWAIndex/genome.fa" 81 | bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/Bowtie2Index/" 82 | star = 
"${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/STARIndex/" 83 | bismark = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/BismarkIndex/" 84 | gtf = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.gtf" 85 | bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/genes.bed" 86 | mito_name = "MtDNA" 87 | macs_gsize = "9e7" 88 | } 89 | 'CanFam3.1' { 90 | fasta = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/WholeGenomeFasta/genome.fa" 91 | bwa = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BWAIndex/genome.fa" 92 | bowtie2 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/Bowtie2Index/" 93 | star = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/STARIndex/" 94 | bismark = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Sequence/BismarkIndex/" 95 | gtf = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.gtf" 96 | bed12 = "${params.igenomes_base}/Canis_familiaris/Ensembl/CanFam3.1/Annotation/Genes/genes.bed" 97 | mito_name = "MT" 98 | } 99 | 'GRCz10' { 100 | fasta = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/WholeGenomeFasta/genome.fa" 101 | bwa = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BWAIndex/genome.fa" 102 | bowtie2 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/Bowtie2Index/" 103 | star = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/STARIndex/" 104 | bismark = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Sequence/BismarkIndex/" 105 | gtf = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.gtf" 106 | bed12 = "${params.igenomes_base}/Danio_rerio/Ensembl/GRCz10/Annotation/Genes/genes.bed" 107 | mito_name = "MT" 108 | } 109 | 'BDGP6' { 110 | fasta = 
"${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/WholeGenomeFasta/genome.fa" 111 | bwa = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BWAIndex/genome.fa" 112 | bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/Bowtie2Index/" 113 | star = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/STARIndex/" 114 | bismark = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Sequence/BismarkIndex/" 115 | gtf = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.gtf" 116 | bed12 = "${params.igenomes_base}/Drosophila_melanogaster/Ensembl/BDGP6/Annotation/Genes/genes.bed" 117 | mito_name = "M" 118 | macs_gsize = "1.2e8" 119 | } 120 | 'EquCab2' { 121 | fasta = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/WholeGenomeFasta/genome.fa" 122 | bwa = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BWAIndex/genome.fa" 123 | bowtie2 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/Bowtie2Index/" 124 | star = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/STARIndex/" 125 | bismark = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Sequence/BismarkIndex/" 126 | gtf = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.gtf" 127 | bed12 = "${params.igenomes_base}/Equus_caballus/Ensembl/EquCab2/Annotation/Genes/genes.bed" 128 | mito_name = "MT" 129 | } 130 | 'EB1' { 131 | fasta = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/WholeGenomeFasta/genome.fa" 132 | bwa = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BWAIndex/genome.fa" 133 | bowtie2 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/Bowtie2Index/" 134 | star = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/STARIndex/" 135 | bismark = 
"${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Sequence/BismarkIndex/" 136 | gtf = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.gtf" 137 | bed12 = "${params.igenomes_base}/Escherichia_coli_K_12_DH10B/Ensembl/EB1/Annotation/Genes/genes.bed" 138 | } 139 | 'Galgal4' { 140 | fasta = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/WholeGenomeFasta/genome.fa" 141 | bwa = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BWAIndex/genome.fa" 142 | bowtie2 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/Bowtie2Index/" 143 | star = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/STARIndex/" 144 | bismark = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Sequence/BismarkIndex/" 145 | gtf = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.gtf" 146 | bed12 = "${params.igenomes_base}/Gallus_gallus/Ensembl/Galgal4/Annotation/Genes/genes.bed" 147 | mito_name = "MT" 148 | } 149 | 'Gm01' { 150 | fasta = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/WholeGenomeFasta/genome.fa" 151 | bwa = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BWAIndex/genome.fa" 152 | bowtie2 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/Bowtie2Index/" 153 | star = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/STARIndex/" 154 | bismark = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Sequence/BismarkIndex/" 155 | gtf = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.gtf" 156 | bed12 = "${params.igenomes_base}/Glycine_max/Ensembl/Gm01/Annotation/Genes/genes.bed" 157 | } 158 | 'Mmul_1' { 159 | fasta = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/WholeGenomeFasta/genome.fa" 160 | bwa = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BWAIndex/genome.fa" 161 | bowtie2 = 
"${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/Bowtie2Index/" 162 | star = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/STARIndex/" 163 | bismark = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Sequence/BismarkIndex/" 164 | gtf = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.gtf" 165 | bed12 = "${params.igenomes_base}/Macaca_mulatta/Ensembl/Mmul_1/Annotation/Genes/genes.bed" 166 | mito_name = "MT" 167 | } 168 | 'IRGSP-1.0' { 169 | fasta = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/WholeGenomeFasta/genome.fa" 170 | bwa = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BWAIndex/genome.fa" 171 | bowtie2 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/Bowtie2Index/" 172 | star = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/STARIndex/" 173 | bismark = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Sequence/BismarkIndex/" 174 | gtf = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.gtf" 175 | bed12 = "${params.igenomes_base}/Oryza_sativa_japonica/Ensembl/IRGSP-1.0/Annotation/Genes/genes.bed" 176 | mito_name = "Mt" 177 | } 178 | 'CHIMP2.1.4' { 179 | fasta = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/WholeGenomeFasta/genome.fa" 180 | bwa = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BWAIndex/genome.fa" 181 | bowtie2 = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/Bowtie2Index/" 182 | star = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/STARIndex/" 183 | bismark = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Sequence/BismarkIndex/" 184 | gtf = "${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.gtf" 185 | bed12 = 
"${params.igenomes_base}/Pan_troglodytes/Ensembl/CHIMP2.1.4/Annotation/Genes/genes.bed" 186 | mito_name = "MT" 187 | } 188 | 'Rnor_6.0' { 189 | fasta = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/WholeGenomeFasta/genome.fa" 190 | bwa = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BWAIndex/genome.fa" 191 | bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/Bowtie2Index/" 192 | star = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/STARIndex/" 193 | bismark = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Sequence/BismarkIndex/" 194 | gtf = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.gtf" 195 | bed12 = "${params.igenomes_base}/Rattus_norvegicus/Ensembl/Rnor_6.0/Annotation/Genes/genes.bed" 196 | mito_name = "MT" 197 | } 198 | 'R64-1-1' { 199 | fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa" 200 | bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BWAIndex/genome.fa" 201 | bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/Bowtie2Index/" 202 | star = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/STARIndex/" 203 | bismark = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/BismarkIndex/" 204 | gtf = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf" 205 | bed12 = "${params.igenomes_base}/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.bed" 206 | mito_name = "MT" 207 | macs_gsize = "1.2e7" 208 | } 209 | 'EF2' { 210 | fasta = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/WholeGenomeFasta/genome.fa" 211 | bwa = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BWAIndex/genome.fa" 212 | bowtie2 = 
"${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/Bowtie2Index/" 213 | star = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/STARIndex/" 214 | bismark = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Sequence/BismarkIndex/" 215 | gtf = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.gtf" 216 | bed12 = "${params.igenomes_base}/Schizosaccharomyces_pombe/Ensembl/EF2/Annotation/Genes/genes.bed" 217 | mito_name = "MT" 218 | macs_gsize = "1.21e7" 219 | } 220 | 'Sbi1' { 221 | fasta = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/WholeGenomeFasta/genome.fa" 222 | bwa = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BWAIndex/genome.fa" 223 | bowtie2 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/Bowtie2Index/" 224 | star = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/STARIndex/" 225 | bismark = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Sequence/BismarkIndex/" 226 | gtf = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.gtf" 227 | bed12 = "${params.igenomes_base}/Sorghum_bicolor/Ensembl/Sbi1/Annotation/Genes/genes.bed" 228 | } 229 | 'Sscrofa10.2' { 230 | fasta = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/WholeGenomeFasta/genome.fa" 231 | bwa = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BWAIndex/genome.fa" 232 | bowtie2 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/Bowtie2Index/" 233 | star = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/STARIndex/" 234 | bismark = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Sequence/BismarkIndex/" 235 | gtf = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.gtf" 236 | bed12 = "${params.igenomes_base}/Sus_scrofa/Ensembl/Sscrofa10.2/Annotation/Genes/genes.bed" 237 | mito_name = "MT" 238 | } 239 | 'AGPv3' { 240 | 
fasta = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/WholeGenomeFasta/genome.fa" 241 | bwa = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BWAIndex/genome.fa" 242 | bowtie2 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/Bowtie2Index/" 243 | star = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/STARIndex/" 244 | bismark = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Sequence/BismarkIndex/" 245 | gtf = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.gtf" 246 | bed12 = "${params.igenomes_base}/Zea_mays/Ensembl/AGPv3/Annotation/Genes/genes.bed" 247 | mito_name = "Mt" 248 | } 249 | 'hg38' { 250 | fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa" 251 | bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BWAIndex/genome.fa" 252 | bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/Bowtie2Index/" 253 | star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/STARIndex/" 254 | bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Sequence/BismarkIndex/" 255 | gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.gtf" 256 | bed12 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg38/Annotation/Genes/genes.bed" 257 | mito_name = "chrM" 258 | macs_gsize = "2.7e9" 259 | blacklist = "${baseDir}/assets/blacklists/hg38-blacklist.bed" 260 | } 261 | 'hg19' { 262 | fasta = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa" 263 | bwa = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa" 264 | bowtie2 = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/" 265 | star = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/STARIndex/" 266 | bismark = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Sequence/BismarkIndex/" 267 | gtf = "${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf" 268 | bed12 = 
"${params.igenomes_base}/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.bed" 269 | mito_name = "chrM" 270 | macs_gsize = "2.7e9" 271 | blacklist = "${baseDir}/assets/blacklists/hg19-blacklist.bed" 272 | } 273 | 'mm10' { 274 | fasta = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa" 275 | bwa = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa" 276 | bowtie2 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/Bowtie2Index/" 277 | star = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/STARIndex/" 278 | bismark = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Sequence/BismarkIndex/" 279 | gtf = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.gtf" 280 | bed12 = "${params.igenomes_base}/Mus_musculus/UCSC/mm10/Annotation/Genes/genes.bed" 281 | mito_name = "chrM" 282 | macs_gsize = "1.87e9" 283 | blacklist = "${baseDir}/assets/blacklists/mm10-blacklist.bed" 284 | } 285 | 'bosTau8' { 286 | fasta = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/WholeGenomeFasta/genome.fa" 287 | bwa = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BWAIndex/genome.fa" 288 | bowtie2 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/Bowtie2Index/" 289 | star = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/STARIndex/" 290 | bismark = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Sequence/BismarkIndex/" 291 | gtf = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.gtf" 292 | bed12 = "${params.igenomes_base}/Bos_taurus/UCSC/bosTau8/Annotation/Genes/genes.bed" 293 | mito_name = "chrM" 294 | } 295 | 'ce10' { 296 | fasta = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta/genome.fa" 297 | bwa = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BWAIndex/genome.fa" 298 | bowtie2 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/Bowtie2Index/" 299 | star = 
"${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/STARIndex/" 300 | bismark = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Sequence/BismarkIndex/" 301 | gtf = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf" 302 | bed12 = "${params.igenomes_base}/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.bed" 303 | mito_name = "chrM" 304 | macs_gsize = "9e7" 305 | } 306 | 'canFam3' { 307 | fasta = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/WholeGenomeFasta/genome.fa" 308 | bwa = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BWAIndex/genome.fa" 309 | bowtie2 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/Bowtie2Index/" 310 | star = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/STARIndex/" 311 | bismark = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Sequence/BismarkIndex/" 312 | gtf = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.gtf" 313 | bed12 = "${params.igenomes_base}/Canis_familiaris/UCSC/canFam3/Annotation/Genes/genes.bed" 314 | mito_name = "chrM" 315 | } 316 | 'danRer10' { 317 | fasta = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/WholeGenomeFasta/genome.fa" 318 | bwa = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BWAIndex/genome.fa" 319 | bowtie2 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/Bowtie2Index/" 320 | star = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/STARIndex/" 321 | bismark = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Sequence/BismarkIndex/" 322 | gtf = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.gtf" 323 | bed12 = "${params.igenomes_base}/Danio_rerio/UCSC/danRer10/Annotation/Genes/genes.bed" 324 | mito_name = "chrM" 325 | } 326 | 'dm6' { 327 | fasta = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/WholeGenomeFasta/genome.fa" 328 | bwa = 
"${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BWAIndex/genome.fa" 329 | bowtie2 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/Bowtie2Index/" 330 | star = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/STARIndex/" 331 | bismark = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Sequence/BismarkIndex/" 332 | gtf = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.gtf" 333 | bed12 = "${params.igenomes_base}/Drosophila_melanogaster/UCSC/dm6/Annotation/Genes/genes.bed" 334 | mito_name = "chrM" 335 | macs_gsize = "1.2e8" 336 | } 337 | 'equCab2' { 338 | fasta = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/WholeGenomeFasta/genome.fa" 339 | bwa = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BWAIndex/genome.fa" 340 | bowtie2 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/Bowtie2Index/" 341 | star = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/STARIndex/" 342 | bismark = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Sequence/BismarkIndex/" 343 | gtf = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.gtf" 344 | bed12 = "${params.igenomes_base}/Equus_caballus/UCSC/equCab2/Annotation/Genes/genes.bed" 345 | mito_name = "chrM" 346 | } 347 | 'galGal4' { 348 | fasta = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/WholeGenomeFasta/genome.fa" 349 | bwa = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BWAIndex/genome.fa" 350 | bowtie2 = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/Bowtie2Index/" 351 | star = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/STARIndex/" 352 | bismark = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Sequence/BismarkIndex/" 353 | gtf = "${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.gtf" 354 | bed12 = 
"${params.igenomes_base}/Gallus_gallus/UCSC/galGal4/Annotation/Genes/genes.bed" 355 | mito_name = "chrM" 356 | } 357 | 'panTro4' { 358 | fasta = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/WholeGenomeFasta/genome.fa" 359 | bwa = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BWAIndex/genome.fa" 360 | bowtie2 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/Bowtie2Index/" 361 | star = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/STARIndex/" 362 | bismark = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Sequence/BismarkIndex/" 363 | gtf = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.gtf" 364 | bed12 = "${params.igenomes_base}/Pan_troglodytes/UCSC/panTro4/Annotation/Genes/genes.bed" 365 | mito_name = "chrM" 366 | } 367 | 'rn6' { 368 | fasta = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/WholeGenomeFasta/genome.fa" 369 | bwa = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BWAIndex/genome.fa" 370 | bowtie2 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/Bowtie2Index/" 371 | star = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/STARIndex/" 372 | bismark = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Sequence/BismarkIndex/" 373 | gtf = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.gtf" 374 | bed12 = "${params.igenomes_base}/Rattus_norvegicus/UCSC/rn6/Annotation/Genes/genes.bed" 375 | mito_name = "chrM" 376 | } 377 | 'sacCer3' { 378 | fasta = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/WholeGenomeFasta/genome.fa" 379 | bwa = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa" 380 | bowtie2 = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/Bowtie2Index/" 381 | star = "${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/STARIndex/" 382 | bismark = 
"${params.igenomes_base}/Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BismarkIndex/" 383 | mito_name = "chrM" 384 | macs_gsize = "1.2e7" 385 | } 386 | 'susScr3' { 387 | fasta = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/WholeGenomeFasta/genome.fa" 388 | bwa = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BWAIndex/genome.fa" 389 | bowtie2 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/Bowtie2Index/" 390 | star = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/STARIndex/" 391 | bismark = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Sequence/BismarkIndex/" 392 | gtf = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.gtf" 393 | bed12 = "${params.igenomes_base}/Sus_scrofa/UCSC/susScr3/Annotation/Genes/genes.bed" 394 | mito_name = "chrM" 395 | } 396 | } 397 | } 398 | -------------------------------------------------------------------------------- /docs/usage.md: -------------------------------------------------------------------------------- 1 | # nf-core/rnaseq: Usage 2 | 3 | ## Table of contents 4 | 5 | 6 | 7 | * [Table of contents](#table-of-contents) 8 | * [Introduction](#introduction) 9 | * [Running the pipeline](#running-the-pipeline) 10 | * [Updating the pipeline](#updating-the-pipeline) 11 | * [Reproducibility](#reproducibility) 12 | * [Main arguments](#main-arguments) 13 | * [`-profile`](#-profile) 14 | * [`--reads`](#--reads) 15 | * [`--singleEnd`](#--singleend) 16 | * [Library strandedness](#library-strandedness) 17 | * [FeatureCounts Extra Gene Names](#featurecounts-extra-gene-names) 18 | * [Default "`gene_name`" Attribute Type](#default-attribute-type) 19 | * [Extra Gene Names or IDs](#extra-gene-names-or-ids) 20 | * [Default "`exon`" Attribute](#default-exon-type) 21 | * [Transcriptome mapping with Salmon](#transcriptome-mapping-with-salmon) 22 | * [Alignment tool](#alignment-tool) 23 | * [Reference genomes](#reference-genomes) 24 | * [`--genome` (using 
iGenomes)](#--genome-using-igenomes) 25 | * [`--star_index`, `--hisat2_index`, `--fasta`, `--gtf`, `--bed12`](#--star_index---hisat2_index---fasta---gtf---bed12) 26 | * [`--saveReference`](#--savereference) 27 | * [`--saveTrimmed`](#--savetrimmed) 28 | * [`--saveAlignedIntermediates`](#--savealignedintermediates) 29 | * [`--gencode`](#--gencode) 30 | * ["Type" of gene](#type-of-gene) 31 | * [Transcript IDs in FASTA files](#transcript-ids-in-fasta-files) 32 | * [`--skipAlignment`](#--skipalignment) 33 | * [`--compressedReference`](#--compressedreference) 34 | * [Create compressed (tar.gz) STAR indices](#create-compressed-tar-gz-star-indices) 35 | * [Create compressed (tar.gz) HiSat2 indices](#create-compressed-tar-gz-hisat2-indices) 36 | * [Create compressed (tar.gz) Salmon indices](#create-compressed-tar-gz-salmon-indices) 37 | * [Adapter Trimming](#adapter-trimming) 38 | * [`--clip_r1 [int]`](#--clip_r1-int) 39 | * [`--clip_r2 [int]`](#--clip_r2-int) 40 | * [`--three_prime_clip_r1 [int]`](#--three_prime_clip_r1-int) 41 | * [`--three_prime_clip_r2 [int]`](#--three_prime_clip_r2-int) 42 | * [`--trim_nextseq [int]`](#--trim_nextseq-int) 43 | * [`--skipTrimming`](#--skiptrimming) 44 | * [Ribosomal RNA removal](#ribosomal-rna-removal) 45 | * [`--removeRiboRNA`](#--removeriborna) 46 | * [`--save_nonrRNA_reads`](#--save_nonrrna_reads) 47 | * [`--rRNA_database_manifest`](#--rrna_database_manifest) 48 | * [Library Prep Presets](#library-prep-presets) 49 | * [`--pico`](#--pico) 50 | * [Skipping QC steps](#skipping-qc-steps) 51 | * [Job resources](#job-resources) 52 | * [Automatic resubmission](#automatic-resubmission) 53 | * [Custom resource requests](#custom-resource-requests) 54 | * [AWS Batch specific parameters](#aws-batch-specific-parameters) 55 | * [`--awsqueue`](#--awsqueue) 56 | * [`--awsregion`](#--awsregion) 57 | * [Other command line parameters](#other-command-line-parameters) 58 | * [`--outdir`](#--outdir) 59 | * [`--email`](#--email) 60 | * 
[`--email_on_fail`](#--email_on_fail) 61 | * [`-name`](#-name) 62 | * [`-resume`](#-resume) 63 | * [`-c`](#-c) 64 | * [`--custom_config_version`](#--custom_config_version) 65 | * [`--custom_config_base`](#--custom_config_base) 66 | * [`--max_memory`](#--max_memory) 67 | * [`--max_time`](#--max_time) 68 | * [`--max_cpus`](#--max_cpus) 69 | * [`--hisat_build_memory`](#--hisat_build_memory) 70 | * [`--sampleLevel`](#--samplelevel) 71 | * [`--plaintext_email`](#--plaintext_email) 72 | * [`--monochrome_logs`](#--monochrome_logs) 73 | * [`--multiqc_config`](#--multiqc_config) 74 | * [Stand-alone scripts](#stand-alone-scripts) 75 | 76 | 77 | ## Introduction 78 | 79 | Nextflow handles job submissions on SLURM or other environments and supervises the running jobs, so the Nextflow process must keep running until the pipeline is finished. We recommend running the process in the background through `screen` / `tmux` or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler. 80 | 81 | It is a good idea to limit the memory used by the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in `~/.bashrc` or `~/.bash_profile`): 82 | 83 | ```bash 84 | NXF_OPTS='-Xms1g -Xmx4g' 85 | ``` 86 | 87 | ## Running the pipeline 88 | 89 | The typical command for running the pipeline is as follows: 90 | 91 | ```bash 92 | nextflow run nf-core/rnaseq --reads '*_R{1,2}.fastq.gz' --genome GRCh37 -profile docker 93 | ``` 94 | 95 | This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. 96 | 97 | Note that the pipeline will create the following files in your working directory: 98 | 99 | ```bash 100 | work # Directory containing the nextflow working files 101 | results # Finished results (configurable, see below) 102 | .nextflow.log # Log file from Nextflow 103 | # Other Nextflow hidden files, e.g. history of pipeline runs and old logs. 
104 | ``` 105 | 106 | ### Updating the pipeline 107 | 108 | When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. Subsequent runs will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, update the cached version regularly: 109 | 110 | ```bash 111 | nextflow pull nf-core/rnaseq 112 | ``` 113 | 114 | ### Reproducibility 115 | 116 | It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. 117 | 118 | First, go to the [nf-core/rnaseq releases page](https://github.com/nf-core/rnaseq/releases) and find the latest version number - numeric only (e.g. `1.3.1`). Then specify this when running the pipeline with `-r` (one hyphen) - e.g. `-r 1.3.1`. 119 | 120 | This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. 121 | 122 | ## Main arguments 123 | 124 | ### `-profile` 125 | 126 | Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded in a comma-separated list, for example: `-profile test,docker` - the order of arguments is important! 127 | 128 | If `-profile` is not specified at all, the pipeline will run locally and expect all software to be installed and available on the `PATH`. 129 | 130 | * `awsbatch` 131 | * A generic configuration profile to be used with AWS Batch. 
132 | * `conda` 133 | * A generic configuration profile to be used with [conda](https://conda.io/docs/) 134 | * Pulls most software from [Bioconda](https://bioconda.github.io/) 135 | * `docker` 136 | * A generic configuration profile to be used with [Docker](http://docker.com/) 137 | * Pulls software from Docker Hub: [`nfcore/rnaseq`](http://hub.docker.com/r/nfcore/rnaseq/) 138 | * `singularity` 139 | * A generic configuration profile to be used with [Singularity](http://singularity.lbl.gov/) 140 | * Pulls software from Docker Hub: [`nfcore/rnaseq`](http://hub.docker.com/r/nfcore/rnaseq/) 141 | * `test` 142 | * A profile with a complete configuration for automated testing 143 | * Includes links to test data so needs no other parameters 144 | 145 | ### `--reads` 146 | 147 | Use this to specify the location of your input FastQ files. For example: 148 | 149 | ```bash 150 | --reads 'path/to/data/sample_*_{1,2}.fastq' 151 | ``` 152 | 153 | Please note the following requirements: 154 | 155 | 1. The path must be enclosed in quotes 156 | 2. The path must have at least one `*` wildcard character 157 | 3. When using the pipeline with paired-end data, the path must use `{1,2}` notation to specify read pairs. 158 | 159 | If left unspecified, a default pattern is used: `data/*{1,2}.fastq.gz` 160 | 161 | ### `--singleEnd` 162 | 163 | By default, the pipeline expects paired-end data. If you have single-end data, you need to specify `--singleEnd` on the command line when you launch the pipeline. A normal glob pattern, enclosed in quotation marks, can then be used for `--reads`. For example: 164 | 165 | ```bash 166 | --singleEnd --reads '*.fastq' 167 | ``` 168 | 169 | It is not possible to run a mixture of single-end and paired-end files in one run.
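The `{1,2}` pairing in `--reads` follows standard shell glob/brace semantics. A small sketch of which files a pattern like `'*_R{1,2}.fastq.gz'` picks up (all sample names below are invented, and the demo assumes a bash shell):

```bash
# Toy demo of the --reads glob pattern (file names are made up).
mkdir -p reads_demo && cd reads_demo
touch sampleA_R1.fastq.gz sampleA_R2.fastq.gz \
      sampleB_R1.fastq.gz sampleB_R2.fastq.gz
# The brace expands to *_R1.fastq.gz and *_R2.fastq.gz; Nextflow groups the
# matching files into one read pair per sample.
ls *_R{1,2}.fastq.gz
```

Here Nextflow would form two read pairs, one for `sampleA` and one for `sampleB`.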
170 | 171 | ### Library strandedness 172 | 173 | Three command line flags / config parameters set the library strandedness for a run: 174 | 175 | * `--forwardStranded` 176 | * `--reverseStranded` 177 | * `--unStranded` 178 | 179 | If none is set, the pipeline will be run as unstranded. Specifying `--pico` makes the pipeline run in `forwardStranded` mode. 180 | 181 | You can set a default in a custom Nextflow configuration file such as one saved in `~/.nextflow/config` (see the [Nextflow docs](https://www.nextflow.io/docs/latest/config.html) for more). For example: 182 | 183 | ```nextflow 184 | params { 185 | reverseStranded = true 186 | } 187 | ``` 188 | 189 | If you have a default strandedness set in your personal config file, you can use `--unStranded` to override it for a given run. 190 | 191 | These flags affect the commands used for several steps in the pipeline - namely HISAT2, featureCounts, RSeQC (`RPKM_saturation.py`), Qualimap and StringTie: 192 | 193 | * `--forwardStranded` 194 | * HISAT2: `--rna-strandness F` / `--rna-strandness FR` 195 | * featureCounts: `-s 1` 196 | * RSeQC: `-d ++,--` / `-d 1++,1--,2+-,2-+` 197 | * Qualimap: `-pe strand-specific-forward` 198 | * StringTie: `--fr` 199 | * `--reverseStranded` 200 | * HISAT2: `--rna-strandness R` / `--rna-strandness RF` 201 | * featureCounts: `-s 2` 202 | * RSeQC: `-d +-,-+` / `-d 1+-,1-+,2++,2--` 203 | * Qualimap: `-pe strand-specific-reverse` 204 | * StringTie: `--rf` 205 | 206 | ## FeatureCounts Extra Gene Names 207 | 208 | ### Default "`gene_name`" Attribute Type 209 | 210 | By default, the pipeline uses `gene_name` as the gene identifier group. If you need to adjust this, use the option `--fc_group_features` to select a different attribute present in your provided GTF file. Please also take care to use a suitable attribute to categorise the `biotype` of the selected features in your GTF, using the option `--fc_group_features_type` (default: `gene_biotype`).
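If you are unsure which attribute names your GTF actually provides for `--fc_group_features` / `--fc_group_features_type`, a quick one-liner lists them. The one-record `genes.gtf` created below is a made-up stand-in; point the final command at your real annotation:

```bash
# Create a toy one-record GTF (all values are purely illustrative).
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000000001"; gene_name "EXAMPLE1"; gene_biotype "protein_coding";\n' > genes.gtf
# List the attribute keys: take column 9, split on ';', keep the key names.
grep -v '^#' genes.gtf | cut -f9 | tr ';' '\n' | awk 'NF {print $1}' | sort -u
```

For the toy record this prints `gene_biotype`, `gene_id` and `gene_name`, i.e. the candidate values for the two options.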
211 | 212 | ### Extra Gene Names or IDs 213 | 214 | By default, the pipeline uses `gene_name` as an additional gene identifier apart from ENSEMBL identifiers in the pipeline. 215 | This behaviour can be modified by specifying `--fc_extra_attributes` when running the pipeline, which is passed on to featureCounts as its `--extraAttributes` parameter. 216 | See the user guide of the [Subread package](http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf) for details. 217 | Note that you can also specify more than one desired value, separated by a comma: 218 | `--fc_extra_attributes gene_id,...` 219 | 220 | ### Default "`exon`" Type 221 | 222 | By default, the pipeline uses `exon` as the feature type used to assign reads. If you need to adjust this, use the option `--fc_count_type` to select a different category present in your provided GTF file (3rd column). For example, for nuclear RNA-seq, one could count reads in introns in addition to exons using `--fc_count_type transcript`. 223 | 224 | ## Transcriptome mapping with Salmon 225 | 226 | Use the `--pseudo_aligner salmon` option to perform additional quantification at the transcript and gene level using [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html). This is run in addition to either STAR or HiSat2 and cannot be run in isolation, mainly because it allows you to obtain QC metrics with respect to the genomic alignments. By default, the pipeline will use the genome fasta and gtf file to generate the transcript fasta file, and then build the Salmon index. You can override these parameters using `--transcript_fasta` and `--salmon_index`, respectively. 227 | 228 | The default Salmon parameters and a k-mer size of 31 are used to create the index. As [discussed here](https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-mode), a k-mer size of 31 works well with reads that are 75bp or longer.
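As a quick sanity check before relying on the default index, you can compute the mean read length directly from a FASTQ file. The single-read `sample.fastq` below is generated on the fly as a stand-in for real data (which would usually be gzipped):

```bash
# Make a toy single-record FASTQ with a 100 bp read (stand-in for real data).
seq=$(head -c 100 /dev/zero | tr '\0' 'A')
qual=$(head -c 100 /dev/zero | tr '\0' 'I')
printf '@read1\n%s\n+\n%s\n' "$seq" "$qual" > sample.fastq
# Mean read length: sequence lines are every 4th line, starting at line 2.
awk 'NR % 4 == 2 { sum += length($0); n++ } END { print int(sum / n) }' sample.fastq
```

If the reported length is well below 75 bp, a smaller k-mer size (via a custom `--salmon_index`) may be worth considering.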
229 | 230 | ## Alignment tool 231 | 232 | By default, the pipeline uses [STAR](https://github.com/alexdobin/STAR) to align the raw FastQ reads to the reference genome. STAR is fast and widely used, but requires a lot of memory to run, typically around 38GB for the human GRCh37 reference genome. 233 | 234 | If you prefer, you can use [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) as the alignment tool instead. Developed by the same group behind the popular TopHat aligner, HISAT2 has a much smaller memory footprint. 235 | 236 | To use HISAT2, use the parameter `--aligner hisat2` or set `params.aligner = 'hisat2'` in your config file. Alternatively, you can use `--skipAlignment --pseudo_aligner salmon` if you just want to perform a fast mapping to the transcriptome with Salmon (you will also have to supply either `--transcript_fasta` or both a `--fasta` and `--gtf`/`--gff`). 237 | 238 | ## Reference genomes 239 | 240 | The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource. 241 | 242 | ### `--genome` (using iGenomes) 243 | 244 | There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the `--genome` flag. 245 | 246 | You can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config). Common genomes that are supported are: 247 | 248 | * Human 249 | * `--genome GRCh37` 250 | * Mouse 251 | * `--genome GRCm38` 252 | * _Drosophila_ 253 | * `--genome BDGP6` 254 | * _S. cerevisiae_ 255 | * `--genome 'R64-1-1'` 256 | 257 | > There are numerous others - check the config file for more. 258 | 259 | Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. 
See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to save such a file. 260 | 261 | The syntax for this reference configuration is as follows: 262 | 263 | ```nextflow 264 | params { 265 | genomes { 266 | 'GRCh37' { 267 | star = '' 268 | fasta = '' // Used if no star index given 269 | gtf = '' 270 | bed12 = '' // Generated from GTF if not given 271 | } 272 | // Any number of additional genomes, key is used with --genome 273 | } 274 | } 275 | ``` 276 | 277 | ### `--star_index`, `--hisat2_index`, `--fasta`, `--gtf`, `--bed12` 278 | 279 | If you prefer, you can specify the full path to your reference genome when you run the pipeline: 280 | 281 | ```bash 282 | --star_index '/path/to/STAR/index' \ 283 | --hisat2_index '/path/to/HISAT2/index' \ 284 | --fasta '/path/to/reference.fasta' \ 285 | --gtf '/path/to/gene_annotation.gtf' \ 286 | --gff '/path/to/gene_annotation.gff' \ 287 | --bed12 '/path/to/gene_annotation.bed' 288 | ``` 289 | 290 | Note that only one of `--star_index` / `--hisat2_index` is needed, depending on which aligner you are using (see below). 291 | 292 | The minimum requirements are a FASTA file and a GTF file. Note that `--gff` and `--bed12` are auto-derived from the `--gtf` where needed and are not required. If only the FASTA and GTF are provided, all other reference files will be generated automatically by the pipeline. If you specify a `--gff` file, it will be converted to GTF format automatically by the pipeline. If you specify both, the GTF is preferred over the GFF by the pipeline. 293 | 294 | ### `--saveReference` 295 | 296 | Supply this parameter to save any generated reference genome files to your results folder. 297 | These can then be used for future pipeline runs, reducing processing times. 298 | 299 | ### `--saveTrimmed` 300 | 301 | By default, trimmed FastQ files will not be saved to the results directory. 
Specify this 302 | flag (or set to true in your config file) to copy these files when complete. 303 | 304 | ### `--saveUnaligned` 305 | 306 | By default, the pipeline doesn't save unaligned/unmapped reads to a separate file. With this option, STAR / HISAT2 and Salmon will write the reads that were not aligned (as a separate BAM file or a list of reads) to a separate output directory. 307 | 308 | ### `--saveAlignedIntermediates` 309 | 310 | As above, by default intermediate BAM files from the alignment will not be saved. The final BAM files created after the Picard MarkDuplicates step are always saved. Set to true to also copy out BAM files from the STAR / HISAT2 and sorting steps. 311 | 312 | ### `--gencode` 313 | 314 | If your `--gtf` file is in GENCODE format and you would like to run Salmon (`--pseudo_aligner salmon`), you will need to provide this parameter in order to build the Salmon index appropriately. The parameter `fc_group_features_type` will also be set to `gene_type`, as explained below. 315 | 316 | [GENCODE](https://www.gencodegenes.org) gene annotations are slightly different from ENSEMBL or iGenomes annotations in two ways. 317 | 318 | #### "Type" of gene 319 | 320 | The `gene_biotype` field which is typically found in Ensembl GTF files contains a keyword description of the type of gene, e.g. `protein_coding`, `lincRNA`, `rRNA`. In GENCODE GTF files this field has been renamed to `gene_type`. 321 | 322 | ENSEMBL version: 323 | 324 | ```bash 325 | 8 havana transcript 70635318 70669174 . - . gene_id "ENSG00000147592"; gene_version "9"; transcript_id "ENST00000522447"; transcript_version "5"; gene_name "LACTB2"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "LACTB2-203"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS6208"; tag "basic"; transcript_support_level "2"; 326 | ``` 327 | 328 | GENCODE version: 329 | 330 | ```bash 331 | chr8 HAVANA transcript 70635318 70669174 . - . 
gene_id "ENSG00000147592.9"; transcript_id "ENST00000522447.5"; gene_type "protein_coding"; gene_name "LACTB2"; transcript_type "protein_coding"; transcript_name "LACTB2-203"; level 2; protein_id "ENSP00000428801.1"; transcript_support_level "2"; tag "alternative_3_UTR"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6208.1"; havana_gene "OTTHUMG00000164430.2"; havana_transcript "OTTHUMT00000378747.1"; 332 | ``` 333 | 334 | Therefore, for `featureCounts` to correctly count the different biotypes when using a GENCODE annotation, `fc_group_features_type` is automatically set to `gene_type` when the `--gencode` flag is specified. 335 | 336 | #### Transcript IDs in FASTA files 337 | 338 | The transcript IDs in GENCODE fasta files are separated by vertical pipes (`|`) rather than spaces. 339 | 340 | ENSEMBL version: 341 | 342 | ```bash 343 | >ENST00000522447.5 cds chromosome:GRCh38:8:70635318:70669174:-1 gene:ENSG00000147592.9 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:LACTB2 description:lactamase beta 2 [Source:HGNC Symbol;Acc:HGNC:18512] 344 | ``` 345 | 346 | GENCODE version: 347 | 348 | ```bash 349 | >ENST00000522447.5|ENSG00000147592.9|OTTHUMG00000164430.2|OTTHUMT00000378747.1|LACTB2-203|LACTB2|1034|protein_coding| 350 | ``` 351 | 352 | This [issue](https://github.com/COMBINE-lab/salmon/issues/15) can be overcome by specifying the `--gencode` flag when building the Salmon index. 353 | 354 | ### `--skipBiotypeQC` 355 | 356 | This skips the biotype QC step in the `featureCounts` process, which is particularly useful when the available GTF/GFF has no `biotype` (or similar) attribute that could be used. 357 | 358 | ### `--skipAlignment` 359 | 360 | By default, the pipeline aligns the input reads to the genome using either HISAT2 or STAR and counts gene expression using featureCounts. If you prefer to skip alignment altogether and only get transcript/gene expression counts with pseudo-alignment, use this flag. 
Note that you will also need to specify `--pseudo_aligner salmon`. If you have a custom transcriptome, supply that with `--transcript_fasta`. 361 | 362 | ### Compressed Reference File Input 363 | 364 | By default, the pipeline assumes that the reference genome files are all uncompressed, i.e. plain fasta or gtf files. If instead you intend to use compressed or gzipped references, e.g. as downloaded directly from Ensembl: 365 | 366 | ```bash 367 | nextflow run nf-core/rnaseq --reads 'data/{R1,R2}*.fastq.gz' \ 368 | --fasta ftp://ftp.ensembl.org/pub/release-97/fasta/microcebus_murinus/dna_index/Microcebus_murinus.Mmur_3.0.dna.toplevel.fa.gz \ 369 | --gtf ftp://ftp.ensembl.org/pub/release-97/gtf/microcebus_murinus/Microcebus_murinus.Mmur_3.0.97.gtf.gz 370 | ``` 371 | 372 | This assumes that ALL of the reference files are compressed, including the reference indices, e.g. for STAR, HiSat2 or Salmon. For instructions on how to create your own compressed reference files, see below. This also includes any files specified with `--additional_fasta`, which are assumed to be compressed as well when the `--fasta` file is compressed. The pipeline auto-detects `gz` input for reference files. Mixing of `gz` and non-compressed input is not possible!
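For locally stored references, the equivalent preparation is simply gzipping every reference file before launch, so that each input carries the `.gz` suffix the pipeline detects. A toy sketch (file names and contents below are placeholders):

```bash
# Toy FASTA and GTF (placeholder contents), compressed the way the pipeline
# expects when any compressed reference input is used.
printf '>chr1\nACGTACGT\n' > genome.fa
printf '1\tsrc\texon\t1\t8\t.\t+\t.\tgene_id "g1";\n' > genes.gtf
gzip -f genome.fa genes.gtf
# Both inputs now carry the .gz suffix.
ls genome.fa.gz genes.gtf.gz
```

These `.gz` files can then be passed to `--fasta` and `--gtf` directly.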
#### Create compressed (tar.gz) STAR indices

STAR indices can be created by running the pipeline with `--saveReference` and then compressing the output with `tar`:

```bash
cd results/reference_genome
tar -zcvf star.tar.gz star
```

#### Create compressed (tar.gz) HISAT2 indices

HISAT2 indices can be created by running the pipeline with `--saveReference` and then compressing the output with `tar`:

```bash
cd results/reference_genome
tar -zcvf hisat2.tar.gz *.hisat2_*
```

#### Create compressed (tar.gz) Salmon index

A Salmon index can be created by running the pipeline with `--saveReference` and then compressing the output with `tar`:

```bash
cd results/reference_genome
tar -zcvf salmon_index.tar.gz salmon_index
```

## Adapter Trimming

If specific additional trimming is required (for example, to remove additional tags), you can use any of the following command line parameters. These affect the command used to launch Trim Galore!.

### `--clip_r1 [int]`

Instructs Trim Galore to remove the given number of bp from the 5' end of read 1 (or single-end reads).

### `--clip_r2 [int]`

Instructs Trim Galore to remove the given number of bp from the 5' end of read 2 (paired-end reads only).

### `--three_prime_clip_r1 [int]`

Instructs Trim Galore to remove the given number of bp from the 3' end of read 1 _AFTER_ adapter/quality trimming has been performed.

### `--three_prime_clip_r2 [int]`

Instructs Trim Galore to remove the given number of bp from the 3' end of read 2 _AFTER_ adapter/quality trimming has been performed.

### `--trim_nextseq [int]`

This enables the `--nextseq-trim=3'CUTOFF` option in Cutadapt (via Trim Galore), which sets a quality cutoff (normally given with `-q`) but ignores the quality scores of G bases. This trimming is common to the NextSeq and NovaSeq platforms, where basecalls without any signal are reported as high-quality G bases.
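As an example, the clipping options above can be combined in a single run. The command below is a sketch only; the profile, read paths and genome are illustrative:

```bash
# Hypothetical run removing 3 bp from the 5' end of both reads of a
# paired-end library (profile, paths and genome are illustrative)
nextflow run nf-core/rnaseq -profile docker \
    --reads 'data/*_{R1,R2}.fastq.gz' \
    --genome GRCh38 \
    --clip_r1 3 \
    --clip_r2 3
```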
### `--skipTrimming`

Allows you to skip the trimming step, saving time when re-analysing data that has already been trimmed.

## Ribosomal RNA removal

If rRNA removal is desired (for example, for metatranscriptomics), add the following command line parameters.

### `--removeRiboRNA`

Instructs the pipeline to use SortMeRNA to remove reads related to ribosomal RNA (or any other patterns found in the sequences defined by `--rRNA_database_manifest`).

### `--saveNonRiboRNAReads`

By default, non-rRNA FastQ files will not be saved to the results directory. Specify this flag (or set it to true in your config file) to copy these files when complete.

### `--rRNA_database_manifest`

By default, the rRNA databases from GitHub at [`biocore/sortmerna/rRNA_databases`](https://github.com/biocore/sortmerna/tree/master/rRNA_databases) are used. Alternatively, you can provide the path to a text file containing paths to FASTA files (one per line, with no `'` or `"` around file names) that will be used to build the SortMeRNA databases instead of the defaults; reads similar to these sequences will then be removed. You can see an example in `assets/rrna-db-defaults.txt`.

## Library Prep Presets

Some command line options are available to automatically set parameters for common RNA-seq library preparation kits.

> Note that these presets override other command line arguments. So if you specify `--pico --clip_r1 0`, the `--clip_r1` option will be ignored.

If you have a kit that you'd like a preset added for, please let us know!

### `--pico`

Sets trimming and strandedness settings for the _SMARTer Stranded Total RNA-Seq Kit - Pico Input_ kit.
Equivalent to: `--forwardStranded` `--clip_r1 3` `--three_prime_clip_r2 3`

## Skipping QC steps

The pipeline contains a large number of quality control steps. Sometimes, it may not be desirable to run all of them if time and compute resources are limited. The following options make this easy:

* `--skipQC` - Skip **all QC steps**, apart from MultiQC
* `--skipFastQC` - Skip FastQC
* `--skipRseQC` - Skip RSeQC
* `--skipQualimap` - Skip Qualimap
* `--skipPreseq` - Skip Preseq
* `--skipDupRadar` - Skip dupRadar (and Picard MarkDuplicates)
* `--skipEdgeR` - Skip edgeR MDS plot and heatmap
* `--skipMultiQC` - Skip MultiQC

## Job resources

### Automatic resubmission

Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline is stopped.

### Custom resource requests

Wherever process-specific requirements are set in the pipeline, the default value can be changed by creating a custom config file. See the files hosted at [`nf-core/configs`](https://github.com/nf-core/configs/tree/master/conf) for examples.

If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see definition below).
You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, an associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and an amendment to [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.

If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack/).

## AWS Batch specific parameters

Running the pipeline on AWS Batch requires a couple of specific parameters to be set according to your AWS Batch configuration. Please use the `awsbatch` profile (`-profile awsbatch`) and then specify all of the following parameters.

### `--awsqueue`

The JobQueue that you intend to use on AWS Batch.

### `--awsregion`

The AWS region in which to run your job. Default is set to `eu-west-1` but can be adjusted to your needs.

Please make sure to also set the `-w/-work-dir` and `--outdir` parameters to an S3 storage bucket of your choice - you'll get an error message notifying you if you didn't.

## Other command line parameters

### `--outdir`

The output directory where the results will be saved.

### `--email`

Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.

### `--email_on_fail`

This works exactly as with `--email`, except emails are only sent if the workflow is not successful.

### `-name`

Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

This is used in the MultiQC report (if not default) and in the summary HTML / e-mail (always).
**NB:** Single hyphen (core Nextflow option)

### `-resume`

Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.

You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.

**NB:** Single hyphen (core Nextflow option)

### `-c`

Specify the path to a specific config file (this is a core Nextflow option).

**NB:** Single hyphen (core Nextflow option)

Note - you can use this to override pipeline defaults.

### `--custom_config_version`

Provide a git commit id for the custom institutional configs hosted at `nf-core/configs`. This was implemented for reproducibility purposes. Default is set to `master`.

```bash
## Download and use config files with the following git commit id
--custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96
```

### `--custom_config_base`

If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with the `--custom_config_base` option.
For example:

```bash
## Download and unzip the config files
cd /path/to/my/configs
wget https://github.com/nf-core/configs/archive/master.zip
unzip master.zip

## Run the pipeline
cd /path/to/my/data
nextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/
```

> Note that the nf-core/tools helper package has a `download` command that fetches all required pipeline files, Singularity containers and institutional configs in one go, to make this process easier.

### `--max_memory`

Use to set a top-limit for the default memory requirement for each process. Should be a string in the format integer-unit, e.g. `--max_memory '8.GB'`.

### `--max_time`

Use to set a top-limit for the default time requirement for each process. Should be a string in the format integer-unit, e.g. `--max_time '2.h'`.

### `--max_cpus`

Use to set a top-limit for the default CPU requirement for each process. Should be an integer, e.g. `--max_cpus 1`.

### `--hisat_build_memory`

Required amount of memory in GB to build the HISAT2 index with splice sites. The HISAT2 index build can proceed with or without exon / splice-junction information, but including it requires a very large amount of memory. If this much memory is not available, the index is built without splicing information. The `--hisat_build_memory` option changes this threshold; by default it is `200GB`. If your system's `--max_memory` is set to `128GB` but your genome is small enough to build within this, you can allow the exon-aware build to proceed by supplying e.g. `--hisat_build_memory 100GB`.

### `--sampleLevel`

Used to turn off the edgeR MDS plot and heatmap. Set automatically when running on fewer than 3 samples.
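The `--max_*` caps described above can be combined on a single command line. The run below is a sketch only; the profile, read paths and cap values are illustrative:

```bash
# Hypothetical run capping per-process resources (values are examples)
nextflow run nf-core/rnaseq -profile docker \
    --reads 'data/*_{R1,R2}.fastq.gz' \
    --max_memory '64.GB' \
    --max_cpus 16 \
    --max_time '24.h'
```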
### `--plaintext_email`

Set to receive plain-text e-mails instead of HTML-formatted ones.

### `--monochrome_logs`

Set to disable colourful command line output and live life in monochrome.

### `--multiqc_config`

Specify a path to a custom MultiQC configuration file.

## Stand-alone scripts

The `bin` directory contains some scripts used by the pipeline which may also be run manually:

* `gtf2bed`
  * Script used to generate the BED12 reference files used by RSeQC. Takes a `.gtf` file as input
* `dupRadar.r`
  * dupRadar script used in the _dupRadar_ pipeline process
* `edgeR_heatmap_MDS.r`
  * edgeR script used in the _Sample Correlation_ process