├── fixes ├── recount3_extra_exons_in_gene_rejoin.png ├── transformed_log_fixed_vs_inflated_g026.png └── README.md ├── get_ce_ref_indexes.sh ├── get_human_ref_indexes.sh ├── get_mouse_ref_indexes.sh ├── singularity ├── unify_job_generation.sh ├── pump_job_generation.sh ├── link_unifier_input_from_pump_output.sh ├── hpc_unify.sh ├── hpc_pump.sh ├── run_recount_unify.sh └── run_recount_pump.sh ├── scripts ├── compare_unifier_sums.R ├── compare_unifier_runs.sh └── compare_pump_runs.sh ├── LICENSE ├── link_unifier_input_from_pump_output.sh ├── feeder.py ├── CHANGELOG.md ├── get_unify_refs.sh ├── sra └── README.md ├── gdc └── README.md ├── fetch_sra_accessions_for_study.py ├── MULTIMAPPERS.md ├── docker └── run_recount_unify.sh ├── dbgap └── README.md └── README.md /fixes/recount3_extra_exons_in_gene_rejoin.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/langmead-lab/monorail-external/HEAD/fixes/recount3_extra_exons_in_gene_rejoin.png -------------------------------------------------------------------------------- /fixes/transformed_log_fixed_vs_inflated_g026.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/langmead-lab/monorail-external/HEAD/fixes/transformed_log_fixed_vs_inflated_g026.png -------------------------------------------------------------------------------- /get_ce_ref_indexes.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | mkdir -p ce10 3 | pushd ce10 4 | for f in star_idx salmon_index unmapped_hisat2_idx gtf fasta; do 5 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/ce10/${f}.tar.gz 6 | tar -zxvf ${f}.tar.gz 7 | done 8 | popd 9 | -------------------------------------------------------------------------------- /get_human_ref_indexes.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | mkdir -p hg38 3 | pushd hg38 4 | for f in star_idx salmon_index unmapped_hisat2_idx gtf fasta; do 5 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/hg38/${f}.tar.gz 6 | tar -zxvf ${f}.tar.gz 7 | done 8 | popd 9 | -------------------------------------------------------------------------------- /get_mouse_ref_indexes.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | mkdir -p grcm38 3 | pushd grcm38 4 | for f in star_idx salmon_index unmapped_hisat2_idx gtf fasta; do 5 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/grcm38/${f}.tar.gz 6 | tar -zxvf ${f}.tar.gz 7 | done 8 | popd 9 | -------------------------------------------------------------------------------- /singularity/unify_job_generation.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | dir=$(dirname $0) 4 | #study name/accession, e.g. ERP001942 5 | study=$1 6 | #e.g. /scratch/04620/cwilks/workshop/pump/./output 7 | export PUMP_OUTPUT_DIR=$2 8 | #e.g. 
/scratch/04620/cwilks/workshop 9 | export WORKING_DIR=$3 10 | 11 | cat $dir/tacc_unify.sh | sed 's#^WORKING_DIR=..$#WORKING_DIR='$WORKING_DIR'#' | sed 's#^STUDY=..$#STUDY='$study'#' | sed 's#^dir=..$#dir='$dir'#' | sed 's#^PUMP_OUTPUT_DIR=..$#PUMP_OUTPUT_DIR='$PUMP_OUTPUT_DIR'#' > tacc_unify.${study}.sh 12 | -------------------------------------------------------------------------------- /singularity/pump_job_generation.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | dir=$(dirname $0) 4 | #study name/accession, e.g. ERP001942 5 | study=$1 6 | #file with list of runs accessions to process from study 7 | runs_file=$2 8 | #e.g. /scratch/04620/cwilks/workshop 9 | export WORKING_DIR=$3 10 | 11 | for f in input output temp temp_big; do mkdir -p $WORKING_DIR/$f ; done 12 | 13 | cat $dir/tacc_pump.sh | sed 's#^study=..$#study='$study'#' | sed 's#^runs_file=..$#runs_file='$runs_file'#' | sed 's#^WORKING_DIR=..$#WORKING_DIR='$WORKING_DIR'#' | sed 's#^dir=..$#dir='$dir'#' > tacc_pump.${study}.sh 14 | -------------------------------------------------------------------------------- /scripts/compare_unifier_sums.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | library(argparser, quietly=TRUE) 3 | library(data.table) 4 | 5 | p <- arg_parser("compare two matrices with assumed same columns/values but in different orders") 6 | p <- add_argument(p, "--f1", help="first sums file to compare, use this one to get columns from the other") 7 | p <- add_argument(p, "--f2", help="second sums file to compare") 8 | argv <- parse_args(p) 9 | 10 | #f1=fread(file=argv$f1,data.table=FALSE) 11 | f1=fread(file=argv$f1,data.table=TRUE) 12 | n1=names(f1) 13 | f2=fread(file=argv$f2,data.table=TRUE) 14 | all.equal(f1,f2[,..n1]) 15 | -------------------------------------------------------------------------------- /singularity/link_unifier_input_from_pump_output.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -exo pipefail 3 | ##IMPORTANT: $pump_input_dir and $unifier_input_dir HAVE to be on the SAME filesystem (hardlinks) for this to work!! 4 | pump_input_dir=$1 5 | unifier_input_dir=$2 6 | find $pump_input_dir -name "*" -type f | fgrep -v done | fgrep -v "std.out" | fgrep -v "stats.json" | perl -ne 'chomp; $f=$_; @f=split(/\//,$f); $f2=pop(@f); @f2=split(/!/,$f2); $run=shift(@f2); $study=shift(@f2); $study=~/(..)$/; $lo1=$1; $run=~/(..)$/; $lo2=$1; $ff="'$unifier_input_dir'/$lo1/$study/$lo2/$run/$study"."_0_att0"; `mkdir -p $ff`; `ln -f $f $ff/$f2`;' 7 | #absolutely need the ".done" files 8 | find $unifier_input_dir -name "*att0" | perl -ne 'chomp; $f=$_; `touch $f.done`;' 9 | -------------------------------------------------------------------------------- /scripts/compare_unifier_runs.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -x 3 | dir=$(dirname $0) 4 | export LC_ALL=C 5 | #assumes current working directory is the unifier output of the study we want to test 6 | 7 | #path to previous unifier run on same study 8 | path=$1 9 | 10 | #cwd=$(pwd) 11 | for f in `find . 
-name '*.gz' | fgrep -v "run_files" | fgrep -v "genotypes"`; do 12 | if [[ $f == *"gene_sums"* || $f == *"exon_sums"* ]]; then 13 | #diff <(pcat $path/$f | tail -n+3) <(pcat $f | tail -n+3) > ${f}.diff 14 | Rscript $dir/compare_unifier_sums.R --f1 $path/$f --f2 $f 15 | continue 16 | fi 17 | if [[ $f == *"metadata"* ]]; then 18 | diff <(pcat $path/$f | sort) <(pcat $f | sort) > ${f}.diff 19 | continue 20 | fi 21 | diff <(pcat $path/$f) <(pcat $f) > ${f}.diff 22 | done 23 | find . -name "*.diff" -exec ls -l {} \; 24 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 langmead-lab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /link_unifier_input_from_pump_output.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -exo pipefail 3 | ##IMPORTANT: $pump_input_dir and $unifier_input_dir HAVE to be on the SAME filesystem (hardlinks) for this to work!! 4 | pump_input_dir=$1 5 | unifier_input_dir=$2 6 | #optional, path to file w/ list of runs to include (white listing, for subsetting studies/projects) 7 | #format: list of run accessions (e.g. 
SRR001234) 8 | whitelisted_runs=$3 9 | 10 | rm -f stream2 11 | mkfifo stream2 12 | 13 | if [[ -n $whitelisted_runs ]]; then 14 | rm -f stream1 15 | mkfifo stream1 16 | cat $whitelisted_runs | sed 's/$/!/' > stream1 & 17 | find $pump_input_dir -name "*" -type f | fgrep -v done | fgrep -v "std.out" | fgrep -v "stats.json" | fgrep -f stream1 > stream2 & 18 | else 19 | find $pump_input_dir -name "*" -type f | fgrep -v done | fgrep -v "std.out" | fgrep -v "stats.json" > stream2 & 20 | fi 21 | 22 | cat stream2 | perl -ne 'chomp; $f=$_; @f=split(/\//,$f); $f2=pop(@f); @f2=split(/!/,$f2); $run=shift(@f2); $study=shift(@f2); $study=~/(..)$/; $lo1=$1; $run=~/(..)$/; $lo2=$1; $ff="'$unifier_input_dir'/$lo1/$study/$lo2/$run/$study"."_0_att0"; `mkdir -p $ff`; `ln -f $f $ff/$f2`;' 23 | rm -f stream1 24 | rm -f stream2 25 | #absolutely need the ".done" files 26 | find $unifier_input_dir -name "*att0" | perl -ne 'chomp; $f=$_; `touch $f.done`;' 27 | -------------------------------------------------------------------------------- /feeder.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python2.7 2 | import sys 3 | import os 4 | import time 5 | import subprocess 6 | 7 | #sleep for 1 min. 8 | TIME_TO_SLEEP=60 9 | 10 | #total number of concurrent jobs (either running or pending) 11 | TOTAL=25 12 | JOB_TAG='skx-normal' 13 | STARTING_IDX=0 14 | #JOB_LIST_FILE='gtf.jobs.remaining' 15 | 16 | def count_queued(job_tag, user='cwilks'): 17 | return int(subprocess.check_output('squeue -u %s | grep %s | wc -l' % (user,job_tag),shell=True)) 18 | 19 | def check_capacity(limit, job_tag, user='cwilks'): 20 | queued = count_queued(job_tag, user=user) 21 | return limit - queued 22 | 23 | def submit_job(job_str): 24 | subprocess.call(job_str, shell=True) 25 | 26 | def submit_new_jobs(capacity, job_str, current_idx): 27 | if capacity <= 0: 28 | return current_idx 29 | for i in range(0,capacity): 30 | submit_job(job_str) 31 | return current_idx + capacity 32 | 33 | if __name__ == '__main__': 34 | user = 'cwilks' 35 | if len(sys.argv) > 1: 36 | user = sys.argv[1] 37 | script = 'job-skx-normal_short.sh' 38 | if len(sys.argv) > 2: 39 | script = sys.argv[2] 40 | while(True): 41 | capacity = check_capacity(TOTAL, JOB_TAG, user=user) 42 | sys.stdout.write("capacity is %d\n" % capacity) 43 | current_idx = submit_new_jobs(capacity, 'sbatch %s' % script, 0) 44 | time.sleep(TIME_TO_SLEEP) 45 | -------------------------------------------------------------------------------- /singularity/hpc_unify.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | #fill these in as per your HPC cluster setup (default is for SLURM) 3 | #SBATCH --partition= 4 | #SBATCH --job-name= 5 | #SBATCH --nodes= 6 | #SBATCH --ntasks-per-node= 7 | #SBATCH --time= 8 | #runner for recount-unify (post-pump Monorail) on HPC 9 | set -exo pipefail 10 | 11 | #requires GNU parallel to run pump intra-node processes 12 | module load gnuparallel 13 | module load singularity 14 | 15 | #.e.g /path/to/monorail-exertnal/singularity 16 | dir=./ 17 | export IMAGE=/path/to/singularity_cache/recount-unify-1.0.4.simg 18 | export REF=hg38 19 | #this containers the subdirs hg38 (grcm38) and hg38_unify (grcm38_unify) 20 | export REFS_DIR=/path/to/refs 21 | export NUM_CORES=40 22 | 23 | #default project name (sra, tcga, gtex, etc...) and compilation ID (for rail_id generation) 24 | export PROJECT_SHORT_NAME_AND_ID='sra:101' 25 | 26 | #e.g. 
/scratch/04620/cwilks/workshop 27 | WORKING_DIR=$1 28 | STUDY=$2 29 | PUMP_OUTPUT_DIR=$3 30 | 31 | JOB_ID=$SLURM_JOB_ID 32 | export WORKING_DIR=$WORKING_DIR/unify/${STUDY}.${JOB_ID} 33 | mkdir -p $WORKING_DIR 34 | pump_study_samples_file=$WORKING_DIR/input_samples.tsv 35 | echo "study sample" > $pump_study_samples_file 36 | find $PUMP_OUTPUT_DIR -name "*.manifest" | sed 's#^.\+att0/\([^!]\+\)!\([^!]\+\)!.\+$#\2\t\1#' >> $pump_study_samples_file 37 | 38 | num_samples=$(tail -n+2 $pump_study_samples_file | wc -l) 39 | echo "number of samples in $STUDY's pump output for $PUMP_OUTPUT_DIR: $num_samples" 40 | 41 | /bin/bash $dir/run_recount_unify.sh $IMAGE $REF $REFS_DIR $WORKING_DIR $PUMP_OUTPUT_DIR $pump_study_samples_file $NUM_CORES $PROJECT_SHORT_NAME_AND_ID > $WORKING_DIR/unify.run 2>&1 42 | -------------------------------------------------------------------------------- /singularity/hpc_pump.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash -l 2 | #fill these in as per your HPC cluster setup (default is for SLURM) 3 | #SBATCH --partition= 4 | #SBATCH --job-name= 5 | #SBATCH --nodes= 6 | #SBATCH --ntasks-per-node= 7 | #SBATCH --time= 8 | 9 | #requires GNU parallel to run pump intra-node processes 10 | 11 | module load gnuparallel 12 | module load singularity 13 | 14 | #e.g. /path/to/monorail-external/singularity 15 | dir=./ 16 | export IMAGE=/path/to/singularity_cache/recount-rs5-1.0.6.simg 17 | export REF=hg38 18 | #this containers the subdirs hg38 (grcm38) and hg38_unify (grcm38_unify) 19 | export REFS_DIR=/path/to/refs 20 | export NUM_PUMP_PROCESSES=16 21 | export NUM_CORES=8 22 | 23 | #study name/accession, e.g. ERP001942 24 | study=$1 25 | #file with list of runs accessions to process from study 26 | #e.g. /home1/04620/cwilks/scratch/workshop/SRP096788.runs.txt 27 | runs_file=$2 28 | #e.g. /scratch/04620/cwilks/workshop 29 | WORKING_DIR=$3 30 | 31 | JOB_ID=$SLURM_JOB_ID 32 | export WORKING_DIR=$WORKING_DIR/pump/${study}.${JOB_ID} 33 | for f in input output temp temp_big; do mkdir -p $WORKING_DIR/$f ; done 34 | 35 | #store the log for each job run 36 | mkdir -p $WORKING_DIR/jobs_run/${JOB_ID} 37 | 38 | echo -n "" > $WORKING_DIR/pump.jobs 39 | for r in `cat $runs_file`; do 40 | echo "LD_PRELOAD=/work/00410/huang/share/patch/myopen.so /bin/bash -x $dir/run_recount_pump.sh $IMAGE $r $study $REF $NUM_CORES $REFS_DIR > $WORKING_DIR/${r}.${study}.pump.run 2>&1" >> $WORKING_DIR/pump.jobs 41 | done 42 | 43 | #ignore failures to get done as many as possible (e.g. 
don't want to lose the node if only one sub run/sample fails) 44 | parallel -j $NUM_PUMP_PROCESSES < $WORKING_DIR/pump.jobs || true 45 | -------------------------------------------------------------------------------- /scripts/compare_pump_runs.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | set -x 3 | export LC_ALL=C 4 | #assumes current working directory is the pump output of accession/sample we want to test 5 | #expect the following legit differences (all diffed below except manifest), which have no significant bearing on recount3/snaptron: 6 | #*.manifest 7 | #*.Chimeric.out.sam.zst (CO and PG headers maybe slightly different due to differences in filenames/paths, and/or order) 8 | #*.Chimeric.out.junction.zst (header may be slightly different due to same reason as Chimeric.out.sam.zst, and/or order) 9 | #*.salmon.tsv.zst (small changes in floating point coverage output, likely due to stochastic nature of computation) 10 | #*.jx_bed.zst (purely order) 11 | #*.bamcount_jx.tsv.zst (purely order) 12 | #*.bamcount_nonref.csv.zst (purely order) 13 | #any files in one directory not in the other 14 | 15 | #*.exon_bw_count.zst SHOULD NOT BE DIFFERENT AT ALL, or there's something wrong!! 16 | 17 | #path to previous pump run on same accession/sample (assumes "sra" is download method) 18 | path=$1 19 | 20 | ls -l | tr -s " " \\t | cut -f 5,9 | sort > one.list 21 | ls -l $path | tr -s " " \\t | cut -f 5,9 | sort > two.list 22 | diff one.list two.list | fgrep -v log | fgrep -v unmapped | fgrep -v summary | fgrep sra > list.diff 23 | 24 | for suffix in exon_bw_count.zst salmon.tsv.zst jx_bed.zst bamcount_jx.tsv.zst bamcount_nonref.csv.zst Chimeric.out.sam.zst Chimeric.out.junction.zst; do 25 | fn1=$(ls *.${suffix}) 26 | fn2=$(ls $path/*.${suffix}) 27 | zstd -cd $fn1 > one 28 | zstd -cd $fn2 > two 29 | cmd="cat " 30 | if [[ "$suffix" == "salmon.tsv.zst" || "$suffix" == "jx_bed.zst" ]]; then 31 | cmd="cut -f 1-3,5-" 32 | fi 33 | diff <($cmd one | sort) <($cmd two | sort) > ${suffix}.diff 34 | done 35 | rm -f one two 36 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # CHANGELOG for all things Monorail 2 | 3 | # 20230603 4 | update recount-pump `download.sh` (yet again) to use sratoolkit 3.0.2 for *both* prefetch *and* fastq-dump. 5 | 6 | The SRA file extension `.sralite` is now supported (beyond just the `.sra` file). 7 | 8 | Outputs should still be backwards compatible with previous Pump runs, so no need to re-run Pump runs from before this. 9 | 10 | Pump image now at 1.1.3 11 | 12 | # 20221012 13 | updated recount-pump `download.sh` (again) to use 1) sratoolkit 3.0.0 *for downloading via prefetch* to fix issues with dbGaP in 2.11.2 14 | 15 | Outputs are still backwards compatible with previous Pump runs, so no need to re-run Pump runs from before this. 16 | 17 | Pump image now at 1.1.1 18 | 19 | ## 20221011 20 | updated recount-pump `download.sh` to use 1) sratoolkit 2.11.2 *for downloading via prefetch* to fix issues with 2.9.1 and the cloud 2) support limited download-from-S3 *from within AWS* (e.g. on an EC2 instance, download method is `s3` instead of e.g. `local`). Also, the main Snakemake configuration file can be overridden from the command line by setting `CONFIGFILE=/container/visible/path/to/monorail_config.json`. 
This can be used to modify the `download_exe` setting to a user-developed script to support other download methods, e.g. `{"download_exe":"/container/visible/path/to/my_download.sh"}`. 21 | 22 | Outputs are still backwards compatible with previous Pump runs, so no need to re-run Pump runs from before this. 23 | 24 | Pump image now at 1.1.0 25 | 26 | ## 20221005 27 | updated recount-unify's workflow.bash to support more than just human G026 annotation for gene sums checking via Megadepth 28 | (was hanging at this step for non-human organisms). Only applies to non-human organism Unifier runs (should not longer stall at check step). 29 | 30 | Unifier image now at 1.1.1 31 | 32 | ## 20220219 33 | fixed recount-unify's rejoin collision bug causing a number of genes to have wrong sums, update is not compatible with previous versions of the Unifier. Users should re-run unifier on all pump outputs with this release! 34 | 35 | Unifier image now at 1.1.0 36 | 37 | ## 20210408 38 | added post-run row count checks to Unifier 39 | 40 | ## 20210315 41 | added some support for dbGaP runs via the `docker/run_recount_unify.sh` script 42 | -------------------------------------------------------------------------------- /get_unify_refs.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | #either hg38 or grcm38 3 | set -ex 4 | 5 | org=$1 6 | 7 | mkdir -p ${org}_unify 8 | pushd ${org}_unify 9 | 10 | #first grab all the disjoint exon mapping files (mapping back to annotated exons & genes) 11 | #as well as a stand in file with 0's for those samples w/o any exon sums (blank_exon_sums) 12 | #finally get the chromosome sizes for the genome reference used in recount-pump 13 | for f in disjoint2exons.bed.gz disjoint2exons2genes.bed.gz disjoint2exons2genes.rejoin_genes.bed.gz recount_pump.chr_sizes.tsv.gz blank_exon_sums.gz exon_bitmasks.tsv.gz exon_bitmask_coords.tsv.gz ; do 14 | unzipped=$(echo $f | sed 's/\.gz$//') 15 | if [[ ! -e "$unzipped" ]]; then 16 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}_unify/$f 17 | gunzip $f 18 | fi 19 | done 20 | 21 | #get rows counts for Unify post-run validation 22 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}_unify/gene_exon_annotation_row_counts.tsv 23 | 24 | #next get list of annotated jx's which is separate the main annotations used in recount-pump 25 | #annotated junctions stay gzipped 26 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}_unify/annotated_junctions.tsv.gz 27 | 28 | #now get genome ref FASTA file, this is part of the recount-pump refs 29 | #so just get it from there 30 | if [[ ! -e ../${org}/fasta/genome.fa ]]; then 31 | mkdir -p ../${org} 32 | pushd ../${org} 33 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}/fasta.tar.gz 34 | tar -zxvf fasta.tar.gz 35 | popd 36 | fi 37 | #can't be symbolic since the container won't be able to follow it 38 | ln -f ../${org}/fasta/genome.fa recount_pump.fa 39 | 40 | #now get disjoint of annotated exons, which is also part of the recount-pump refs 41 | if [[ ! 
-e ../${org}/gtf/exons.bed ]]; then 42 | mkdir -p ../${org} 43 | pushd ../${org} 44 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}/gtf.tar.gz 45 | tar -zxvf gtf.tar.gz 46 | popd 47 | fi 48 | #need to add a header to the exons file and gzip it 49 | #slight misnomer in the header, "gene" is really "chromosome" but leave for backwards compatibility 50 | cat <(echo "gene start end name score strand") ../${org}/gtf/exons.bed | gzip > exons.w_header.bed.gz 51 | 52 | #finally, grab per-annotation ordering and default annotation disjoin-exon-per-gene BED file for post-run resum check 53 | annotations="G026 G029 R109 F006 ERCC SIRV" 54 | default="G026" 55 | if [[ $org == "grcm38" ]]; then 56 | annotations="M023 ERCC SIRV" 57 | default="M023" 58 | fi 59 | if [[ ! -e disjoint2exons2genes.${default}.sorted.cut.bed ]]; then 60 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}_unify/disjoint2exons2genes.${default}.sorted.cut.bed.gz 61 | gunzip disjoint2exons2genes.${default}.sorted.cut.bed.gz 62 | fi 63 | for f in $annotations; do 64 | f="${f}.gene_sums.gene_order.tsv.gz" 65 | unzipped=$(echo $f | sed 's/\.gz$//') 66 | if [[ ! -e "$unzipped" ]]; then 67 | wget https://genome-idx.s3.amazonaws.com/recount/recount-ref/${org}_unify/$f 68 | gunzip $f 69 | fi 70 | done 71 | 72 | popd 73 | -------------------------------------------------------------------------------- /sra/README.md: -------------------------------------------------------------------------------- 1 | Downloading from SRA is a common task for us. 2 | It's also not completely straightforward. 3 | 4 | While many of the sequence run accessions *may* be found on the European mirror of SRA (ENA), not all will be there. 5 | See the last section of this document for downloading from ENA. 6 | 7 | First, it's helpful to know there's some fairly up-to-date info from the SRA folks themselves: 8 | https://github.com/ncbi/sra-tools/wiki 9 | 10 | Second, you will need to install a version (preferably the latest) of SRA-tools 11 | to download from SRA. 12 | 13 | If you're on a well maintained HPC system (e.g. TACC) there's probably a form of the `module` system running. 14 | 15 | `module spider sra` may result in a module you can `module load ` 16 | 17 | Otherwise conda's probably the easiest: 18 | 19 | `conda install -c bioconda sra-tools` 20 | 21 | As of 2019-07-29 the most consistently stable approach to downloading 22 | from SRA (public data) is to use the following strategy: 23 | 24 | * prefetch HTTPS to download cSRA formatted files 25 | * parallel-fastq-dump to convert from cSRA to actual FASTQ 26 | 27 | # Downloading a sequence run 28 | generic `prefetch` command line: 29 | 30 | `prefetch --max-size 200G -L info -t http -O ` 31 | 32 | working example: 33 | 34 | `prefetch --max-size 200G -L info -t http -O ./ ERR204946` 35 | 36 | If you do want to use Aspera for speed (and you know it'll work on your particular accession), just swap the `http` in the `-t` option with `fasp`. 37 | 38 | # Converting a downloaded cSRA file to FASTQ file(s) 39 | By leveraging the `-N` and `-X` (range of spots) options of `fastq-dump`, `parallel-fastq-dump` can run multiple `fastq-dump`s in parallel on the same downloaded cSRA file. 40 | 41 | https://github.com/rvalieris/parallel-fastq-dump 42 | 43 | We have found `parallel-fastq-dump` to consistently work with certain `fastq-dump` options better than `fasterq-dump`. 
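For intuition, the `-N`/`-X` spot-range mechanism described above is roughly equivalent to hand-running several plain `fastq-dump` processes over non-overlapping ranges (a sketch only — the spot boundaries and `part*` output directories are made-up placeholders; in practice let `parallel-fastq-dump` do the splitting and re-concatenation):

```
#dump two non-overlapping spot ranges of the same downloaded cSRA file concurrently,
#each into its own output directory so the intermediate FASTQs don't collide
fastq-dump -N 1 -X 500000 --split-3 --skip-technical --gzip -O part1 ERR204946.sra &
fastq-dump -N 500001 -X 1000000 --split-3 --skip-technical --gzip -O part2 ERR204946.sra &
wait
#the per-range FASTQs are then concatenated back together in spot order
```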
44 | 45 | generic `parallel-fastq-dump` command line: 46 | 47 | ``` 48 | parallel-fastq-dump --sra-id /path/to/downloaded_cSRA_files --threads --tmpdir /path/to/temporary_working_dir -L info --split-3 --skip-technical --outdir ./ --gzip 49 | ``` 50 | 51 | NOTE: the argument to `--tmpdir` needs to exist before running the command. 52 | 53 | working example: 54 | ``` 55 | parallel-fastq-dump --sra-id ./ERR204946.sra --threads 4 --tmpdir ./tmp -L info --split-3 --skip-technical --outdir ./ --gzip 56 | ``` 57 | 58 | If for some reason `parallel-fastq-dump` isn't available, you can fall back to using good old `fastq-dump` it'll just be slower (and the options may be a little different). 59 | 60 | # Pipeline specific information 61 | The above will work in either a one-off way or for a pipeline. 62 | 63 | However, additional consideration should be taken in the case of a pipeline processing many run accessions in a batch. 64 | 65 | * To save temporary working space, configure SRA-tools to NOT "Enable Local File Caching (2)" 66 | * If running downloads from within in a container (Docker/Singularity) you *may* need to modify the SRA-tools configuration paths for downloading files to work within the container context 67 | 68 | 69 | # Downloading a sequence run from ENA 70 | 71 | On occasion we may want to download FASTQs from ENA rather than SRA. 72 | 73 | This would mainly be because ENA supports 1) direct compressed FASTQ download 2) from an FTP. 74 | These features obviate the need of using SRA-tools to 1) download and 2) convert to FASTQ. 75 | 76 | ENA also supports Globus: 77 | 78 | https://www.ebi.ac.uk/about/news/service-news/read-data-through-globus-gridftp 79 | 80 | However, *not all* SRA runs are available from ENA! 81 | 82 | Quick example of a direct FASTQ sequence run download link via HTTP/FTP for a 6-digit run: 83 | 84 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR204/ERR204946/ERR204946_1.fastq.gz 85 | 86 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR204/ERR204946/ERR204946_2.fastq.gz 87 | 88 | Single end, 7-digit run: 89 | 90 | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/SRR1016916.fastq.gz 91 | 92 | See the ENA examples page for more details: 93 | https://www.ebi.ac.uk/ena/browse/read-download 94 | -------------------------------------------------------------------------------- /gdc/README.md: -------------------------------------------------------------------------------- 1 | The protected cancer projects TCGA & TARGET and the publicly accessible cancer cell line project (CCLE) sequence data (old, 2012 version) are all hosted on the Genomic Data Commons (GDC) maintained by the University of Chicago. 2 | 3 | They can be accessed via the "legacy" portal: 4 | 5 | https://portal.gdc.cancer.gov/legacy-archive/search/f 6 | 7 | NOTE: There is an updated version of the CCLE sequence data done by Broad (who also did the original, 2012 version) available from the SRA with more samples (total of 1019 in SRA vs. 935 in GDC). This newer version was uploaded to SRA 03/2019: 8 | 9 | https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP186687 10 | 11 | # GDC Access Key 12 | NOTE: for CCLE files only, you DO NOT NEED a token file, they are publicly accessible, no special access required. 13 | 14 | Similar to SRA (but entirely distinct) the GDC issues an access key/token. 15 | 16 | HOWEVER, GDC's access keys expire in a month from the time they're issued/downloaded (SRA's do not expire for several months=>a few years). 
17 | 18 | To get the GDC access key you must have a current eRA username/password (for users internal to NIH it may be different). This should be the same as the username/password you use to login to dbGaP. 19 | 20 | Once you have this, you can navigate to the main GDC legacy portal page listed above and click on the "Login" link in the upper right hand corner of the page. 21 | 22 | This will pop-up a separate window for the eRA login site, just enter your current eRA username/password on the left-hand pane and click the "Log in" blue button. This window will then close and the GDC legacy portal page will refresh but now should see your eRA/dbGaP username in the upper right hand corner in place of "Login". Click your user name to see a drop down where you can "Download Token" as a file to your local machine. 23 | 24 | KEEP THIS FILE SECURE! `chmod` it so only you have READ/WRITABLE permissions on it! 25 | 26 | This key file will be needed as a command line argument to the GDC transfer tool. 27 | 28 | # GDC File Transfer 29 | 30 | You'll need the GDC-specific file transfer tool: 31 | 32 | `conda install -c bioconda gdc-client` 33 | 34 | Legacy projects in the GDC use UUIDs (verion 4) to uniquely identify sequence read files (and BAMs). 35 | 36 | e.g. `3d94efb8-94d4-4fde-97d8-18a159279996` 37 | 38 | This file UUID will be used to reference the specific sequence read file you want to download in the GDC transfer tool. 39 | 40 | general example: 41 | ``` 42 | gdc-client download /path/to/GDC_TOKEN_FILE -n -d /path/to/download/dir --retry-amount 3 --wait-time 3 43 | ``` 44 | As noted above, you don't need a token file for CCLE data. 45 | 46 | The command above includes a retry parameter and a wait-time parameter which control 1) how many retries to attempt and 2) the amount of time (sec.) to wait between retries. 47 | 48 | The `gdc-client` can also take a "manifest" file either downloaded from the shopping-cart-esque interface linked above or hand created, the main part of that file is a list of GDC file UUIDs that the client will download (serially?) if you want to download more than one at a time for a given session. 49 | 50 | There are times where the `gdc-client` appeared to stall, these may be due to filesystem issues or it may be the client itself. Using the `timeout` command from `bash` in front of a `gdc-client` command may help get around this in the context of a pipeline where you want to quickly fail individual files which can't download or are too slow. 51 | 52 | # Post download extraction operation 53 | 54 | RNA-seq files from GDC are typically packaged as either gzipped TAR files containing FASTQ files *OR* as tar files containing gzipped FASTQs. 55 | 56 | Before running the downloaded files from GDC through Monorail-external, you *must* extract the FASTQs ahead of time. 57 | 58 | # Running with Monorail's Containers 59 | 60 | Monorail run its alignment workflow (the `pump`), including downloads in Docker/Singularity containers. 61 | Therefore, the GDC_TOKEN_FILE must be accessible to the container when it's running the download. 62 | 63 | Every file UUID to be downloaded must reference the full path to the GDC_TOKEN_FILE *within the container*, e.g.: 64 | 65 | ```/container-mounts/recount/ref/gdc_creds/gdc-user-token.2019-12-26T23_45_00-08_00.txt``` 66 | 67 | This requires that the GDC_TOKEN_FILE is stored under the references directory on the host filesystem, which is already mounted into the container under `/container-mounts/recount/ref` (by default). 
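As a rough host-side sketch tying the above together (the file UUID and token filename are the examples from earlier in this page, while the refs path, download directory, connection count and 2-hour timeout are placeholders, not Monorail defaults):

```
#keep the token under the refs dir so the container sees it at /container-mounts/recount/ref/gdc_creds/...
REF_HOST=/path/to/refs
mkdir -p $REF_HOST/gdc_creds
chmod 600 $REF_HOST/gdc_creds/gdc-user-token.2019-12-26T23_45_00-08_00.txt

#download outside the container, wrapped in `timeout` so a stalled transfer fails fast
UUID=3d94efb8-94d4-4fde-97d8-18a159279996
timeout 2h gdc-client download $UUID -t $REF_HOST/gdc_creds/gdc-user-token.2019-12-26T23_45_00-08_00.txt -n 4 -d /path/to/download/dir --retry-amount 3 --wait-time 3

#GDC RNA-seq is usually a (gzipped) tar of FASTQs: extract before running Monorail on it
tar -xvf /path/to/download/dir/$UUID/*.tar* -C /path/to/download/dir/$UUID/
```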
68 | -------------------------------------------------------------------------------- /fetch_sra_accessions_for_study.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | #script to write out jobs to download all of SRA human & mouse RNA-seq metadata 3 | import sys 4 | import os 5 | import argparse 6 | import shutil 7 | import re 8 | from Bio import Entrez as e 9 | import xml.dom.minidom 10 | import xml.etree.ElementTree as ET 11 | 12 | scriptpath = os.path.dirname(sys.argv[0]) 13 | #required to set an email 14 | e.email="downloadsRUs@dos.com" 15 | 16 | #always query for: Illumina + RNA-seq + Transcriptomic source + public while skipping smallRNAs 17 | #base_query = '(((((illumina[Platform]) AND rna seq[Strategy]) AND transcriptomic[Source]) AND public[Access]) NOT size fractionation[Selection])' 18 | base_query = '(((illumina[Platform]) AND rna seq[Strategy]) NOT size fractionation[Selection])' 19 | public_only = 'public[Access]' 20 | 21 | parser = argparse.ArgumentParser(description='query NCBI SRA for run accessions associated with passed in study accessions, e.g. ERP001942') 22 | parser.add_argument('--orgn', metavar='[SRA organism string]', type=str, default='human', help='biological organism to query ([default: human], mouse)') 23 | parser.add_argument('--tmp', metavar='[path string]', type=str, default='/tmp', help='DTD files from SRA will be stored here, default="/tmp/"') 24 | parser.add_argument('--batch-size', metavar='[integer]', type=int, default=50, help='number of full records to retrieve in a single curl job') 25 | parser.add_argument('--study', metavar='[accession string]', type=str, help='search for a single SRA accession (typically a study one e.g. ERP001942)') 26 | 27 | parser.add_argument('--exclude-protected', action='store_const', const=True, default=False, help='will query only from public, default will include protected (dbGaP) as well as public runs') 28 | 29 | parser.add_argument('--base-query', metavar='[SRA query string]', type=str, default=None, help='override base query, default: \'%s\'' % base_query) 30 | 31 | args = parser.parse_args() 32 | 33 | args.tmp=args.tmp+'/'+args.study 34 | os.makedirs(args.tmp, exist_ok=True) 35 | 36 | if args.exclude_protected: 37 | base_query = '(' + base_query + ' AND ' + public_only + ')' 38 | 39 | if args.study is not None: 40 | base_query = '(' + base_query + ' AND %s[Accession])' % args.study 41 | 42 | if args.base_query is not None: 43 | base_query = args.base_query 44 | 45 | patt = re.compile(r'\s+') 46 | orgn_nospace = 'all_organisms' 47 | if args.orgn != 'all': 48 | orgn_nospace = re.sub(patt, r'_', args.orgn) 49 | base_query += " AND %s[Organism]" % args.orgn 50 | 51 | es_ = e.esearch(db='sra', term=base_query, usehistory='y') 52 | #workaround for non-home directories for writing DTDs locally: 53 | #https://github.com/biopython/biopython/issues/918 54 | def _Entrez_read(handle, validate=True, escape=False): 55 | from Bio.Entrez import Parser 56 | from Bio import Entrez 57 | handler = Entrez.Parser.DataHandler(validate, escape) 58 | handler.directory = args.tmp # the only difference between this and `Entrez.read` 59 | record = handler.read(handle) 60 | return record 61 | es = _Entrez_read(es_) 62 | 63 | #number of records is # of EXPERIMENTs (SRX) NOT # RUNs (SRR) 64 | total_records = int(es["Count"]) 65 | sys.stderr.write("Total # of records is %d for %s using query %s\n" % (total_records, args.orgn, base_query)) 66 | 67 | num_fetches = int(total_records / 
args.batch_size) + 1 68 | 69 | try: 70 | for retstart_idx in range(0,num_fetches): 71 | start_idx = retstart_idx * args.batch_size 72 | end_idx = (start_idx + args.batch_size)-1 73 | fetch_handle = e.efetch(db='sra',retstart=start_idx, retmax=args.batch_size,retmode='xml',webenv=es['WebEnv'],query_key=es['QueryKey']) 74 | #biopython's Entrez module doesn't parse SRA's raw XML return format 75 | #so use Python's built-in ElementTree parser 76 | root = ET.fromstring(fetch_handle.read()) 77 | #Top Level: EXPERIMENT_PACKAGE 78 | for exp in root.findall('EXPERIMENT_PACKAGE'): 79 | #=1.0.4), are compared directly against the current recount3 data release;s exon sums. The order of exon features/rows will *not* be the same, so a simple `cbind` operation will not work in R. A reordering of one or both RSE will be required to make them directly `cbind`able. 80 | 81 | Any per-study exon sums from 3rd party studies run through the Unifier prior to 1.0.4 should be discarded and not used (not applicable to Snaptron's exon sums). 82 | -------------------------------------------------------------------------------- /singularity/run_recount_unify.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | #make sure singularity is loaded/in $PATH ahead of time 3 | set -exo pipefail 4 | 5 | #singularity_image_file=recount-unify_latest.sif 6 | 7 | #e.g. recount-unify_latest.sif 8 | singularity_image_file=$1 9 | 10 | #hg38 or grcm38 11 | REF=$2 12 | 13 | #full path on host to one direcotry above where get_unify_refs.sh deposited its files 14 | REF_DIR_HOST=$3 15 | 16 | #full path on host to where we should actually run the unifier 17 | WORKING_DIR_HOST=$4 18 | pushd $WORKING_DIR_HOST 19 | 20 | #full path on host to where the output from recount-pump resides 21 | #this needs to be writable by this script! 22 | INPUT_DIR_HOST=$5 23 | 24 | #list of 2 or more study_idsample_id + any number of tab delimited optional fields 25 | #this file MUST have a header otherwise the first row will be skipped! 26 | SAMPLE_ID_MANIFEST_HOST=$6 27 | 28 | #number of processes to start within container, 10-40 are reasonable depending on the system/run 29 | NUM_CPUS=$7 30 | 31 | #optional, this is used as the project name as well as compilation_id in the jx output, defaults to "sra" and 101 respectively 32 | #compilation short name (e.g. 
"sra", "gtex", or "tcga", to be compatible with recount3 outputs) or custom name 33 | #format: 'project_short_name:project_id', default:'sra:101' 34 | export PROJECT_SHORT_NAME='sra' 35 | export PROJECT_ID=101 36 | export PROJECT_SHORT_NAME_AND_ID=$8 37 | if [[ -n $PROJECT_SHORT_NAME_AND_ID ]]; then 38 | failed_format_check=$(perl -e '$in="'$PROJECT_SHORT_NAME_AND_ID'"; chomp($in); ($p,$pid)=split(/:/,$in); if(length($p) == 0 || $pid!~/^(\d+)$/ || $pid < 100 || $pid > 249) { print" bad project short name ($p) and/or project ID ($pid) input, format :project_id(int)> and project_id must be > 100 and < 250, exiting\n"; exit(-1);}') 39 | if [[ -n $failed_format_check ]]; then 40 | echo $failed_format_check 41 | exit -1 42 | fi 43 | export PROJECT_SHORT_NAME=$(echo $PROJECT_SHORT_NAME_AND_ID | tr ':' \\t | cut -f 1) 44 | export PROJECT_ID=$(echo $PROJECT_SHORT_NAME_AND_ID | tr ':' \\t | cut -f 2) 45 | fi 46 | echo "PROJECT_SHORT_NAME=$PROJECT_SHORT_NAME" 47 | echo "PROJECT_ID=$PROJECT_ID" 48 | 49 | export MULTI_STUDY=1 50 | #optional, only used if you explicitly want recount-unify to build a single study 51 | #this is usually true only when you want to skip producing recount3 formatted data 52 | #e.g. you only want Snaptron-ready output 53 | SINGLE_STUDY_ONLY=$9 54 | if [[ ! -z $SINGLE_STUDY_ONLY ]]; then 55 | MULTI_STUDY= 56 | fi 57 | 58 | INPUT_FROM_PUMP_DIR=$WORKING_DIR_HOST/input_from_pump 59 | mkdir -p $INPUT_FROM_PUMP_DIR 60 | 61 | #NOTE: the following assumes the unifier is being run on the *same filesystem* as the pump 62 | #as it makes hard links to the pump output 63 | #make sure input data is properly organized for the Unifier 64 | #assumes an original output format: $INPUT_DIR_HOST/sampleID_att0/sampleID!studyID!*.manifest 65 | #we can skip this if $SKIP_FIND is set in the running environment 66 | #../geuvadis_small_output/ERR188431_att0/ERR188431!ERP001942!hg38!sra.align.log 67 | if [[ ! 
-z $MULTI_STUDY && -z $SKIP_FIND ]]; then 68 | find $INPUT_DIR_HOST -name '*!*' | perl -ne 'BEGIN { $run_id=1; } $working_dir="'$INPUT_FROM_PUMP_DIR'"; chomp; $f=$_; @f=split(/\//,$f); $fm=pop(@f); $original=join("/",@f); $run_dir=pop(@f); @f2=split(/!/,$fm); $sample=shift(@f2); if(!$h{$sample}) { $h{$sample}=$run_id++; } $i=$h{$sample}; $study=shift(@f2); $study=~/(..)$/; $lo1=$1; $sample=~/(..)$/; $lo2=$1; $parent=join("/",@f); $newsub="$working_dir/$lo1/$study/$lo2/$sample"; $i++; $run_dir=~s/(_att\d+)$/_in$i$1/; `mkdir -p $newsub/$run_dir ; ln -f $f $newsub/$run_dir/ ; touch $newsub/$run_dir.done`;' 69 | fi 70 | 71 | #inside container mount for REF files 72 | export REF_DIR=/container-mounts/ref 73 | 74 | REF_DIR_HOST=$REF_DIR_HOST/$REF"_unify" 75 | 76 | export ORGANISM_REF=$REF 77 | 78 | #human 79 | if [[ $REF == 'hg38' ]]; then 80 | export LIST_OF_ANNOTATIONS='G026,G029,R109,F006,ERCC,SIRV' 81 | #mouse 82 | else 83 | if [[ $REF == 'grcm38' ]]; then 84 | export LIST_OF_ANNOTATIONS='M023,ERCC,SIRV' 85 | else 86 | echo "ERROR: unrecognized organism: $REF, exiting" 87 | exit 88 | fi 89 | fi 90 | 91 | #generic names are used for the organism specific REF files upstream 92 | #so just need to assign them to the ENV vars expected by the container 93 | export ANNOTATED_JXS=$REF_DIR/annotated_junctions.tsv.gz 94 | export EXON_COORDINATES_BED=$REF_DIR/exons.w_header.bed.gz 95 | export EXON_REJOIN_MAPPING=$REF_DIR/disjoint2exons.bed 96 | export GENE_REJOIN_MAPPING=$REF_DIR/disjoint2exons2genes.bed 97 | export GENE_ANNOTATION_MAPPING=$REF_DIR/disjoint2exons2genes.rejoin_genes.bed 98 | export REF_FASTA=$REF_DIR/recount_pump.fa 99 | export REF_SIZES=$REF_DIR/recount_pump.chr_sizes.tsv 100 | export EXON_BITMASKS=$REF_DIR/exon_bitmasks.tsv 101 | export EXON_BITMASK_COORDS=$REF_DIR/exon_bitmask_coords.tsv 102 | 103 | export INPUT_DIR=/container-mounts/input 104 | export WORKING_DIR=/container-mounts/working 105 | export REF_DIR=/container-mounts/ref 106 | 107 | export RECOUNT_CPUS=$NUM_CPUS 108 | 109 | 110 | #do some munging of the passed in sample IDs + optional metadata files 111 | sample_id_manfest_fn=$(basename $SAMPLE_ID_MANIFEST_HOST) 112 | #first copy the full original samples manifest into a directory visible to the container 113 | rsync -av $SAMPLE_ID_MANIFEST_HOST $WORKING_DIR_HOST/$sample_id_manfest_fn 114 | export SAMPLE_ID_MANIFEST_HOST_ORIG=$WORKING_DIR_HOST/$sample_id_manfest_fn 115 | export SAMPLE_ID_MANIFEST_HOST=$WORKING_DIR_HOST/ids.input 116 | #now cut out just the 1st 2 columns (study, sample_id), expecting a header 117 | cut -f 1,2 $SAMPLE_ID_MANIFEST_HOST_ORIG | tail -n+2 > $SAMPLE_ID_MANIFEST_HOST 118 | export SAMPLE_ID_MANIFEST=$WORKING_DIR/ids.input 119 | export SAMPLE_ID_MANIFEST_ORIG=$WORKING_DIR/$sample_id_manfest_fn 120 | 121 | singularity exec -B $INPUT_FROM_PUMP_DIR:$INPUT_DIR -B $WORKING_DIR_HOST:$WORKING_DIR -B $REF_DIR_HOST:$REF_DIR $singularity_image_file /bin/bash -x -c "source activate recount-unify && /recount-unify/workflow.bash" 122 | 123 | #putting all relevant final output files in one directory 124 | mkdir -p ../run_files 125 | mv * ../run_files/ 126 | mv ../run_files/ ./ 127 | pushd run_files 128 | mv ids.tsv junctions.* lucene* samples.* gene_sums_per_study exon_sums_per_study junction_counts_per_study metadata ../ 129 | popd 130 | -------------------------------------------------------------------------------- /singularity/run_recount_pump.sh: -------------------------------------------------------------------------------- 1 | #make sure 
singularity/docker are in PATH 2 | umask 0077 3 | export PERL5LIB= 4 | 5 | #if running a sample from dbGaP (hosted by SRA), you must 6 | #use >=recount-pump:1.1.1 and you'll need to provide the path to the key file (*.ngc) 7 | #via the NGC environmental variable, e.g.: 8 | #export NGC=/container-mounts/recount/ref/.ncbi/prj_.ngc 9 | #where /container-mounts/recount/ref is the *within* container path to the same path as $ref_path below 10 | 11 | #this script will automatically attempt to determine whether Singularity or Docker should be run 12 | #based on whether or not the argument to $container_image image below has a ".simg" or a ".sif" suffix 13 | #if not, it will run Docker 14 | 15 | #e.g. if Singularity, then something like recount-rs5-1.0.2.simg or recount-rs5-1.0.2.sif 16 | #OR if Docker, the name of the image in the local repo or the full name:version 17 | container_image=$1 18 | 19 | #run accession (sra, e.g. SRR390728), or internal ID (local), 20 | #this can be anything as long as its consistently used to identify the particular sample 21 | run_acc=$2 22 | #"local", "copy" (still local), or the SRA study accession (e.g. SRP020237) if downloading from SRA 23 | study=$3 24 | #"hg38" (human) or "grcm38" (mouse) 25 | ref_name=$4 26 | #number of processes to start within container, 4-16 are reasonable depending on the system/run 27 | num_cpus=$5 28 | #full path to location of downloaded refs 29 | #this directory should contain either "hg38" or "grcm38" subdirectories (or both) 30 | ref_path=$6 31 | 32 | #full file path to first read mates (optional) 33 | fp1=$7 34 | #full file path to second read mates (optional, if not using a 2nd read file but want to set actual_study below, just set to same 35 | #value as fp1 36 | fp2=$8 37 | 38 | #if running "local" (or "copy"), then use this to pass the real study name (optional) 39 | actual_study=$9 40 | 41 | #if overriding pump Snakemake parameters (see recount-pump/workflow/rs5/workflow.bash comments) 42 | #set CONFIGFILE=/container/reachable/path/to/config.json 43 | #if using a different version of sratoolkit than what's in the base recount-pump container 44 | #you will need to set the VDB_CONFIG to a container reachable path for this sratoolkit's version of user-settings.mkfg is located 45 | 46 | #change this if you want a different root path for all the outputs 47 | #(Docker needs absolute paths to be volume bound in the container) 48 | if [[ -z $WORKING_DIR ]]; then 49 | root=`pwd` 50 | else 51 | root=$WORKING_DIR 52 | fi 53 | 54 | export RECOUNT_JOB_ID=${run_acc}_in0_att0 55 | 56 | #assumes these directories are subdirs in current working directory 57 | export RECOUNT_INPUT_HOST=$root/input/${run_acc}_att0 58 | #RECOUNT_OUTPUT_HOST stores the final set of files and some of the intermediate files while the process is running. 59 | export RECOUNT_OUTPUT_HOST=$root/output/${run_acc}_att0 60 | #RECOUNT_TEMP_HOST stores the initial download of sequence files, typically this should be on a fast filesystem as it's the most IO intensive from our experience (use either a performance oriented distributed FS like Lustre or GPFS, or a ramdisk). 61 | export RECOUNT_TEMP_HOST=$root/temp/${run_acc}_att0 62 | #the *full* path to the reference indexes on the host (this directory should contain the ref_name passed in as a subdir e.g. 
"hg38") 63 | export RECOUNT_REF_HOST=$ref_path 64 | 65 | mkdir -p $RECOUNT_TEMP_HOST/input 66 | mkdir -p $RECOUNT_INPUT_HOST 67 | mkdir -p $RECOUNT_OUTPUT_HOST 68 | 69 | if [[ -z $RECOUNT_TEMP ]]; then 70 | export RECOUNT_TEMP=/container-mounts/recount/temp 71 | fi 72 | 73 | #expects at least $fp1 to be passed in 74 | if [[ $study == 'local' || $study == 'copy' ]]; then 75 | if [[ -z "$actual_study" ]]; then 76 | actual_study="LOCAL_STUDY" 77 | fi 78 | 79 | if [[ $study == 'local' ]]; then 80 | #hard link the input FASTQ(s) into input directory 81 | #THIS ASSUMES input files are *on the same filesystem* as the input directory! 82 | #this is required for accessing the files in the container 83 | ln -f $fp1 $RECOUNT_TEMP_HOST/input/ 84 | else 85 | cp $fp1 $RECOUNT_TEMP_HOST/input/ 86 | fi 87 | fp1_fn=$(basename $fp1) 88 | fp_string="$RECOUNT_TEMP/input/$fp1_fn" 89 | if [[ ! -z $fp2 && $fp2 != $fp1 ]]; then 90 | if [[ $study == 'local' ]]; then 91 | ln -f $fp2 $RECOUNT_TEMP_HOST/input/ 92 | else 93 | cp $fp2 $RECOUNT_TEMP_HOST/input/ 94 | fi 95 | fp2_fn=$(basename $fp2) 96 | fp_string="$RECOUNT_TEMP/input/$fp1_fn;$RECOUNT_TEMP/input/$fp2_fn" 97 | fi 98 | #only one run accession per run of this file 99 | #If you try to list multiple items in a single accessions.txt file you'll get a mixed run which will fail. 100 | echo -n "${run_acc},${actual_study},${ref_name},local,$fp_string" > ${RECOUNT_INPUT_HOST}/accession.txt 101 | else 102 | echo -n "${run_acc},${study},${ref_name},sra,${run_acc}" > ${RECOUNT_INPUT_HOST}/accession.txt 103 | fi 104 | 105 | export RECOUNT_INPUT=/container-mounts/recount/input 106 | export RECOUNT_OUTPUT=/container-mounts/recount/output 107 | export RECOUNT_REF=/container-mounts/recount/ref 108 | 109 | export RECOUNT_CPUS=$num_cpus 110 | 111 | export RECOUNT_TEMP_BIG_HOST=$root/temp_big/${run_acc}_att0 112 | mkdir -p $RECOUNT_TEMP_BIG_HOST 113 | if [[ -z $RECOUNT_TEMP_BIG ]]; then 114 | export RECOUNT_TEMP_BIG=/container-mounts/recount/temp_big 115 | fi 116 | 117 | export KEEP_BAM=1 118 | 119 | use_singularity=$(perl -e 'print "1\n" if("'$container_image'"=~/(\.sif$)|(\.simg$)/);') 120 | if [[ -z $CONFIGFILE ]]; then 121 | CONFIGFILE="" 122 | fi 123 | if [[ -z $use_singularity ]]; then 124 | echo "running Docker" 125 | if [[ -n $DOCKER_USER ]]; then 126 | DOCKER_USER="--user $DOCKER_USER" 127 | fi 128 | if [[ -n $NGC ]]; then 129 | extra="-e NGC $extra" 130 | fi 131 | #shared memory flag for all docker containers sharing the host's shared memory 132 | if [[ -z $NO_SHARED_MEM ]]; then 133 | extra=${extra}" --ipc=host" 134 | fi 135 | #maybe needed for aarch64 run 136 | #DOCKER_USER="--user root" 137 | docker run $DOCKER_USER --rm -e VDB_CONFIG -e RECOUNT_INPUT -e RECOUNT_OUTPUT -e RECOUNT_REF -e RECOUNT_TEMP -e RECOUNT_TEMP_BIG -e RECOUNT_CPUS -e KEEP_BAM -e KEEP_FASTQ -e KEEP_UNMAPPED_FASTQ -e NO_SHARED_MEM -e CONFIGFILE -v $RECOUNT_REF_HOST:$RECOUNT_REF -v $RECOUNT_TEMP_BIG_HOST:$RECOUNT_TEMP_BIG -v $RECOUNT_TEMP_HOST:$RECOUNT_TEMP -v $RECOUNT_INPUT_HOST:$RECOUNT_INPUT -v $RECOUNT_OUTPUT_HOST:$RECOUNT_OUTPUT $extra --name recount-pump${run_acc} $container_image 138 | else 139 | echo "running Singularity" 140 | singularity exec -B $RECOUNT_REF_HOST:$RECOUNT_REF -B $RECOUNT_TEMP_BIG_HOST:$RECOUNT_TEMP_BIG -B $RECOUNT_TEMP_HOST:$RECOUNT_TEMP -B $RECOUNT_INPUT_HOST:$RECOUNT_INPUT -B $RECOUNT_OUTPUT_HOST:$RECOUNT_OUTPUT $extra $container_image /bin/bash -x -c "source activate recount && /startup.sh && /workflow.bash" 141 | fi 142 | 
--------------------------------------------------------------------------------
/docker/run_recount_unify.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | #make sure docker is loaded/in $PATH ahead of time
3 | set -exo pipefail
4 | 
5 | #For dbGaP runs, define PROTECTED=1 in the running environment (e.g. in bash "export PROTECTED=1")
6 | #To do genotyping with QC, define SNPS_FILE_FOR_GENOTYPING=$REF_DIR_HOST/path/to/snps.vcf
7 | #To bypass initial link creation for a rerun, define SKIP_FIND=1
8 | 
9 | #set RU_USER in environment to a user name you want the unifier run as, otherwise it'll default to root
10 | #and won't try to move files around at the end (all raw unifier output will be kept at the root of $WORKING_DIR_HOST)
11 | #typically if you're running in an AWS VM, either "ec2-user" or "ubuntu" should do the trick
12 | if [[ -z $RU_USER ]]; then
13 | RU_USER=root
14 | fi
15 | 
16 | 
17 | #docker only takes the UID, so need to get that locally
18 | #this won't work for ldap/external user account control
19 | RU_UID=$(egrep -e "^$RU_USER:" /etc/passwd | cut -d':' -f 3)
20 | 
21 | #docker image name
22 | image_name=$1
23 | 
24 | #hg38 or grcm38
25 | REF=$2
26 | 
27 | #full path on host to one directory above where get_unify_refs.sh deposited its files
28 | REF_DIR_HOST=$3
29 | 
30 | #full path on host to where we should actually run the unifier
31 | WORKING_DIR_HOST=$4
32 | 
33 | #full path on host to where the output from recount-pump resides
34 | #this needs to be writable by this script!
35 | INPUT_DIR_HOST=$5
36 | 
37 | #list of 2 or more study_id<tab>sample_id + any number of tab delimited optional fields
38 | #this file MUST have a header otherwise the first row will be skipped!
39 | SAMPLE_ID_MANIFEST_HOST=$6
40 | 
41 | #number of processes to start within container, 10-40 are reasonable depending on the system/run
42 | NUM_CPUS=$7
43 | 
44 | #pushd $WORKING_DIR_HOST
45 | 
46 | #optional, this is used as the project name as well as compilation_id in the jx output, defaults to "sra" and 101 respectively
47 | #compilation short name (e.g. "sra", "gtex", or "tcga", to be compatible with recount3 outputs) or custom name
48 | #format: 'project_short_name:project_id', default:'sra:101'
49 | export PROJECT_SHORT_NAME='sra'
50 | export PROJECT_ID=101
51 | export PROJECT_SHORT_NAME_AND_ID=$8
52 | if [[ -n $PROJECT_SHORT_NAME_AND_ID ]]; then
53 | failed_format_check=$(perl -e '$in="'$PROJECT_SHORT_NAME_AND_ID'"; chomp($in); ($p,$pid)=split(/:/,$in); if(length($p) == 0 || $pid!~/^(\d+)$/ || $pid < 100 || $pid > 249) { print" bad project short name ($p) and/or project ID ($pid) input, format <project_short_name(string)>:<project_id(int)> and project_id must be > 100 and < 250, exiting\n"; exit(-1);}')
54 | if [[ -n $failed_format_check ]]; then
55 | echo $failed_format_check
56 | exit -1
57 | fi
58 | export PROJECT_SHORT_NAME=$(echo $PROJECT_SHORT_NAME_AND_ID | tr ':' \\t | cut -f 1)
59 | export PROJECT_ID=$(echo $PROJECT_SHORT_NAME_AND_ID | tr ':' \\t | cut -f 2)
60 | fi
61 | echo "PROJECT_SHORT_NAME=$PROJECT_SHORT_NAME"
62 | echo "PROJECT_ID=$PROJECT_ID"
63 | 
64 | export MULTI_STUDY=1
65 | #optional, only used if you explicitly want recount-unify to build a single study
66 | #this is usually true only when you want to skip producing recount3 formatted data
67 | #e.g. you only want Snaptron-ready output
68 | SINGLE_STUDY_ONLY=$9
69 | if [[ ! 
-z $SINGLE_STUDY_ONLY ]]; then 70 | MULTI_STUDY= 71 | fi 72 | 73 | INPUT_FROM_PUMP_DIR=$WORKING_DIR_HOST/input_from_pump 74 | mkdir -p $INPUT_FROM_PUMP_DIR 75 | 76 | #NOTE: the following assumes the unifier is being run on the *same filesystem* as the pump 77 | #as it makes hard links to the pump output 78 | #make sure input data is properly organized for the Unifier 79 | #assumes an original output format: $INPUT_DIR_HOST/sampleID_att0/sampleID!studyID!*.manifest 80 | #we can skip this if $SKIP_FIND is set in the running environment 81 | #../geuvadis_small_output/ERR188431_att0/ERR188431!ERP001942!hg38!sra.align.log 82 | if [[ ! -z $MULTI_STUDY && -z $SKIP_FIND ]]; then 83 | find $INPUT_DIR_HOST -name '*!*' | perl -ne 'BEGIN { $run_id=1; } $working_dir="'$INPUT_FROM_PUMP_DIR'"; chomp; $f=$_; @f=split(/\//,$f); $fm=pop(@f); $original=join("/",@f); $run_dir=pop(@f); @f2=split(/!/,$fm); $sample=shift(@f2); if(!$h{$sample}) { $h{$sample}=$run_id++; } $i=$h{$sample}; $study=shift(@f2); $study=~/(..)$/; $lo1=$1; $sample=~/(..)$/; $lo2=$1; $parent=join("/",@f); $newsub="$working_dir/$lo1/$study/$lo2/$sample"; $i++; $run_dir=~s/(_att\d+)$/_in$i$1/; `mkdir -p $newsub/$run_dir ; ln -f $f $newsub/$run_dir/ ; touch $newsub/$run_dir.done`;' 84 | fi 85 | 86 | #inside container mount for REF files 87 | export REF_DIR=/container-mounts/ref 88 | 89 | REF_DIR_HOST=$REF_DIR_HOST/$REF"_unify" 90 | 91 | #human 92 | if [[ $REF == 'hg38' ]]; then 93 | export LIST_OF_ANNOTATIONS='G026,G029,R109,F006,ERCC,SIRV' 94 | #mouse 95 | else 96 | if [[ $REF == 'grcm38' ]]; then 97 | export LIST_OF_ANNOTATIONS='M023,ERCC,SIRV' 98 | else 99 | echo "ERROR: unrecognized organism: $REF, exiting" 100 | exit 1 101 | fi 102 | fi 103 | 104 | #generic names are used for the organism specific REF files upstream 105 | #so just need to assign them to the ENV vars expected by the container 106 | export ANNOTATED_JXS=$REF_DIR/annotated_junctions.tsv.gz 107 | export EXON_COORDINATES_BED=$REF_DIR/exons.w_header.bed.gz 108 | export EXON_REJOIN_MAPPING=$REF_DIR/disjoint2exons.bed 109 | export GENE_REJOIN_MAPPING=$REF_DIR/disjoint2exons2genes.bed 110 | export GENE_ANNOTATION_MAPPING=$REF_DIR/disjoint2exons2genes.rejoin_genes.bed 111 | export REF_FASTA=$REF_DIR/recount_pump.fa 112 | export REF_SIZES=$REF_DIR/recount_pump.chr_sizes.tsv 113 | export EXON_BITMASKS=$REF_DIR/exon_bitmasks.tsv 114 | export EXON_BITMASK_COORDS=$REF_DIR/exon_bitmask_coords.tsv 115 | 116 | export INPUT_DIR=/container-mounts/input 117 | export WORKING_DIR=/container-mounts/working 118 | export REF_DIR=/container-mounts/ref 119 | 120 | export RECOUNT_CPUS=$NUM_CPUS 121 | 122 | 123 | #do some munging of the passed in sample IDs + optional metadata files 124 | sample_id_manifest_fn=$(basename $SAMPLE_ID_MANIFEST_HOST) 125 | #first copy the full original samples manifest into a directory visible to the container 126 | cp $SAMPLE_ID_MANIFEST_HOST $WORKING_DIR_HOST/$sample_id_manifest_fn 127 | export SAMPLE_ID_MANIFEST_HOST_ORIG=$WORKING_DIR_HOST/$sample_id_manifest_fn 128 | export SAMPLE_ID_MANIFEST_HOST=$WORKING_DIR_HOST/ids.input 129 | #now cut out just the first 2 columns (study, sample_id), expecting a header 130 | cut -f 1,2 $SAMPLE_ID_MANIFEST_HOST_ORIG | tail -n+2 > $SAMPLE_ID_MANIFEST_HOST 131 | export SAMPLE_ID_MANIFEST=$WORKING_DIR/ids.input 132 | export SAMPLE_ID_MANIFEST_ORIG=$WORKING_DIR/$sample_id_manifest_fn 133 | 134 | export ORGANISM_REF=$REF 135 | 136 | perl -e 'print
"SAMPLE_ID_MANIFEST\nREF_DIR\nSAMPLE_ID_MANIFEST_ORIG\nRECOUNT_CPUS\nWORKING_DIR\nINPUT_DIR\nEXON_BITMASK_COORDS\nEXON_BITMASKS\nREF_SIZES\nREF_FASTA\nGENE_ANNOTATION_MAPPING\nGENE_REJOIN_MAPPING\nEXON_REJOIN_MAPPING\nEXON_COORDINATES_BED\nANNOTATED_JXS\nLIST_OF_ANNOTATIONS\nPROJECT_SHORT_NAME\nPROJECT_ID\nORGANISM_REF\nMULTI_STUDY\nSNPS_FILE_FOR_GENOTYPING\nPROTECTED";' > $WORKING_DIR_HOST/env.list 137 | 138 | docker run --rm --user $RU_UID --name runifer --env-file $WORKING_DIR_HOST/env.list --volume $INPUT_FROM_PUMP_DIR:$INPUT_DIR --volume $WORKING_DIR_HOST:${WORKING_DIR}:rw --volume $REF_DIR_HOST:${REF_DIR}:rw $image_name /bin/bash -c "source activate recount-unify && /recount-unify/workflow.bash" 139 | 140 | if [[ "$RU_USER" != "root" ]]; then 141 | #putting all relevant final output files in one directory 142 | mkdir -p $WORKING_DIR_HOST/run_files 143 | pushd $WORKING_DIR_HOST 144 | ls | fgrep -v run_files | xargs -I {} mv {} run_files/ 145 | pushd $WORKING_DIR_HOST/run_files 146 | mv ids.tsv junctions.* lucene* samples.* gene_sums_per_study exon_sums_per_study junction_counts_per_study metadata $WORKING_DIR_HOST/ 147 | popd 148 | popd 149 | fi 150 | -------------------------------------------------------------------------------- /dbgap/README.md: -------------------------------------------------------------------------------- 1 | This page documents the specific issues with downloading dbGaP (protected) data from SRA. 2 | 3 | This does NOT cover TCGA/CCLE/TARGET data from the GDC. That is an entirely separate process covered [here](https://github.com/langmead-lab/monorail-external/blob/master/gdc/README.md). 4 | 5 | For the general approach to downloading data from SRA, please see the [Download data from SRA](https://github.com/langmead-lab/monorail-external/blob/master/sra/README.md) page first. 6 | 7 | 8 | While the download/convert mechanisms are the same for public and protected data (as documented in the page referenced above), the additional nuances of protected data can potentially add significant trouble to the process. 9 | 10 | # Setting up your dbGaP project secure key 11 | First, you need to "import" the correct key for SRA-tools to use via `vdb-config` in the compute environment where you're doing the downloading. This is a *per-project* key, so you'll have a different key for GTEx than for some other dbGaP project. 12 | 13 | This project key should be downloadable from dbGaP after 1) logging in using your eRA/other dbGaP username/password and 2) gaining access to a specific dbGaP protected project (e.g. GTEx). The exact flow is: 14 | 15 | * Click on the "dbGaP Data Browser" from the dbGaP main page 16 | * Click the "My Projects" sub-tab under the "Authorized Access" tab (this should be the default); you should then see a list of projects your dbGaP user is authorized to download from 17 | * To the far right of the project you want to download from there should be a list of links; click the last one, "get dbGaP repository key" 18 | 19 | The key filename will look something like: 20 | `prj_8716_D19642.ngc` 21 | 22 | where `8716` denotes the project ID; the file contains what is essentially a password string (mine has 16 characters). 23 | 24 | MAKE SURE YOU `chmod` THE FILE PERMISSIONS TO BE READ/WRITABLE ONLY BY YOU! 25 | 26 | This is the key file to import into SRA Tools. 27 | You should use `vdb-config -i` to 1) import the key file and 2) update the path of the destination where you plan to at least initially download the SRA files.
28 | 29 | Once imported, it will create a new file in the config directory, e.g.: 30 | `$HOME/.ncbi/dbGap-8716.enc_key` 31 | 32 | Once you've run `vdb-config -i` initially, you can also directly edit `$HOME/.ncbi/user-settings.mkfg` to at least change paths, despite the warning not to do this. 33 | 34 | UPDATE 2022-01-26 35 | Of critical importance for being able to download and avoid the `Access denied...` errors illustrated in the next section is ensuring that the `$HOME/.ncbi/user-settings.mkfg` file contains something similar to the following for the protected study of interest, referenced by your request ID, in our example `dbGaP-30495` (this is for prefetch ~2.9.x): 36 | 37 | ```/repository/user/protected/dbGaP-30495/apps/file/volumes/flat = "files" 38 | /repository/user/protected/dbGaP-30495/apps/sra/volumes/sraFlat = "sra" 39 | /repository/user/protected/dbGaP-30495/cache-enabled = "true" 40 | /repository/user/protected/dbGaP-30495/description = "" 41 | /repository/user/protected/dbGaP-30495/download-ticket = "C0DFE10E-D79C-45E3-BCE3-6EB1941EEFFB" 42 | /repository/user/protected/dbGaP-30495/encryption-key-path = "/home/ubuntu/.ncbi/dbGap-30495.enc_key" 43 | /repository/user/protected/dbGaP-30495/root = "" 44 | ``` 45 | 46 | You must have all of those entries after you initially import the project `ngc` key. 47 | 48 | # Determining the correct project filesystem path to download to 49 | Second, you need to be careful about the filesystem path you download/convert the protected data to/in using `prefetch` and `fastq-dump`. 50 | 51 | If you're not, you *will* receive errors of the type: 52 | ``` 53 | prefetch.2.9.1 err: query unauthorized while resolving query within virtual file system module - failed to resolve accession 'SRR1440541' - Access denied - please request permission to access phs000424/GRU in dbGaP ( 403 ) 54 | ``` 55 | (substituting the version of prefetch/run_accession/study_accession you're using) 56 | 57 | This is referenced in this issue: 58 | https://github.com/ncbi/sra-tools/issues/9 59 | 60 | and appears very briefly in the dbGap-specific section of the SRA-tools official wiki (near the end): 61 | https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data 62 | 63 | The best approach is to initially configure SRA-tools to use the actual path you want to download the dbGaP data to; e.g. probably a filesystem which is 1) large and 2) performant for concurrent read/writes (such as a Lustre-based FS). 64 | 65 | * The directory referenced in the SRA-tool configuration (e.g. via `vdb-config -i`) needs to be the parent of *all* download directories/files for the specific dbGaP project 66 | * The directory cannot be NFS mounted 67 | 68 | Contrary to my previous understanding, the path in the configuration does NOT need to have the `dbGaP-<project_id>` prefix as part of its path. 69 | 70 | You may also want to use a symlink. 71 | 72 | ## Symlinks 73 | This does work; however, the non-NFS FS requirement still applies and it still needs to be the root of all downloads/conversions. 74 | 75 | This becomes slightly trickier when run within a container (Docker/Singularity). 76 | 77 | The way to do this in a container is to make the "root" path referenced in the `vdb-config` for the specific dbGaP project a symlink to a path *within* the container where the data will be saved. 78 | 79 | The path will be marked as broken by the filesystem *outside* of the container, but as long as it's valid within the container this is fine.
This path can then separately be mapped via the container to a real filesystem path elsewhere. 80 | 81 | An example of this on a local HPC at JHU (MARCC) for GTEx is (where `$HOME` is my MARCC home directory): 82 | 83 | `$HOME/ncbi/dbGaP-8716` 84 | 85 | which in turn is symlinked to: 86 | 87 | `/home/recount-pump/dbGaP-8716` 88 | 89 | which only resolves inside a properly configured *Singularity* container. 90 | This may not work with Docker due to home directories not being automatically mounted in Docker containers. 91 | 92 | For Docker, you may need to reconfigure SRA Tools via `vdb-config -i` (or by directly editing the config file as above) to ensure that the key file path (e.g. `$HOME/.ncbi/dbGap-8716.enc_key`) and download data root directory (e.g. `/storage/cwilks/recount-pump/recount/dbGaP-8716` below) are on a path visible to the Docker container. 93 | 94 | In the Monorail (recount-pump portion) pipeline, the `creds/cluster.ini` for this example GTEx run on MARCC looks like: 95 | ``` 96 | ... 97 | temp_base=/storage/cwilks/recount-pump/recount/dbGaP-8716 98 | input_base=/storage/cwilks/recount-pump/recount/input 99 | output_base=/storage/cwilks/recount-pump/recount/output 100 | ref_mount=/container-mounts/recount/ref 101 | temp_mount=/container-mounts/recount/dbGaP-8716 102 | input_mount=/container-mounts/recount/input 103 | output_mount=/container-mounts/recount/output 104 | ... 105 | ``` 106 | (parts of the full paths are obscured for security reasons) 107 | 108 | # SRA/vdb-config Problems 109 | Listed here, organized by error, are a few problems you might encounter when trying to get dbGaP downloads to work via SRA tools. 110 | 111 | ## Ongoing access denied errors 112 | 113 | First, I've noticed on some systems (e.g. JHPCE) that unless your current working directory is *under* the authorized download directory when you run the `prefetch` command, the download will still result in an `access denied` error. This is not true on MARCC, even for the same version of SRA-toolkit. 114 | 115 | The fix on these systems is to temporarily `cd` into the authorized download directory, run the `prefetch` command, and then `cd` back to your original working directory (`pushd ; prefetch ... ; popd` is a convenient way to do this). 116 | 117 | If you think you've done everything above correctly but are still getting an `...Access denied - please request permission to access...` error, re-download and re-import your dbGaP key; it may be out of date or corrupted. 118 | To do this cleanly, it's best to completely remove your current SRA tools/vdb-config setup and start from scratch, then import the new key as if for the first time; otherwise, vdb-config may not update anything. 119 | 120 | ## Re-creating SRA tools config directory error 121 | 122 | While resetting your SRA tools configuration (as in the problem above), you might get an error about setting config in `$HOME`. This appears to be related to the directory permissions of the `$HOME/.ncbi` directory; it's best to remove it, recreate it with default permissions, and then try `vdb-config -i` again.
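A minimal reset sketch along those lines (assuming the default config location under `$HOME/.ncbi`; the key filename is an example only, use your own project's `.ngc` file):

```
rm -rf $HOME/.ncbi                  # remove the existing SRA tools configuration entirely
mkdir $HOME/.ncbi                   # recreate it with default permissions
chmod 600 prj_8716_D19642.ngc       # make sure the key is readable/writable only by you
vdb-config -i                       # then re-import the project .ngc key as if for the first time
```

Recent sra-tools releases may also support a non-interactive import (e.g. `vdb-config --import <key.ngc>`); check `vdb-config --help` for your installed version.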
123 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # monorail-external 2 | 3 | For a record of recent changes, please see the [CHANGELOG](https://github.com/langmead-lab/monorail-external/blob/master/CHANGELOG.md) 4 | 5 | For convenience, the latest stable versions of the images are: 6 | 7 | * Pump: `1.1.3` (as of 2023-06-03) https://quay.io/broadsword/recount-pump?tab=tags 8 | * Unify: `1.1.1` (as of 2022-10-05) https://quay.io/repository/broadsword/recount-unify?tab=tags 9 | 10 | We *strongly* suggest all users update their Unify image to 1.1.0 (or later) due to the rejoin gene collision bug fixed on 2022-02-19. 11 | Also, any studies unified with Unifier images from before that date should be re-unified with the updated image (SHA256 2a1b0cfa005a or later). 12 | Please see: 13 | https://github.com/langmead-lab/monorail-external/tree/master/fixes#rejoin-gene-collision-fix-2022-02-19 14 | 15 | If you do use Unifier version 1.1.0 or later, please ensure you have all additional reference-related files as well, as these were expanded with that release: 16 | https://github.com/langmead-lab/monorail-external/commit/646c59124d546da63cbb73356273bb174b2a63ea 17 | 18 | The full source for recount (both pump and unify) is now public: 19 | 20 | https://github.com/langmead-lab/recount-pump 21 | 22 | https://github.com/langmead-lab/recount-unify 23 | 24 | If you find Monorail and/or recount3 useful, please cite this paper: 25 | 26 | Wilks, C., Zheng, S.C., Chen, F.Y. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol 22, 323 (2021). https://doi.org/10.1186/s13059-021-02533-6 27 | 28 | While we don't control (or specifically support) the following, users may find it useful: 29 | 30 | David McGaughey (@davemcg) has graciously made public both [notes/example](https://github.com/langmead-lab/monorail-external/issues/10) and a wrapper Snakemake [script](https://github.com/davemcg/Snakerail) to run monorail, which works around some existing issues with the current implementation. 31 | 32 | ## Summary 33 | 34 | This is for helping potential users of the Monorail RNA-seq processing pipeline (alignment/quantification) get started running their own data through it. 35 | 36 | Caveat emptor: both the Monorail pipeline itself and this repo are a work in progress, not the final product. 37 | 38 | If you're reading this and decide to use the pipeline, that's great, but you are beta testing it. 39 | 40 | Please file issues here as you find them. 41 | 42 | Monorail is split into 2 parts: 43 | 44 | * recount-pump 45 | * recount-unify 46 | 47 | `recount` comes from the fact that Monorail is the way that the data in `recount3+` is refreshed. 48 | 49 | However, Monorail also creates data for https://github.com/ChristopherWilks/snaptron 50 | 51 | ## Requirements 52 | 53 | * Container platform (Singularity) 54 | * Pre-downloaded (or pre-built) genome-of-choice reference indexes (e.g. HG38 or GRCM38), see next section of the README for more details 55 | * List of SRA accessions to process or locally accessible file paths of runs to process 56 | * Computational resources (memory, CPU, disk space) 57 | 58 | You can specify the number of CPUs to use, but the amount of memory used will be dictated by how large the STAR reference index is. 59 | For human it's *30 GB of usable RAM*.
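If you're not sure what a given machine offers, a quick check (Linux; assumes the standard `nproc`, `free`, and `df` utilities, and an illustrative working-directory path) is sketched below; the per-process CPU notes follow.

```
nproc                         # number of available CPU cores
free -g                       # total/available RAM in GB (human STAR index needs ~30 GB usable)
df -h /path/to/working/dir    # free disk space on the filesystem you'll run in
```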
60 | 61 | Multiple CPUs (cores/threads) are used by the following processes run within the pipeline: 62 | 63 | * STAR (uses all CPUs given) 64 | * Salmon (up to 8 CPUs given) 65 | * parallel-fastq-dump (up to 4 CPUs given) 66 | * bamcount (up to 4 CPUs given) 67 | 68 | Snakemake itself will parallelize the various steps in the pipeline if they can be run independently and are not taking all the CPUs. 69 | 70 | The amount of disk space will be run-dependent, but typically varies from tens of MBs to hundreds of MBs per *run accession* (for human/mouse). 71 | 72 | ## Getting Reference Indexes 73 | 74 | You will need to either download or pre-build the reference index files used in the Monorail pipeline, including the STAR, Salmon (transcriptome), and HISAT2 indexes. 75 | 76 | Reference indexes + annotations are already built/extracted for human (HG38, Gencode V26) and mouse (GRCM38, Gencode M23). 77 | 78 | For human hg38, `cd` into the path you will use for the `$RECOUNT_REF_HOST` path in the `singularity/run_recount_pump.sh` runner script and then run this script from the root of this repo: 79 | 80 | `get_human_ref_indexes.sh` 81 | 82 | Similarly for mouse GRCM38, do the same as above but run: 83 | 84 | `get_mouse_ref_indexes.sh` 85 | 86 | For the purpose of building your own reference indexes, the versions of the 3 tools that use them in recount-pump are: 87 | 88 | * STAR 2.7.3a 89 | * Salmon 0.12.0 90 | * HISAT2 2.1.0 91 | 92 | For the unifier, run the `get_unify_refs.sh` script with either `hg38` or `grcm38` as the one argument. 93 | 94 | ## Pump (per-sample alignment stage) 95 | 96 | You need to have Singularity available; I'll be using Singularity 2.6.0 here because it's what we have been running. 97 | 98 | Singularity versions 3.x and up will probably work, but I haven't tested them extensively. 99 | 100 | We support 2 modes of input: 101 | 102 | * downloading a sequence run from SRA 103 | * local FASTQ files 104 | 105 | For local runs, both gzipped and uncompressed FASTQs are supported, as well as paired/single ended runs. 106 | 107 | The example script below assumes the recount-pump Singularity image is already downloaded/converted and is present in the working directory. 108 | e.g. `recount-rs5-1.0.6.simg` 109 | 110 | Check the quay.io listing for up-to-date Monorail Docker images (which can be converted into Singularity images): 111 | 112 | https://quay.io/repository/benlangmead/recount-rs5?tab=tags 113 | 114 | As of 2020-10-20 version `1.0.6` is a stable release. 115 | 116 | ### Conversion from Docker to Singularity 117 | 118 | We store versions of the monorail pipeline as Docker images in quay.io; however, they can easily be converted to Singularity images once downloaded locally: 119 | 120 | ```singularity pull docker://quay.io/benlangmead/recount-rs5:1.0.6``` 121 | 122 | will result in a Singularity image file in the current working directory: 123 | 124 | `recount-rs5-1.0.6.simg` 125 | 126 | If using the Docker version of the `recount-pump` image, supply the docker URI+version in the pump commands below instead of the path to the singularity file, e.g. `quay.io/benlangmead/recount-rs5:1.0.6` instead of `/path/to/recount-pump-singularity.simg`. 127 | 128 | NOTE: any host filesystem path mapped into a running container *must not* be a symbolic link, as the symlink will not be able to be followed within the container. 129 | 130 | Also, you will need to set the `$RECOUNT_HOST_REF` path in the script to wherever you download/build the relevant reference indexes (see below for more details).
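One way to guard against the symlink restriction is to resolve paths to their physical locations on the host before handing them to the runner script. A minimal sketch, assuming GNU `readlink` is available and reusing the SRA example arguments shown below (all paths are placeholders):

```
# resolve any symlinks to the physical path before it gets mapped into the container
REFS_REAL=$(readlink -f /path/to/references)
/bin/bash run_recount_pump.sh /path/to/recount-pump-singularity.simg SRR390728 SRP020237 hg38 10 "$REFS_REAL"
```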
131 | 132 | 133 | ### SRA input 134 | 135 | All you need to provide is the run accession of the sequencing run you want to process via monorail: 136 | 137 | Example: 138 | 139 | `/bin/bash run_recount_pump.sh /path/to/recount-pump-singularity.simg SRR390728 SRP020237 hg38 10 /path/to/references` 140 | 141 | The `/path/to/references` is the full path to wherever the appropriate reference getter script put them. 142 | Note: this path should not include the final subdirectory named for the reference version e.g. `hg38` or `grcm38`. 143 | 144 | This will start up a container and download the SRR390728 run accession (paired) from the study SRP020237 using up to 10 CPUs/cores. 145 | 146 | 147 | ### Local input 148 | 149 | You will need to provide a label/ID for the dataset (in place of "my_local_run") and the path to at least one FASTQ file. 150 | 151 | Example: 152 | 153 | Download the following two, tiny FASTQ files: 154 | 155 | http://snaptron.cs.jhu.edu/data/temp/SRR390728_1.fastq.gz 156 | 157 | http://snaptron.cs.jhu.edu/data/temp/SRR390728_2.fastq.gz 158 | 159 | ``` 160 | /bin/bash run_recount_pump.sh /path/to/recount-pump-singularity.simg SRR390728 local hg38 20 /path/to/references /path/to/SRR390728_1.fastq.gz /path/to/SRR390728_2.fastq.gz SRP020237 161 | ``` 162 | 163 | This will start up a container, attempt to hardlink the fastq filepaths into a temp directory, and process them using up to 20 CPUs/cores. 164 | 165 | Important: the script assumes that the input fastq files reside on the same filesystem as the working directory; this is required because the script *hardlinks* the files for access by the container (the container can't follow symlinks). 166 | 167 | The 2nd mate's file path is optional, as is gzip compression. 168 | The pipeline uses the `.gz` extension to figure out if gzip compression is being used or not. 169 | 170 | The final parameter is the *actual* study name, `SRP020237`, since we're overloading the normal study position with `local`. 171 | 172 | ### Additional Options 173 | 174 | As of 1.0.5 there is some support for altering how the workflow is run with the following environment variables: 175 | 176 | * KEEP_BAM=1 177 | * KEEP_FASTQ=1 178 | * NO_SHARED_MEM=1 179 | 180 | An example with all three options using the test local example: 181 | 182 | ``` 183 | export KEEP_BAM=1 && export KEEP_FASTQ=1 && export NO_SHARED_MEM=1 && /bin/bash run_recount_pump.sh /path/to/recount-pump-singularity.simg SRR390728 local hg38 20 /path/to/references /path/to/SRR390728_1.fastq.gz /path/to/SRR390728_2.fastq.gz 184 | ``` 185 | 186 | This will keep the first pass alignment BAM, the original FASTQ files, and will force STAR to be run in NoSharedMemory mode with respect to its genome index for the first pass alignment. 187 | 188 | 189 | ## Unifier (aggregation over per-sample pump outputs) 190 | 191 | Run `get_unify_refs.sh <reference_version>` to get the appropriate reference-related files for the unifier (this is in addition to the reference-related files for the pump downloaded above), where `<reference_version>` is currently either `hg38` for human or `grcm38` for mouse. 192 | 193 | WARNING: the current version of the unifier reference-related files *only* work with the recount-unifier image version 1.0.4; earlier versions of either the image or the reference-related files *will not work* with newer versions of the other. 194 | 195 | Follow the same process as for recount-pump (above) to convert to singularity.
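For example, converting the unifier image might look like the following (assuming the `1.1.1` tag listed at the top of this README; substitute whichever tag you intend to run):

```singularity pull docker://quay.io/broadsword/recount-unify:1.1.1```

which, as with the pump image above, should leave a Singularity image file in the current working directory that can then be passed to `run_recount_unify.sh` below.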
196 | 197 | The unifier aggregates the following cross-sample outputs: 198 | 199 | * gene sums 200 | * exon sums 201 | * junction split read counts 202 | 203 | The Unifier assumes the directory/file layout and naming of default runs of recount-pump, where a single sample's output is stored like this under one main parent directory: 204 | 205 | `<pump_output>/<sample_id>_att0/` 206 | 207 | e.g. 208 | 209 | `pump_output/sample1_att0/` 210 | `pump_output/sample2_att0/` 211 | `pump_output/sample3_att0/` 212 | .... 213 | 214 | The `/path/to/pump/output` argument below references the path to the `<pump_output>` directory above. This path *must* be on the same filesystem as the unifier's working directory (`path/to/working/directory` below). Also, it must *not* be a symbolic link itself. This is because the unifier script will use `find` (w/o `-L`) to hardlink the pump's output files into the expected directory hierarchy it needs to run. In addition, the `/path/to/pump/output` must not be a parent directory of the unifier's working directory, or else the unifier will exhibit undefined behavior. The two directories must be on the same filesystem, but they should be kept separate: neither should be a subdirectory of the other. 215 | 216 | An example of the two might be: 217 | 218 | * `/data/recount-pump/output/study1` where the output from recount-pump run on "study1" is stored as defined above 219 | * `/data/recount-unify/study1_working` where the unifier is run for "study1" 220 | 221 | 222 | To run the Unifier: 223 | 224 | ``` 225 | /bin/bash run_recount_unify.sh /path/to/recount-unifier-singularity.simg /path/to/references /path/to/working/directory /path/to/pump/output /path/to/sample_metadata.tsv 226 | ``` 227 | 228 | `/path/to/references` here may be the same path as used in recount-pump, but it must contain an additional directory: `<reference_version>_unify`, 229 | 230 | where `reference_version` is either `hg38` or `grcm38`. 231 | 232 | `sample_metadata.tsv` *must* have a header line and at least the following first 2 columns in exactly this order (it can have as many additional columns as desired): 233 | 234 | ``` 235 | study_id<TAB>sample_id<TAB>... 236 | <study_id><TAB><sample_id><TAB>... 237 | ... 238 | ``` 239 | 240 | `<study_id>` and `<sample_id>` can be anything that is unique within the set; however, `<study_id>` must match what the study was when `recount-pump` was run, or the filenames of the output files from pump must be changed to reflect the different `<study_id>`, e.g. 241 | 242 | pump file prefix: `'ERR248710!ERP002416!grcm38!sra'` 243 | 244 | either run with `<study_id>` set to `ERP002416` or change all the filenames to have the prefix `'ERR248710!<study_id>!grcm38!sra'` for all affected samples. 245 | 246 | Finally, you (now) must pass in a project short name (string) and a project_id (integer) to be compatible with recount3. 247 | The project short name should be unique for your organization, or 'sra' if you're pulling from the Sequence Read Archive. 248 | The project_id should also be unique among projects in your organization and must be between 100 and 250 (exclusive).
249 | 250 | Example: 251 | 252 | `sra:101` 253 | 254 | If you only want to run one of the 2 steps in the unifier (either gene+exon sums OR junction counts), you can skip the other operation: 255 | 256 | ```export SKIP_JUNCTIONS=1 && /bin/bash run_recount_unify.sh ...``` 257 | to run only gene+exon sums 258 | 259 | or 260 | 261 | ```export SKIP_SUMS=1 && /bin/bash run_recount_unify.sh ...``` 262 | to run only junction counts 263 | 264 | ### Unifier outputs 265 | 266 | #### Multi study mode (default) 267 | 268 | recount3 compatible sums/counts matrix output directories are in the `/path/to/working/directory` under: 269 | 270 | * `gene_sums_per_study` (content goes into `gene_sums` in recount3 layout section below) 271 | * `exon_sums_per_study` (content goes into `exon_sums` in recount3 layout section below) 272 | * `junction_counts_per_study` (content goes into `junctions` in recount3 layout section below) 273 | 274 | The first 2 are run together and then the junctions are aggregated. 275 | 276 | The outputs are further organized by: 277 | `study_loworder/study/` 278 | 279 | where `study_loworder` is the last 2 characters of the study ID, e.g. if the study is ERP001942, then the output for gene sums will be saved under: 280 | `gene_sums_per_study/42/ERP001942` 281 | 282 | Additionally, the Unifier outputs Snaptron-ready junction databases and indices: 283 | 284 | * `junctions.bgz` 285 | * `junctions.bgz.tbi` 286 | * `junctions.sqlite` 287 | 288 | `rail_id`s are also created for every sample_id submitted in the `/path/to/sample_metadata.tsv` file and stored in: 289 | 290 | `samples.tsv` 291 | 292 | Further, the Unifier will generate Lucene metadata indices based on the `samples.tsv` file for Snaptron: 293 | 294 | * `samples.fields.tsv` 295 | * `lucene_full_standard` 296 | * `lucene_full_ws` 297 | * `lucene_indexed_numeric_types.tsv` 298 | 299 | Taken together, the above junctions block-gzipped files & indices, along with the Lucene indices, are enough for a minimally viable Snaptron instance. 300 | 301 | Intermediate and log files for the Unifier run can be found in `run_files`. 302 | 303 | ### Loading custom Unifier runs into recount3 304 | 305 | recount3 http://bioconductor.org/packages/release/bioc/html/recount3.html requires a specific directory/path/folder layout to be present, either 1) on a local filesystem from which the R package can load or 2) at a URL using HTTP (not HTTPS). 306 | 307 | We suggest installing the latest version of the recount3 package directly from GitHub (in R, requires Bioconductor): 308 | `remotes::install_github("LieberInstitute/recount3")` 309 | 310 | An example layout that loads into recount3 is rooted here (DO NOT USE ANY DATA AT THIS URL FOR REAL ANALYSES): 311 | http://snaptron.cs.jhu.edu/data/temp/recount3test 312 | 313 | You should browse the subdirectories of that URL as an example of how to lay out your own custom recount3 data directory hierarchy. 314 | 315 | To load that test custom recount3 root (it can be either a URL or a local directory) in R after installing the recount3 package: 316 | 317 | ``` 318 | library(recount3) 319 | recount3_cache_rm() 320 | options(recount3_url = "http://snaptron.cs.jhu.edu/data/temp/recount3test") 321 | hp<-available_projects() 322 | rse_gene = create_rse(hp[hp$project == 'ERP001942',]) 323 | 324 | ``` 325 | You will see warnings about the following metadata files being missing (they'll be errors if on an earlier version of recount3): 326 | ``` 327 | Warning messages: 328 | 1: The 'url' does not exist or is not available.
329 | 2: The 'url' does not exist or is not available. 330 | ``` 331 | 332 | This is expected and will happen with your own custom studies as well. While these files were generated for recount3, they were never part of Monorail proper (i.e. not an automated part). You should still be able to successfully load your custom data into recount3. 333 | 334 | The `gene_sums`, `exon_sums`, and `junctions` directories can be populated by the output of the Unifier (see the layout above) using the naming as output by the unifier, except in the case of the junctions files, where `.all.` and `.unique.` need to be changed to all upper case for recount3 to work with them (this will be fixed shortly). 335 | 336 | The `base_sums` directory can be populated by renamed `*.all.bw` files in the *pump* output, one per sample (the Unifier doesn't do anything with these files). 337 | 338 | To populate the `annotation` directories for each organism, the default recount3 annotation files (fixed as of Unifier version 1.0.4 as noted above) are at these URLs: 339 | 340 | Human Gencode V26: 341 | * http://duffel.rail.bio/recount3/human/new_annotations/exon_sums/human.exon_sums.G026.gtf.gz 342 | * http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz 343 | 344 | Mouse Gencode V23: 345 | * http://duffel.rail.bio/recount3/mouse/new_annotations/exon_sums/mouse.exon_sums.M023.gtf.gz 346 | * http://duffel.rail.bio/recount3/mouse/new_annotations/gene_sums/mouse.gene_sums.M023.gtf.gz 347 | 348 | Simply replace `G026` (or `M023` for mouse) in the URLs above with one or more of the following to get the rest of the annotations (if so desired): 349 | 350 | Human: 351 | * `G029` (Gencode V29) 352 | * `R109` (RefSeq release 109) 353 | * `F006` (FANTOM-CAT release 6) 354 | * `SIRV` (synthetic spike-in alt. splicing isoforms) 355 | * `ERCC` (synthetic spike-in genes) 356 | 357 | Mouse: 358 | * `SIRV` (synthetic spike-in alt. splicing isoforms) 359 | * `ERCC` (synthetic spike-in genes) 360 | 361 | Finally, the `<data_source>.recount_project.MD.gz` (e.g. `sra.recount_project.MD.gz`) file is cross-study for a given data_source, so it falls outside of the running of any given study. If you run another study through Monorail (assuming the same data source, e.g. "internal" or your own version of `sra`), you'd want to append the new runs/samples to `<data_source>.recount_project.MD.gz` rather than overwrite it, as it's the main list of all runs/studies in the data source used by recount3 to load them. 362 | 363 | ### Download from SRA/dbGaP/GDC Details 364 | 365 | Monorail already downloads from SRA automatically if given an SRA accession; however, for dbGaP protected downloads: 366 | 367 | * Need to have the study-specific dbGaP key file (`prj_<project_id>.ngc`) 368 | * Have the key file in a container accessible path (e.g. `/path/to/monorail/refs/.ncbi/`) 369 | * Specify this as an environment variable before running `run_recount_pump.sh`: `export NGC=/container-mounts/recount/ref/.ncbi/prj_<project_id>.ngc` 370 | 371 | 372 | Direct GDC downloads are not currently supported by Monorail-external.
However, following the instructions below, you can download the data to a local filesystem separately and then run Monorail-external on the files locally: 373 | 374 | Details for downloading from the GDC (TCGA/TARGET) are [here](https://github.com/langmead-lab/monorail-external/blob/master/gdc/README.md) 375 | 376 | Pre-SRAToolKit 3.0.0 info for SRA and dbGaP downloads: 377 | 378 | Specific help in downloading from SRA can be found [here](https://github.com/langmead-lab/monorail-external/blob/master/sra/README.md) 379 | 380 | Additional details for dbGaP are [here](https://github.com/langmead-lab/monorail-external/blob/master/dbgap/README.md) 381 | 382 | 383 | ### [Historical background] Layout of links to recount-pump output for recount-unifier 384 | 385 | If compatibility with recount3 gene/exon/junction matrix formats is required, the output of recount-pump needs to be organized in a specific way for the Unifier to properly produce per-study level matrices as in recount3. 386 | 387 | For example, if you find that you're getting blanks instead of actual integers in the `all.exon_bw_count.pasted.gz` file, it's likely a sign that the input directory hierarchy was not laid out correctly. 388 | 389 | Assuming your top level directory for input is called `pump_output_full`, the expected directory hierarchy for each sequencing run/sample is: 390 | 391 | `pump_output_full/study_loworder/study/sample_loworder/sample/monorail_assigned_unique_sample_attempt_id/` 392 | 393 | e.g. 394 | `pump_output_full/42/ERP001942/25/ERR204925/ERP001942_in3_att0` 395 | 396 | where `ERP001942_in3_att0` contains the original output of recount-pump for the `ERR204925` sample. 397 | 398 | The format of the `monorail_assigned_unique_sample_attempt_id` is `originalsampleid_in#_att0`, where `in#` should be unique across all samples. 399 | 400 | `study_loworder` and `sample_loworder` are *always* the last 2 characters of the study and sample IDs respectively. 401 | 402 | Your study and sample IDs may be very different from the SRA example here, but they should still work in this setup. However, single-letter studies/runs probably won't. 403 | --------------------------------------------------------------------------------