├── README.md
└── scripts
    ├── cleanup.sh
    ├── invokeBwaLambdas.py
    ├── lambda_handler.py
    ├── multidownload.sh
    ├── runPipeline.sh
    ├── runSplit.sh
    ├── runUploadSplitFiles.sh
    └── uploadSplitFiles.sh
/README.md:
--------------------------------------------------------------------------------
# RNA-seq-lambda


## Introduction

The scripts in this repository use AWS Lambda functions to align 640 million reads with bwa in under 3 minutes - a process that takes 20 hours when running the optimised software on a single thread. An entire UMI RNA-seq pipeline converting fastq files to transcript counts completes in under 20 minutes on an m4.4xlarge instance. This is roughly 100x faster than the 30 hours needed for the original unoptimised pipeline and 12x faster than an optimised pipeline run with 16 threads.

The repo contains the scripts used to execute the pipeline. Additional files that need to be uploaded to the S3 bucket are here:

[https://drive.google.com/open?id=1nj6IoltH77i_Ikd04ey-jBcT_QRlpsk4](https://drive.google.com/open?id=1nj6IoltH77i_Ikd04ey-jBcT_QRlpsk4)

This includes the executables and human reference files.

The RNA-seq pipeline itself is described in **Holistic optimization of an RNA-seq workflow for multi-threaded environments** [https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz169/5374759](https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz169/5374759)

## Executables

The executables used in the Lambda workflow are bwa [https://github.com/lh3/bwa](https://github.com/lh3/bwa) and three binaries - umisplit, umimerge_filter, and umimerge_parallel - from [https://github.com/lhhunghimself/LINCS_RNAseq_cpp](https://github.com/lhhunghimself/LINCS_RNAseq_cpp)

### umisplit
This executable demultiplexes the reads and separates them into smaller files (the maximum size can be set by the user). It needs to be compiled on the EC2 instance that will run umisplit.

### umimerge_filter
This converts a SAM-formatted alignment into a hash value based on the gene that the read is aligned to and the position of the alignment. It is run inside the Lambda functions and should be compiled on an instance running Amazon Linux (the OS used by the Lambda functions).

### umimerge_parallel
This dedups the reads with identical barcodes that map to the same position and produces the final set of transcript counts. It needs to be compiled on the EC2 instance that will run umimerge_parallel.
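
To make the filter/dedup division of labour concrete, here is a minimal Python sketch of what the two steps compute. The SAM field positions follow the SAM specification, but the barcode-in-read-name convention and the key format are assumptions for illustration, not the binaries' actual implementation:

```python
from collections import defaultdict

# Sketch of umimerge_filter's role: collapse one SAM record to a
# (barcode, gene, position) key. Assumes umisplit left the UMI/well
# barcode at the end of the read name - an illustrative convention only.
def sam_to_key(sam_line):
    fields = sam_line.rstrip("\n").split("\t")
    qname, rname, pos = fields[0], fields[2], int(fields[3])
    barcode = qname.split(":")[-1]
    return (barcode, rname, pos)

# Sketch of umimerge_parallel's role: reads sharing a key are PCR duplicates,
# so the transcript count is the number of distinct keys per gene.
def count_transcripts(keys):
    unique = defaultdict(set)
    for barcode, gene, pos in keys:
        unique[gene].add((barcode, pos))
    return {gene: len(pairs) for gene, pairs in unique.items()}

sam = "read1:AAGTCC\t0\tGAPDH\t1042\t37\t4M\t*\t0\t0\tACGT\tIIII\n"
print(count_transcripts([sam_to_key(sam), sam_to_key(sam)]))  # {'GAPDH': 1}
```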
## Scripts

### runPipeline.sh
This is the master shell script that runs the pipeline on an EC2 instance and launches the Lambda functions. It times and launches the other component scripts, which are described next.

### multidownload.sh
This script uses multiple threads to download the fastq files from S3 to the EC2 instance. The worker threads divide the files between them by atomically creating one lock directory per file, so no file is downloaded twice.
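
The same lock-directory trick appears in uploadSplitFiles.sh below: `mkdir` is atomic, so exactly one worker succeeds in claiming each work item. A minimal Python sketch of the pattern (the work items are hypothetical):

```python
import os
import shutil
import threading

# Each worker walks the full work list and claims item i by creating
# lock directory "lock<i>"; os.mkdir, like shell mkdir, is atomic.
def worker(worker_id, items, lock_dir):
    for i, item in enumerate(items):
        try:
            os.mkdir(os.path.join(lock_dir, "lock{}".format(i)))
        except FileExistsError:
            continue  # another worker already owns item i
        print("worker {} handling {}".format(worker_id, item))  # the real script runs aws s3 cp here

items = ["Lane1_R1.fastq.gz", "Lane1_R2.fastq.gz", "Lane2_R1.fastq.gz"]
lock_dir = "/tmp/locks.{}".format(os.getpid())  # mirrors /tmp/locks.$$
os.makedirs(lock_dir)
threads = [threading.Thread(target=worker, args=(n, items, lock_dir)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
shutil.rmtree(lock_dir)  # the real scripts do rm -rf $lockDir
```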
### runSplit.sh
This script is a simple wrapper around the umisplit executable on EC2. It is meant to be launched as a background process so that uploading can begin as soon as a split file has been generated.

### runUploadSplitFiles.sh and uploadSplitFiles.sh

These two scripts upload the split files to S3. runUploadSplitFiles.sh looks for complete files generated by umisplit and uploads them; it also checks whether umisplit has finished executing. umisplit signals that a file is ready for transfer, or that it has finished splitting, by writing sentinel files, and runUploadSplitFiles.sh polls for these. Once umisplit is done, it calls uploadSplitFiles.sh to upload the remaining split files to S3 and then terminates. Otherwise it calls uploadSplitFiles.sh to upload the complete files found so far, sleeps for one second, and checks again.
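
A minimal Python sketch of this polling loop, assuming one `.done` sentinel per finished file (the suffix the shell scripts below look for); the paths are illustrative:

```python
import glob
import time

SEQS_DIR = "/home/ubuntu/LINCS/Seqs_local"
ALIGNS_DIR = "/home/ubuntu/LINCS/Aligns"

# A sentinel next to each split file marks it as complete and safe to upload.
def upload_complete_files():
    for done in glob.glob(ALIGNS_DIR + "/*/*.done"):
        split_file = done[:-len(".done")]
        print("would upload", split_file)  # the real script runs aws s3 cp here

while True:
    upload_complete_files()
    n_inputs = len(glob.glob(SEQS_DIR + "/*"))
    n_done = len(glob.glob(ALIGNS_DIR + "/*.done"))  # one top-level sentinel per finished input
    if n_inputs == n_done:  # umisplit has finished every input file
        break
    time.sleep(1)
```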
### invokeBwaLambdas.py
This script launches the Lambda functions, assigning one small demultiplexed fastq file to each function. It monitors the alignment output files on S3 to determine which Lambdas have finished processing. When all files have been aligned, the script exits.

### cleanup.sh
This script removes all the files generated by the pipeline. It is used between duplicate timing runs of runPipeline.sh.

### lambda_handler.py
This Python script is executed by each of the Lambda functions. A JSON payload tells the function which file should be aligned.
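
The payload carries a single key. For example (the split-file path shown is illustrative):

```python
import json

# What invokeBwaLambdas.py sends and lambda_handler.py receives.
payload = json.dumps({"splitFile": "/home/ubuntu/LINCS/Aligns/Lane1/RNAseq_20150409_Lane1_001.fq"})
# invokeBwaLambdas.py: lambdaClient.invoke(FunctionName=..., InvocationType="Event", Payload=payload)
# lambda_handler.py:   splitFile = os.path.basename(event["splitFile"])
```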
--------------------------------------------------------------------------------
/scripts/cleanup.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#remove local and S3 outputs between timing runs
rm -rf Aligns/*
rm -rf Counts/*
rm -rf Seqs_local/*
rm -rf Outputs/*
aws s3 rm s3://myBucket/Outputs/ --recursive
aws s3 rm s3://myBucket/Aligns/ --recursive
--------------------------------------------------------------------------------
/scripts/invokeBwaLambdas.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
#lhhung 013119 - cleaned up code from Dimitar Kumar
#lhhung 031019 - added timing code
import os
import sys
import json
import glob
import time
import boto3
from timeit import default_timer as timer

#returns 1 once every split file has a matching output file on S3
def checkS3Output(splitFiles,startTimes,finishTimes,outputName,s3Files):
    doneFlag=1
    for splitFile in splitFiles:
        if splitFile not in finishTimes:
            if outputName[splitFile] in s3Files:
                finishTimes[splitFile]=timer()
            else:
                doneFlag=0
    return doneFlag

#poll the S3 output prefix until all Lambdas have written results or the timeout expires
def waitOnLambdas(splitFiles,startTimes,finishTimes,timeout=600):
    waitStartTime=timer()
    outputName={}
    for splitFile in splitFiles:
        outputName[splitFile]=os.path.splitext(os.path.basename(splitFile))[0]
    while True:
        s3Files=[]
        s3Lines=os.popen("aws s3 ls s3://myBucket/Outputs/ --recursive | awk '{print $4}'").read().split()
        for s3Line in s3Lines:
            s3Files.append(os.path.splitext(os.path.basename(s3Line))[0])
        if checkS3Output(splitFiles,startTimes,finishTimes,outputName,s3Files) or timer()-waitStartTime > timeout:
            return
        time.sleep(1)

def getSplitFilenames(directory,suffix):
    sys.stderr.write("{}/*.{}\n".format(directory,suffix))
    return glob.glob("{}/*.{}".format(directory,suffix))

#launch one Lambda per split file, passing the file name in the JSON payload
def startLambdas(splitFiles,awsAccessKeyId,awsSecretAccessKey,region,functionName,startTimes):
    lambdaClient = boto3.client('lambda',aws_access_key_id=awsAccessKeyId,aws_secret_access_key=awsSecretAccessKey,region_name=region)
    for splitFile in splitFiles:
        startTimes[splitFile]=timer()
        sys.stderr.write('working on {}\n'.format(splitFile))
        lambdaClient.invoke(FunctionName=functionName,InvocationType="Event",Payload=json.dumps({"splitFile": splitFile}))

def main():
    # Change these parameters
    functionName = "your_function_name"
    awsAccessKeyId = "ABCDEFGHIJKLMNOPQRST"
    awsSecretAccessKey = "SomeAwsSecretAccessKeyShouldGoHere123456"
    region = "us-east-1"

    #where your reads reside
    directory='/home/ubuntu/LINCS/Aligns/*'
    # suffix='fq.gz'
    suffix='fq'
    splitFiles=getSplitFilenames(directory,suffix)
    startTimes={}
    finishTimes={}
    start = timer()
    startLambdas(splitFiles,awsAccessKeyId,awsSecretAccessKey,region,functionName,startTimes)
    sys.stderr.write('Time elapsed for launch is {}\n'.format(timer()-start))
    waitOnLambdas(splitFiles,startTimes,finishTimes)
    sys.stderr.write('Time elapsed for lambdas is {}\n'.format(timer()-start))
    for splitFile in splitFiles:
        if splitFile in finishTimes:
            print('{} {} {} {}'.format(startTimes[splitFile],finishTimes[splitFile],finishTimes[splitFile]-startTimes[splitFile],splitFile))
        else:
            print('timed out waiting for {}'.format(splitFile))

if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/scripts/lambda_handler.py:
--------------------------------------------------------------------------------
#lhhung 013119 - refactored and extended Dimitar Kumar's original script

from subprocess import call
import sys
import os
import boto3
import botocore
import time
import datetime
import json

# A utility function to run a bash command from python
def runCmd(cmd):
    sys.stderr.write("#{}\n".format(cmd))
    call([cmd], shell=True)

#utilities to remove directories and files except those in the whitelist
def removeDirectoriesExcept(rootDirectory,whiteList):
    for directory in os.popen('find {} -mindepth 1 -maxdepth 1 -type d'.format(rootDirectory)).read().split('\n')[0:-1]:
        if directory not in whiteList:
            sys.stderr.write("removing {}\n".format(directory))
            runCmd("rm -rf {}".format(directory))

def removeFilesExcept(rootDirectory,whiteList):
    for myFile in os.popen('find {} -type f'.format(rootDirectory)).read().split('\n')[0:-1]:
        if myFile not in whiteList:
            sys.stderr.write("removing {}\n".format(myFile))
            try:
                os.remove(myFile)
            except Exception as e:
                sys.stderr.write('unable to remove {}\n'.format(myFile))

def downloadFiles(sourceFile,destFile,bucketName,overwrite=True,verbose=True):
    sourceFile=sourceFile.replace("/home/ubuntu/LINCS/","")
    s3 = boto3.resource('s3')
    if overwrite or not os.path.exists(destFile):
        try:
            if verbose:
                sys.stderr.write("Downloading {} to {}\n".format(sourceFile,destFile))
            s3.Bucket(bucketName).download_file(sourceFile, destFile)
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "404":
                sys.stderr.write("The object does not exist\n")
            else:
                raise

# Runs bwa on the given splitFile and pipes the SAM output through filterCmd into outputFile
def runBwa(splitFile,outputFile,filterCmd):
    cmdStr="/tmp/bwa aln -l 24 -t 2 /tmp/Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa /tmp/{} | /tmp/bwa samse -n 20 /tmp/Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa - /tmp/{} | {} > {}".format(splitFile,splitFile,filterCmd,outputFile)
    sys.stderr.write("running cmd:\n{}\n".format(cmdStr))
    runCmd(cmdStr)

def uploadResultsTest(sourceFile,destFile,bucketName):
    sys.stderr.write("cp {} {}\n".format(sourceFile,destFile))

# Uploads the result to the appropriate S3 Outputs folder
def uploadResults(sourceFile,destFile,bucketName):
    s3 = boto3.resource('s3')
    destFile=destFile.replace("/home/ubuntu/LINCS/Aligns","Outputs")
    return s3.meta.client.upload_file(sourceFile, bucketName,destFile)

# Lambda's entry point.
def lambda_handler(event, context):

    #### List of parameters to customize ####

    #bwa doesn't actually need the sequence information - just the name to figure out where the indices are
    #these files are empty to save space - probably should add the chrM.fa file

    fakeFiles=['/tmp/Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa']

    #sourceFiles and directories used in other places
    alignDir='/tmp/Aligns'
    refDir='/tmp/Human_RefSeq'
    barcodeFile="/tmp/barcodes_trugrade_96_set4.dat" #in References/Broad_UMI directory
    erccFile="/tmp/ERCC92.fa"
    symToRefFile="/tmp/refGene.hg19.sym2ref.dat"

    sourceFiles= ["umimerge_filter", "bwa", "Human_RefSeq/chrM.fa", "barcodes_trugrade_96_set4.dat", "ERCC92.fa", "refGene.hg19.sym2ref.dat", "Human_RefSeq/refGene.hg19.txt", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.amb", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.ann", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.bwt", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.fai", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.pac", "Human_RefSeq/refMrna_ERCC_polyAstrip.hg19.fa.sa"]

    #change bucketName as necessary - could pass it through event in json payload
    bucketName = "myBucket"

    #### End parameters list ####

    #get splitFile from json payload - may want to load s3 bucket from here also instead of hardcoding
    fullPathSplitFile = event["splitFile"]
    splitFile=os.path.basename(fullPathSplitFile)

    sys.stderr.write("Running handler for splitFile [{}] at time {}\n"
        .format(
            splitFile,
            datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
        )
    )

    sys.stderr.write("Json dump:\n{}\n".format(json.dumps(event, indent=4, sort_keys=True)))

    #cleanup files - keep RefSeq and binaries if already there
    whiteListFiles=[]
    for sourceFile in sourceFiles:
        destFile='/tmp/'+sourceFile
        whiteListFiles.append(destFile)
    whiteListFiles=whiteListFiles + fakeFiles

    removeDirectoriesExcept('/tmp',['/tmp/Human_RefSeq'])
    removeFilesExcept('/tmp',whiteListFiles)

    #create directories
    for directory in [alignDir,refDir]:
        runCmd('mkdir -p {}'.format(directory))

    #make empty fakeFiles
    for fakeFile in fakeFiles:
        if not os.path.exists(fakeFile):
            runCmd('touch {}'.format(fakeFile))

    #download source files
    for sourceFile in sourceFiles:
        destFile='/tmp/'+sourceFile
        downloadFiles(sourceFile,destFile,bucketName,overwrite=False,verbose=True)

    #download splitFile
    downloadFiles(fullPathSplitFile,'/tmp/' + splitFile, bucketName,overwrite=True,verbose=True)

    #make sure that executables have correct permissions
    for executable in ('/tmp/bwa','/tmp/umimerge_filter'):
        runCmd('chmod +x {}'.format(executable))

    #run bwa
    outputFile='{}/{}.saf'.format(alignDir,os.path.splitext(splitFile)[0])
    filterCmd="/tmp/umimerge_filter -s {} -b {} -e {}".format(symToRefFile,barcodeFile,erccFile)
    #filterCmd="grep -v '^\@'"
    sys.stderr.write("Starting bwa at {}\n".format(datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')))
    runBwa(splitFile,outputFile,filterCmd)

    #upload results
    uploadFile=os.path.dirname(fullPathSplitFile)+'/'+os.path.basename(outputFile)
    sys.stderr.write("Starting upload at {}\n".format(datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')))
    uploadResults(outputFile,uploadFile,bucketName)
    #uploadResultsTest(outputFile,uploadFile,bucketName)
    #write done time
    sys.stderr.write("Finished at {}\n".format(datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')))
--------------------------------------------------------------------------------
/scripts/multidownload.sh:
--------------------------------------------------------------------------------
#!/bin/bash
SEQS_DIR=$1
nThreads=$2
lockDir=/tmp/locks.$$
mkdir -p $lockDir

#each worker claims a file by atomically creating a lock directory for it
runJob(){
    lasti=$((${#dirs[@]} - 1))
    for i in $(seq 0 ${lasti}); do
        if (mkdir $lockDir/lock$i 2> /dev/null); then
            dir=${dirs[$i]}
            echo thread $1 working on $dir
            #log the command, then run it
            echo aws s3 cp s3://myBucket/Seqs/$dir $SEQS_DIR/$dir
            aws s3 cp s3://myBucket/Seqs/$dir $SEQS_DIR/$dir
        fi
    done
    exit
}
#fastq files under Seqs/ (4th column of aws s3 ls)
dirs=( $(aws s3 ls s3://myBucket/Seqs/ | awk '{print $4}') )

for i in $(seq 2 $nThreads); do
    runJob $i &
done
runJob 1 &
wait
rm -rf $lockDir
--------------------------------------------------------------------------------
/scripts/runPipeline.sh:
--------------------------------------------------------------------------------
#!/bin/bash
mkdir -p lambdaLogs
echo "seqs s3 to ec2 start" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
./multidownload.sh /home/ubuntu/LINCS/Seqs_local 12 &> lambdaLogs/download_log
echo "seqs s3 to ec2 finish" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt

echo "start split" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
./runSplit.sh &> lambdaLogs/splitLog &

echo "upload split gz ec2 to s3 begin" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
./runUploadSplitFiles.sh /home/ubuntu/LINCS/Aligns /home/ubuntu/LINCS/Seqs_local s3://myBucket/Aligns/ 7 8 >& lambdaLogs/uploadLog
echo "upload split gz ec2 to s3 finish" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt

echo "invoke lambda begin" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
python3 invokeBwaLambdas.py &> lambdaLogs/lambdaLog
echo "finish lambda" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt

echo "download saf s3 to ec2 begin" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
aws s3 cp s3://myBucket/Outputs Outputs --recursive &> lambdaLogs/downloadSafLog
echo "download saf s3 to ec2 finish" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt

echo "merge sam files begin" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
sudo ./umimerge_parallel -p 0 -f -i RNAseq_20150409 -s References/Broad_UMI/Human_RefSeq/refGene.hg19.sym2ref.dat -e References/Broad_UMI/ERCC92.fa -b References/Broad_UMI/barcodes_trugrade_96_set4.dat -a Outputs -o Counts -t 16 &> lambdaLogs/mergeLog
echo "merge sam files finish" >> lambdaLogs/time.txt
date >> lambdaLogs/time.txt
--------------------------------------------------------------------------------
/scripts/runSplit.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#demultiplex and split the lane fastq files; output goes to Aligns/
./umisplit -s 150000 -d -v -l 16 -m 0 -N 0 -f -o Aligns -t 6 -b References/Broad_UMI/barcodes_trugrade_96_set4.dat Seqs_local/RNAseq_20150409_Lane1_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane1_R2.fastq.gz Seqs_local/RNAseq_20150409_Lane2_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane2_R2.fastq.gz Seqs_local/RNAseq_20150409_Lane3_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane3_R2.fastq.gz Seqs_local/RNAseq_20150409_Lane4_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane4_R2.fastq.gz Seqs_local/RNAseq_20150409_Lane5_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane5_R2.fastq.gz Seqs_local/RNAseq_20150409_Lane6_R1.fastq.gz Seqs_local/RNAseq_20150409_Lane6_R2.fastq.gz
date
--------------------------------------------------------------------------------
/scripts/runUploadSplitFiles.sh:
--------------------------------------------------------------------------------
#!/bin/bash
ALIGNS_DIR=$1
SEQS_DIR=$2
S3Path=$3
#Threads before split is done
NTHREADS1=$4
#Threads after split is done
NTHREADS2=$5

#poll until umisplit has written a .done sentinel for every input file
while true; do
    nseq=`ls $SEQS_DIR | wc -l`
    ndone=`ls ${ALIGNS_DIR}/*.done | wc -l`
    if [ "${nseq}" == "${ndone}" ]; then
        ./uploadSplitFiles.sh $ALIGNS_DIR $S3Path $NTHREADS2
        date
        exit
    fi
    ./uploadSplitFiles.sh $ALIGNS_DIR $S3Path $NTHREADS1
    sleep 1
done
--------------------------------------------------------------------------------
/scripts/uploadSplitFiles.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#uploads every completed split file found in ALIGN_DIR
#S3Path must end in /
ALIGN_DIR=$1
S3Path=$2
nThreads=$3

lockDir=/tmp/locks.$$
mkdir -p $lockDir

#each worker claims a file by atomically creating a lock directory for it
runJob(){
    lasti=$((${#files[@]} - 1))
    for i in $(seq 0 ${lasti}); do
        if (mkdir $lockDir/lock$i 2> /dev/null); then
            fileDone=${files[$i]}
            file=${fileDone%.done}    #strip the .done suffix to get the split file
            echo thread $1 working on $file
            echo "cd $ALIGN_DIR && aws s3 cp $file $S3Path$file"
            cd $ALIGN_DIR && nice aws s3 cp $file $S3Path$file && nice rm $fileDone
        fi
    done
    exit
}
files=( $(cd $ALIGN_DIR && ls -d */*.done) )

for i in $(seq 2 $nThreads); do
    runJob $i &
done
runJob 1 &
wait
rm -rf $lockDir
--------------------------------------------------------------------------------