#######################################################
Real-time analysis of Oxford Nanopore MinION sequencing
#######################################################

.. sectnum::

This directory contains information for setting up real-time analysis
of Oxford Nanopore sequencing data, as described in the paper:

Streaming algorithms to identify pathogens and antibiotic
resistance potential from real-time MinION sequencing

Cao, M. D., Ganesamoorthy, D., Elliott, A. G., Zhang, H., Cooper, M. A., & Coin, L. J. M. (2016). Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing. GigaScience, 5(1), 32. http://doi.org/10.1186/s13742-016-0137-2


=====================
Software installation
=====================

The streaming algorithms and auxiliary programs for setting up the pipeline are provided
in the japsa package (https://github.com/mdcao/japsa). In addition, the following
free-software dependencies are required:

1. Java >=1.7
2. bwa >=0.7.10 (0.7.10-r858-dirty recommended)
3. R with rJava and MultinomialCI installed
4. kalign2 (http://msa.sbc.su.se/cgi-bin/msa.cgi)
5. HDF5 (https://hdfgroup.org/HDF5/release/obtain5.html) -- optional, needed for translating native fast5 files to fastq with npReader (https://github.com/mdcao/npReader).

------------------------
Quick installation guide
------------------------
::

    git clone https://github.com/mdcao/japsa
    cd japsa
    make install \
      [INSTALL_DIR=~/.usr/local \]
      [MXMEM=7000m \]
      [SERVER=true \]
      [JLP=/usr/lib/jni:/usr/lib/R/site-library/rJava/jri]

(Note: the directives in square brackets are optional.
If you use them, remove the brackets and set the values to suit your computer settings.)

This will install japsa according to the directives:

* *INSTALL_DIR*: specifies the directory in which to install japsa; make sure you have write permission to this directory
* *MXMEM*: specifies the default memory allocated to the Java virtual machine
* *SERVER*: specifies whether to launch the Java virtual machine in server mode
* *JLP*: points to where the HDF libraries and JRI are installed (e.g., /usr/local/lib:/usr/lib/R/site-library/rJava/jri). The path to HDF is only needed for creating pipelines that analyse data directly from fast5 files or simultaneously with MinION sequencing.

For your convenience, add INSTALL_DIR/bin to your PATH environment variable, e.g.::

    export PATH=~/.usr/local/bin:$PATH

For more detailed information on installing japsa, please refer to the Japsa installation guide at
http://japsa.readthedocs.org/en/latest/install.html

==================
Databases and data
==================

The analyses described in the paper require access to some pre-processed databases. We make these
databases available at http://data.genomicsresearch.org/Projects/npAnalysis/ (with backup storage at https://swift.rc.nectar.org.au:8888/v1/AUTH_15574c7fb24c44b3b34069185efba190/npAnalysis/).
Set up these databases as follows. You can choose to download only the databases relevant to your desired analyses.

--------------------------------
Bacterial species identification
--------------------------------

The species identification pipeline requires a database of genomes of
interest, which is simply the concatenation of all genomes in fasta format.
Prepare an index file which specifies the species of each sequence in the
database.
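The index file can be derived from the FASTA headers rather than written by hand. The following is a minimal sketch and not part of japsa. It assumes that each index line is a species label, a space, and the full FASTA header line (including the leading ``>``). It also assumes a hypothetical tab-separated file ``species_map.tsv`` (accession, then species label) that you supply yourself; the function name ``make_species_index`` is likewise illustrative.

```shell
# Sketch only: build a species index from FASTA headers.
# Assumptions (not from japsa): species_map.tsv is a hypothetical file of
# "accession<TAB>species_label" lines, and each index line has the form
# "species_label >full FASTA header".
make_species_index() {
    # $1 = accession-to-species map, $2 = FASTA database
    awk -F '\t' '
        NR == FNR { species[$1] = $2; next }    # first file: load the map
        /^>/ {                                   # second file: header lines only
            split(substr($0, 2), fields, " ")   # accession = first word after ">"
            acc = fields[1]
            if (acc in species)
                print species[acc] " " $0
            else
                print "no species label for " acc > "/dev/stderr"
        }
    ' "$1" "$2"
}

# Usage (file names are placeholders):
#   make_species_index species_map.tsv genomeDB.fasta > speciesIndex
```

Headers whose accession is missing from the map are reported on standard error and left out of the index, so check that output before running the pipeline.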
For example:

Content of genomeDB.fasta::

    >NC_0000011 Chromosome of species Genus1 species1
    ACGTACGTACGT
    >NC_00000012 Plasmid 1 of species Genus1 species1
    ACGTACGTACGT
    >NC_00000013 Plasmid 2 of species Genus1 species1
    ACGTACGTACGT
    >NC_00000021 Chromosome of species Genus1 species2
    ACGTACGTACGT
    >NC_00000031 Chromosome of species Genus2 species3
    ACGTACGTACGT


Content of speciesIndex::

    Genus1_species1 >NC_0000011 Chromosome of species Genus1 species1
    Genus1_species1 >NC_00000012 Plasmid 1 of species Genus1 species1
    Genus1_species1 >NC_00000013 Plasmid 2 of species Genus1 species1
    Genus1_species2 >NC_00000021 Chromosome of species Genus1 species2
    Genus2_species3 >NC_00000031 Chromosome of species Genus2 species3


Finally, build a bwa index of the database::

    bwa index genomeDB.fasta


We provide a pre-compiled database of all bacterial genomes obtained from NCBI GenBank, with the
addition of two K. quasipneumoniae strains (to be updated in the manuscript). Download the
database (~2.8GB) and build a bwa index of it as follows::

    wget http://data.genomicsresearch.org/Projects/npAnalysis/SpeciesTyping.tar.gz
    tar zxvf SpeciesTyping.tar.gz
    gunzip SpeciesTyping/Bacteria/genomeDB.fasta.gz
    bwa index SpeciesTyping/Bacteria/genomeDB.fasta

Note that it might take a while to build the bwa index for this 9Gb database.

-----------------------
Strain typing with MLST
-----------------------


The databases for MLST typing of three species, K. pneumoniae, E. coli and
S. aureus, are made available.
Download (208KB) and unzip them::

    wget http://data.genomicsresearch.org/Projects/npAnalysis/MLST.tar.gz
    tar zxvf MLST.tar.gz


--------------------------------------------
Strain typing with gene presence and absence
--------------------------------------------

The database for gene presence and absence strain typing for K. pneumoniae, E. coli and
S. aureus can be obtained as follows::

    wget http://data.genomicsresearch.org/Projects/npAnalysis/StrainTyping.tar.gz
    tar zxvf StrainTyping.tar.gz

------------------------------
Resistance gene identification
------------------------------

A database of antibiotic resistance genes, obtained from ResFinder (https://cge.cbs.dtu.dk/services/ResFinder/) and pre-processed, is provided at::

    wget http://data.genomicsresearch.org/Projects/npAnalysis/ResGene.tar.gz
    tar zxvf ResGene.tar.gz


======================================
Setting up real-time analysis pipeline
======================================

The framework makes use of the interprocess communication mechanism pipe, as well as network channels, to set up the real-time pipeline. The japsa package provides jsa.util.streamServer and jsa.util.streamClient to facilitate setting up a pipeline distributed over a computer cluster. You can prepare one or more analyses to run in real-time.

For bacterial species typing::

    jsa.util.streamServer -port 3456 \
      | bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 SpeciesTyping/Bacteria/genomeDB.fasta - 2> /dev/null \
      | jsa.np.rtSpeciesTyping -bam - -index SpeciesTyping/Bacteria/speciesIndex --read 50 -time 60 -out speciesTypingResults.out 2> speciesTypingResults.log &

This will create a pipeline to identify species which reports every 60 seconds, with at least 50 more reads since the last report.
The pipeline listens on port 3456 for incoming data.


For strain typing with gene presence/absence for K. pneumoniae::

    jsa.util.streamServer -port 3457 \
      | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a StrainTyping/Klebsiella_pneumoniae/geneFam.fasta - 2> /dev/null \
      | jsa.np.rtStrainTyping -bam - -geneDB StrainTyping/Klebsiella_pneumoniae/ -read 0 -time 20 --out kPStrainTyping.dat 2> kPStrainTyping.log &

You can run strain typing pipelines for other species (e.g., E. coli and S. aureus)
if you have reasons to believe the sample may contain these species. If these pipelines
run on the same computer, make sure they listen on different ports.

For strain typing with MLST::

    jsa.util.streamServer -port 3458 \
      | bwa mem -t 8 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y MLST/Klebsiella_pneumoniae/bwaIndex/genes.fasta - \
      | jsa.np.rtMLST -bam - -mlst MLST/Klebsiella_pneumoniae/ -read 1000 -time 600 --out KpMLST.dat &

Again, you can set up MLST for E. coli and/or S. aureus as well. However, due to the high error rate of current
Oxford Nanopore sequencing, this analysis may require a large amount of data. The presence/absence analysis above is recommended.


For resistance gene identification::

    jsa.util.streamServer -port 3459 \
      | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a ResGene/resFinder/DB.fasta - 2> /dev/null \
      | jsa.np.rtResistGenes -bam - -score=0.0001 -time 120 -read 50 --resDB ResGene/resFinder/ -tmp _tmp_ -o resGene.dat -thread 4 2> resGene.log &


You can run these sub-pipelines on one computer (they have to listen on different ports) or across a number of computers. You can even split a sub-pipeline to run over two computers.
For example, you can run the resistance gene analysis on one computer::

    jsa.util.streamServer -port 3460 \
      | jsa.np.rtResistGenes -bam - -score=0.0001 -time 120 -read 50 --resDB ResGene/resFinder/ -tmp _tmp_ -o resGene.dat -thread 4 2> resGene.log &

and run bwa on another::

    jsa.util.streamServer -port 3461 \
      | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a ResGene/resFinder/DB.fasta - 2> /dev/null \
      | jsa.util.streamClient -input - -server computer1:3460

which listens for streaming data in fastq format on port 3461, aligns it to the resistance gene database, and forwards the alignments in sam format to the resistance gene analysis via the network.

In these sub-pipelines, you may want to modify the parameter -port for jsa.util.streamServer and -t for bwa to suit your computer systems.

Once these daemons are ready for their analyses, you can start npReader to stream data into the integrated pipeline::

    jsa.np.npreader -GUI -realtime -folder -fail -output data.fastq -stream server1:port1,server2:port2,server3:port3

in which the -folder parameter specifies the downloads folder from the Metrichor base-calling, and the -stream parameter lists the computer addresses and port numbers that the analyses are listening on. At this point, you can start the MinION and Metrichor to begin the real-time analysis.

=======================
Retro-realtime analysis
=======================

Note that npreader no longer supports extracting time information. Please use an earlier version for this.

If your data have already been sequenced, how you proceed depends on which processing steps have been done:

* If your data have not been base-called, you can start the pipeline as above, and run Metrichor to base-call your data.

* If your data have been base-called, and are still in fast5 format, you can run npReader as above to stream data to the pipeline.

* If your data have been converted to fastq format, you can run jsa.util.streamClient to stream to the pipeline::

      jsa.util.streamClient -input reads.fastq -server server1:port1,server2:port2,server3:port3

* If you want to emulate the timing of your sequenced data, first convert the data to fastq format and extract the timing information (make sure parameter -time is turned on)::

      jsa.np.npreader -folder -fail -number -stat -time -out dataT.fastq

  Next, sort the reads in the order they were generated::

      jsa.seq.sort -i dataT.fastq -o dataS.fastq --sortKey=timestamp

  Finally, stream the data using jsa.np.timeEmulate::

      jsa.np.timeEmulate -input dataS.fastq -scale 1 -output - | jsa.util.streamClient -input - -server server1:port1,server2:port2,server3:port3

  You can increase the value of -scale to test higher throughput.

We provide the data from our four MinION runs in fastq format, sorted in the order
of sequencing (key=cTime). To re-run our analyses, set up the analysis pipeline as above,
and then stream our data through the pipeline, e.g.::

    wget http://data.genomicsresearch.org/Projects/npAnalysis/data.tar.gz
    tar zxvf data.tar.gz
    jsa.np.timeEmulate -input data/nGN_045_R7_X4S.fastq -scale 120 -output - | jsa.util.streamClient -input - -server server1:port1,server2:port2,server3:port3

===================
Data from the study
===================

The sequencing data for the experiments in the paper have been deposited
in ENA under accession PRJEB14532.


=====================
Further documentation
=====================

More details on the usage of the discussed programs are provided in the ReadTheDocs documentation for Japsa. More specifically:

* npReader
* jsa.util.streamServer
* jsa.util.streamClient
* jsa.np.filter
* jsa.np.rtSpeciesTyping
* jsa.np.rtStrainTyping
* jsa.np.rtMLST
* jsa.np.rtResistGenes

=======
Contact
=======
Minh Duc Cao -- m.cao1@uq.edu.au