└── README.rst


/README.rst:
--------------------------------------------------------------------------------
  1 | #######################################################
  2 | Real-time analysis of Oxford Nanopore MinION sequencing
  3 | #######################################################
  4 | 
  5 | .. sectnum::
  6 | 
  7 | This directory contains information for setting up real-time analysis
  8 | of Oxford Nanopore sequencing data, as described in the paper:
  9 | 
 10 | Streaming algorithms to identify pathogens and antibiotic
 11 | resistance potential from real-time MinION sequencing
 12 | 
 13 | Cao, M. D., Ganesamoorthy, D., Elliott, A. G., Zhang, H., Cooper, M. A., & Coin, L. J. M. (2016). Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing. GigaScience, 5(1), 32. http://doi.org/10.1186/s13742-016-0137-2
 14 | 
 15 | 
 16 | =====================
 17 | Software installation
 18 | =====================
 19 | 
 20 | The streaming algorithms and auxiliary programs for setting up the pipeline are provided
 21 | in the japsa package (https://github.com/mdcao/japsa). In addition, the following
 22 | free-software dependencies are required:
 23 | 
 24 | 1. Java >=1.7
 25 | 2. bwa >=0.7.10 (0.7.10-r858-dirty recommended)
 26 | 3. R with rJava and MultinomialCI installed
 27 | 4. kalign2 (http://msa.sbc.su.se/cgi-bin/msa.cgi)
 28 | 5. HDF5 (https://hdfgroup.org/HDF5/release/obtain5.html) -- optional, needed for translating native fast5 files to fastq with npReader (https://github.com/mdcao/npReader).
 29 | 
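Before installing, it can help to confirm that the command-line dependencies listed above are reachable. The following is a minimal sketch, not part of japsa: the function name ``check_deps`` is hypothetical, and the executable names (java, bwa, R, kalign) are assumptions that may differ on your system.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are found, 1 otherwise.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# ASSUMPTION: these executable names are illustrative; adjust them to your system.
check_deps java bwa R kalign || echo "Install the missing tools before continuing."
```

This only checks that the executables exist; it does not verify version numbers or the R packages rJava and MultinomialCI.
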
 30 | ------------------------
 31 | Quick installation guide
 32 | ------------------------
 33 | ::
 34 | 
 35 |    git clone https://github.com/mdcao/japsa
 36 |    cd japsa
 37 |    make install \
 38 |      [INSTALL_DIR=~/.usr/local \]
 39 |      [MXMEM=7000m \]
 40 |      [SERVER=true \]
 41 |      [JLP=/usr/lib/jni:/usr/lib/R/site-library/rJava/jri]
 42 | 
 43 | (Note: the directives in square brackets are optional. To use them, remove the brackets and set the values to suit your system.)
 44 | 
 45 | This will install japsa according to the following directives:
 46 | 
 47 | * *INSTALL_DIR*: the directory to install japsa into; make sure you have write permission to this directory
 48 | * *MXMEM*: the default memory allocated to the Java virtual machine
 49 | * *SERVER*: whether to launch the Java virtual machine in server mode
 50 | * *JLP*: the paths where the HDF libraries and JRI are installed (e.g., /usr/local/lib:/usr/lib/R/site-library/rJava/jri). The path to HDF is only needed for pipelines that analyse data directly from fast5 files or in real time during MinION sequencing. For your convenience, add INSTALL_DIR/bin to your PATH environment variable, for example::
 51 | 
 52 |    export PATH=~/.usr/local/bin:$PATH
 53 | 
 54 | For more detailed instructions on installing japsa, please refer to the Japsa installation guide at
 55 | http://japsa.readthedocs.org/en/latest/install.html
 56 | 
 57 | ==================
 58 | Databases and data
 59 | ==================
 60 | 
 61 | The analyses described in the paper require access to some pre-processed databases. We make these
 62 | databases available at http://data.genomicsresearch.org/Projects/npAnalysis/ (with a backup copy at https://swift.rc.nectar.org.au:8888/v1/AUTH_15574c7fb24c44b3b34069185efba190/npAnalysis/).
 63 | Set up these databases as follows. You only need to download the databases relevant to your desired analyses.
 64 | 
 65 | --------------------------------
 66 | Bacterial species identification
 67 | --------------------------------
 68 | 
 69 | The species identification pipeline requires a database of genomes of
 70 | interest which is simply the concatenation of all genomes in fasta format.
 71 | Prepare an index file which specifies the species of each sequence in the
 72 | database. For example:
 73 | 
 74 | Content of genomeDB.fasta::
 75 | 
 76 |   >NC_0000011 Chromosome of species Genus1 species1
 77 |   ACGTACGTACGT
 78 |   >NC_00000012 Plasmid 1  of species Genus1 species1
 79 |   ACGTACGTACGT
 80 |   >NC_00000013 Plasmid 2  of species Genus1 species1
 81 |   ACGTACGTACGT
 82 |   >NC_00000021 Chromosome of species Genus1 species2
 83 |   ACGTACGTACGT
 84 |   >NC_00000031 Chromosome of species Genus2 species3
 85 |   ACGTACGTACGT
 86 | 
 87 | 
 88 | Content of speciesIndex::
 89 | 
 90 |   Genus1_species1 >NC_0000011 Chromosome of species Genus1 species1
 91 |   Genus1_species1 >NC_00000012 Plasmid 1  of species Genus1 species1
 92 |   Genus1_species1 >NC_00000013 Plasmid 2  of species Genus1 species1
 93 |   Genus1_species2 >NC_00000021 Chromosome of species Genus1 species2
 94 |   Genus2_species3 >NC_00000031 Chromosome of species Genus2 species3
 95 | 
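For a database whose headers follow the toy layout above, the index can be derived from the fasta file itself. The following is a minimal sketch under a strong assumption: the last two words of every header are the genus and the species. Real GenBank headers generally do not follow this layout, so the awk rule would need adapting. The function name ``build_species_index`` and the temporary demo files are illustrative only.

```shell
# Build a species index from FASTA headers.
# ASSUMPTION: the last two words of each header are the genus and species.
build_species_index() {
    awk '/^>/ && NF >= 3 { print $(NF-1) "_" $NF, $0 }' "$1"
}

# Demo on a tiny stand-in database (file names are illustrative).
demo_dir=$(mktemp -d)
cat > "$demo_dir/genomeDB.fasta" <<'EOF'
>NC_0000011 Chromosome of species Genus1 species1
ACGTACGTACGT
>NC_00000021 Chromosome of species Genus1 species2
ACGTACGTACGT
EOF

build_species_index "$demo_dir/genomeDB.fasta" > "$demo_dir/speciesIndex"
cat "$demo_dir/speciesIndex"
```

Each output line holds the species label followed by the unmodified header, which is the layout shown for speciesIndex above.
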
 96 | 
 97 | Finally, build a bwa index of the database::
 98 | 
 99 |   bwa index genomeDB.fasta
100 | 
101 | 
102 | We provide a pre-compiled database of all bacterial genomes obtained from NCBI GenBank, with the
103 | addition of two K. quasipneumoniae strains (to be updated in the manuscript). Download the
104 | database (~2.8GB) and build a bwa index of it as follows::
105 | 
106 |    wget http://data.genomicsresearch.org/Projects/npAnalysis/SpeciesTyping.tar.gz
107 |    tar zxvf SpeciesTyping.tar.gz
108 |    gunzip SpeciesTyping/Bacteria/genomeDB.fasta.gz
109 |    bwa index SpeciesTyping/Bacteria/genomeDB.fasta
110 | 
111 | Note that it might take a while to build the bwa index for this 9Gb database.
112 | 
113 | -----------------------
114 | Strain typing with MLST
115 | -----------------------
116 | 
117 | 
118 | The databases for MLST typing of three species, K. pneumoniae, E. coli and
119 | S. aureus, are made available. Download (208KB) and unpack them::
120 | 
121 |    wget http://data.genomicsresearch.org/Projects/npAnalysis/MLST.tar.gz
122 |    tar zxvf MLST.tar.gz
123 | 
124 | 
125 | --------------------------------------------
126 | Strain typing with gene presence and absence
127 | --------------------------------------------
128 | 
129 | The database for gene presence and absence strain typing for K. pneumoniae, E. coli and
130 | S. aureus can be obtained as follows::
131 | 
132 |   wget http://data.genomicsresearch.org/Projects/npAnalysis/StrainTyping.tar.gz
133 |   tar zxvf StrainTyping.tar.gz
134 | 
135 | ------------------------------
136 | Resistance gene identification
137 | ------------------------------
138 | 
139 | A database of antibiotic resistance genes, obtained from ResFinder (https://cge.cbs.dtu.dk/services/ResFinder/) and pre-processed, can be downloaded as follows::
140 | 
141 |   wget http://data.genomicsresearch.org/Projects/npAnalysis/ResGene.tar.gz
142 |   tar zxvf ResGene.tar.gz
143 | 
144 | 
145 | ======================================
146 | Setting up real-time analysis pipeline
147 | ======================================
148 | 
149 | The framework makes use of the `interprocess communication mechanism pipe <https://en.wikipedia.org/wiki/Pipeline_(Unix)>`_ as well as network channels to set up the real-time pipeline. The japsa package provides `jsa.util.streamServer <http://japsa.readthedocs.org/en/latest/tools/jsa.util.streamServer.html>`_ and `jsa.util.streamClient <http://japsa.readthedocs.org/en/latest/tools/jsa.util.streamClient.html>`_ to facilitate setting up a pipeline distributed over a computer cluster. You can prepare one or more analyses to run in real time.
150 | 
151 | For bacterial species typing::
152 | 
153 |    jsa.util.streamServer -port 3456 \
154 |      | bwa mem -t 10 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 SpeciesTyping/Bacteria/genomeDB.fasta - 2> /dev/null \
155 |      | jsa.np.rtSpeciesTyping -bam - -index SpeciesTyping/Bacteria/speciesIndex --read 50 -time 60 -out speciesTypingResults.out 2>  speciesTypingResults.log &
156 | 
157 | This creates a species identification pipeline that reports every 60 seconds, provided at least 50 new reads have arrived since the last report. The pipeline listens on port 3456 for incoming data.
158 | 
159 | 
160 | For strain typing by gene presence/absence for K. pneumoniae::
161 | 
162 |    jsa.util.streamServer -port 3457 \
163 |      | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a StrainTyping/Klebsiella_pneumoniae/geneFam.fasta - 2> /dev/null \
164 |      | jsa.np.rtStrainTyping -bam -  -geneDB StrainTyping/Klebsiella_pneumoniae/ -read 0 -time 20 --out kPStrainTyping.dat 2>  kPStrainTyping.log &
165 | 
166 | You can run strain typing pipelines for other species (e.g., E. coli and S. aureus)
167 | if you have reason to believe the sample may contain these species. If these pipelines
168 | run on the same computer, make sure they listen on different ports.
169 | 
170 | For strain typing with MLST::
171 | 
172 |    jsa.util.streamServer -port 3458 \
173 |      | bwa mem -t 8 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -a -Y MLST/Klebsiella_pneumoniae/bwaIndex/genes.fasta - \
174 |      | jsa.np.rtMLST -bam - -mlst MLST/Klebsiella_pneumoniae/ -read 1000 -time 600  --out KpMLST.dat &
175 | 
176 | Again, you can set up MLST for E. coli and/or S. aureus as well. However, due to the high error rate of current
177 | Oxford Nanopore sequencing, this analysis may require a large amount of data. The presence/absence analysis above is recommended instead.
178 | 
179 | 
180 | For resistance gene identification::
181 | 
182 |    jsa.util.streamServer -port 3459 \
183 |      | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a ResGene/resFinder/DB.fasta - 2> /dev/null \
184 |      | jsa.np.rtResistGenes -bam - -score=0.0001 -time 120 -read 50 --resDB  ResGene/resFinder/  -tmp _tmp_ -o resGene.dat -thread 4  2> resGene.log &
185 | 
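Because each sub-pipeline above listens on its own port, a quick check for accidental duplicates before launching may save some debugging. The following is a minimal sketch, not part of japsa; the function name ``find_duplicate_ports`` is hypothetical, and the check only compares the numbers you pass in, without testing whether a port is already in use by another process.

```shell
# Print any port number that appears more than once in the argument list.
# Returns 1 if a duplicate is found, 0 otherwise.
find_duplicate_ports() {
    dups=$(printf '%s\n' "$@" | sort -n | uniq -d)
    if [ -n "$dups" ]; then
        # Unquoted on purpose: joins multiple duplicates onto one line.
        echo "duplicate port(s): $(echo $dups)"
        return 1
    fi
    return 0
}

# Ports used in the examples of this section.
find_duplicate_ports 3456 3457 3458 3459 && echo "all ports are distinct"
```
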
186 | 
187 | You can run these sub-pipelines on one computer (they have to listen on different ports) or across several computers. You can even split a sub-pipeline to run over two computers. For example, you can run the resistance gene analysis on one computer::
188 | 
189 |    jsa.util.streamServer -port 3460 \
190 |     | jsa.np.rtResistGenes -bam - -score=0.0001 -time 120 -read 50 --resDB  ResGene/resFinder/ -tmp _tmp_ -o resGene.dat -thread 4  2> resGene.log &
191 | 
192 | and run bwa on another::
193 | 
194 |    jsa.util.streamServer -port 3461 \
195 |     | bwa mem -t 2 -k11 -W20 -r10 -A1 -B1 -O1 -E1 -L0 -Y -K 10000 -a ResGene/resFinder/DB.fasta - 2> /dev/null \
196 |     | jsa.util.streamClient -input - -server computer1:3460
197 | 
198 | which listens for streaming data in fastq format on port 3461, aligns the reads to the resistance gene database, and forwards the alignments in sam format to the resistance gene analysis over the network.
199 | 
200 | In these sub-pipelines, you may want to modify the parameters -port for jsa.util.streamServer and -t for bwa to suit your computer systems.
201 | 
202 | Once these `daemons <https://en.wikipedia.org/wiki/Daemon_(computing)>`_ are ready for their analyses, you can start npReader to stream data into the integrated pipeline::
203 | 
204 |    jsa.np.npreader -GUI -realtime -folder <DownloadFolder> -fail -output data.fastq -stream server1:port1,server2:port2,server3:port3
205 |  
206 | in which the -folder parameter specifies the downloads folder from the Metrichor base-calling, and the -stream parameter lists the computer addresses and port numbers that the analyses are listening on. At this point, you can start the MinION and Metrichor to begin the real-time analysis.
207 | 
208 | =======================
209 | Retro-realtime analysis
210 | =======================
211 | 
212 | Note: npReader no longer supports extracting time information. Please use an earlier version for this.
213 | 
214 | If your data have already been sequenced, how to proceed depends on which processing steps have been done:
215 | 
216 | * If your data have not been base-called, you can start the pipeline as above, and run Metrichor to base-call your data.
217 | 
218 | * If your data have been base-called, and are still in fast5 format, you can run npReader as above to stream data to the pipeline.
219 | 
220 | * If your data have been converted to fastq format, you can run jsa.util.streamClient to stream to the pipeline::
221 | 
222 |     jsa.util.streamClient -input reads.fastq -server server1:port1,server2:port2,server3:port3
223 |   
224 | * If you want to emulate the timing of your sequenced data, first convert the data to fastq format and extract the timing information (make sure parameter -time is turned on)::
225 | 
226 |    jsa.np.npreader -folder <downloads> -fail -number -stat -time -out dataT.fastq
227 |   
228 | Next sort the reads in the order they were generated::
229 |   
230 |    jsa.seq.sort -i dataT.fastq -o dataS.fastq --sortKey=timestamp
231 |   
232 | Finally, stream the data using jsa.np.timeEmulate::
233 |   
234 |    jsa.np.timeEmulate -input dataS.fastq -scale 1 -output - |jsa.util.streamClient -input - -server  server1:port1,server2:port2,server3:port3
235 | 
236 | You can increase the value of -scale to test higher throughput.
237 | 
238 | We provide the data from our four MinION runs in fastq format, sorted in the order
239 | of sequencing (key=cTime). To re-run our analyses, set up the analysis pipeline as above,
240 | and then stream our data through the pipeline, e.g.::
241 | 
242 |    wget http://data.genomicsresearch.org/Projects/npAnalysis/data.tar.gz
243 |    tar zxvf data.tar.gz
244 |    jsa.np.timeEmulate -input data/nGN_045_R7_X4S.fastq -scale 120 -output - |jsa.util.streamClient -input - -server  server1:port1,server2:port2,server3:port3
245 | 
246 | ===================
247 | Data from the study
248 | ===================
249 | 
250 | The sequencing data for the experiments in the paper have been deposited
251 | to ENA under Accession `PRJEB14532 <http://www.ebi.ac.uk/ena/data/view/PRJEB14532>`_.
252 | 
253 | 
254 | ======================
255 | Further documentation
256 | ======================
257 | 
258 | More details on the usage of the programs discussed are provided in `ReadTheDocs for Japsa <http://japsa.readthedocs.org/en/latest/>`_. More specifically:
259 | 
260 | * `npReader <http://japsa.readthedocs.org/en/latest/tools/jsa.np.npreader.html>`_
261 | * `jsa.util.streamServer <http://japsa.readthedocs.org/en/latest/tools/jsa.util.streamServer.html>`_
262 | * `jsa.util.streamClient <http://japsa.readthedocs.org/en/latest/tools/jsa.util.streamClient.html>`_
263 | * `jsa.np.filter <http://japsa.readthedocs.org/en/latest/tools/jsa.np.filter.html>`_
264 | * `jsa.np.rtSpeciesTyping <http://japsa.readthedocs.org/en/latest/tools/jsa.np.rtSpeciesTyping.html>`_
265 | * `jsa.np.rtStrainTyping <http://japsa.readthedocs.org/en/latest/tools/jsa.np.rtStrainTyping.html>`_
266 | * `jsa.np.rtMLST <http://japsa.readthedocs.org/en/latest/tools/jsa.np.rtMLST.html>`_
267 | * `jsa.np.rtResistGenes <http://japsa.readthedocs.org/en/latest/tools/jsa.np.rtResistGenes.html>`_
268 | 
269 | =======
270 | Contact
271 | =======
272 | Minh Duc Cao -- m.cao1@uq.edu.au
273 | 
274 | 
275 | 
276 | 
277 | 


--------------------------------------------------------------------------------