├── images
    └── fetch.jpg
├── .github
    └── ISSUE_TEMPLATE
    │   ├── bug_report.md
    │   └── feature_request.md
├── LICENSE
├── batch_tater.py
├── fast5_fetcher.py
└── README.md


/images/fetch.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Psy-Fer/fast5_fetcher/HEAD/images/fetch.jpg


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Bug report
 3 | about: Create a report to help us improve
 4 | 
 5 | ---
 6 | 
 7 | **Describe the bug**
 8 | A clear and concise description of what the bug is.
 9 | 
10 | **To Reproduce**
11 | Steps to reproduce the behavior:
12 | 
13 | **Expected behavior**
14 | A clear and concise description of what you expected to happen.
15 | 
16 | **Screenshots**
17 | If applicable, add screenshots to help explain your problem.
18 | 
19 | **Desktop (please complete the following information):**
20 |  - OS: [e.g. MacOS, Ubuntu)
21 | 
22 | **Additional context**
23 | Add any other context about the problem here.
24 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Feature request
 3 | about: Suggest an idea for this project
 4 | 
 5 | ---
 6 | 
 7 | **Is your feature request related to a problem? Please describe.**
 8 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
 9 | 
10 | **Describe the solution you'd like**
11 | A clear and concise description of what you want to happen.
12 | 
13 | **Describe alternatives you've considered**
14 | A clear and concise description of any alternative solutions or features you've considered.
15 | 
16 | **Additional context**
17 | Add any other context or screenshots about the feature request here.
18 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 James Ferguson
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/batch_tater.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | import subprocess
 3 | '''
 4 |     Potato scripting engaged.
 5 | 
 6 |     James M. Ferguson (j.ferguson@garvan.org.au)
 7 |     Genomic Technologies
 8 |     Garvan Institute
 9 |     Copyright 2017
10 | 
11 |     batch_tater.py takes list/s of files to extract, and speeds it up a bit, by only opening
12 |     one tar file at a time and extracting what is needed.
13 | 
14 |     To run on sun grid engine using array jobs as a hacky way of doing multiprocessing.
15 |     Also, helps check when things go wrong, and easy to relaunch failed jobs.
16 |     Some things left in from running on some tasty nanopore single cell data.
17 | 
18 |     sge file:
19 | 
20 |     source ~/work/venv2714/bin/activate
21 | 
22 |     FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p)
23 |     BLAH=fast5/${FILE}
24 | 
25 |     mkdir ${TMPDIR}/fast5
26 | 
27 |     time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/
28 | 
29 |     echo "size of files:" >&2
30 |     du -shc ${TMPDIR}/fast5/ >&2
31 |     echo "extraction complete!" >&2
32 |     echo "Number of files:" >&2
33 |     ls ${TMPDIR}/fast5/ | wc -l >&2
34 | 
35 |     echo "copying data..." >&2
36 | 
37 |     tar -cf ${TMPDIR}/f5f.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
38 |     cp ${TMPDIR}/f5f.${SGE_TASK_ID}.tar ./clean_f5s/
39 | 
40 |     CMD:
41 | 
42 |     CMD="qsub -cwd -V -pe smp 1 -N batchCln -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=20G ../batch.sge"
43 | 
44 |     Launch:
45 | 
46 |     echo $CMD && $CMD
47 | 
48 | 
49 |     stats:
50 | 
51 |     fastq: 27491304
52 |     mapped: 11740093
53 |     z mode time: 10min
54 |     batch_tater total time: 21min
55 |     per job time: ~28s
56 |     number of CPUs: 100
57 | '''
58 | 
59 | # being lazy and using sys.argv...i mean, it is pretty lit
60 | master = sys.argv[1]
61 | tar_list = sys.argv[2]
62 | save_path = sys.argv[3]
63 | 
64 | # this will probs need to be changed based on naming convention
65 | # I think i was a little tired when I wrote this
66 | list_name = tar_list.split('/')[-1]
67 | 
68 | PATH = 0
69 | 
70 | # not elegent, but gets it done
71 | with open(master, 'r') as f:
72 |     for l in f:
73 |         l = l.strip('\n')
74 |         l = l.split('\t')
75 |         if l[0] == list_name:
76 |             PATH = l[1]
77 |             break
78 | 
79 | # for stats later and easy job relauncing
80 | print >> sys.stderr, "extracting:", tar_list
81 | # do the thing. That --transform hack is awesome. Blows away all the leading folders.
82 | if PATH:
83 |     cmd = "tar -xf {} --transform='s/.*\///' -C {} -T {}".format(
84 |         PATH, save_path, tar_list)
85 |     subprocess.call(cmd, shell=True, executable='/bin/bash')
86 | 
87 | else:
88 |     print >> sys.stderr, "PATH not found! check index nooblet"
89 |     print >> sys.stderr, "inputs:", master, tar_list, save_path
90 | 


--------------------------------------------------------------------------------
/fast5_fetcher.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import sys
  3 | import gzip
  4 | import io
  5 | import subprocess
  6 | import traceback
  7 | import argparse
  8 | import platform
  9 | from functools import partial
 10 | '''
 11 | 
 12 |     James M. Ferguson (j.ferguson@garvan.org.au)
 13 |     Genomic Technologies
 14 |     Garvan Institute
 15 |     Copyright 2017
 16 | 
 17 |     fast5_fetcher is designed to help manage fast5 file data storage and organisation.
 18 |     It takes 3 files as input: fastq/paf/flat, sequencing_summary, index
 19 | 
 20 |     --------------------------------------------------------------------------------------
 21 |     version 0.0   - initial
 22 |     version 0.2   - added argparser and buffered gz streams
 23 |     version 0.3   - added paf input
 24 |     version 0.4   - added read id flat file input
 25 |     version 0.5   - pppp print output instead of extracting
 26 |     version 0.6   - did a dumb. changed x in s to set/dic entries O(n) vs O(1)
 27 |     version 0.7   - cleaned up a bit to share and removed some hot and steamy features
 28 |     version 0.8   - Added functionality for un-tarred file structures and seq_sum only
 29 |     version 1.0   - First release
 30 |     version 1.1   - refactor with dicswitch and batch_tater updates
 31 |     version 1.1.1 - Bug fix on --transform method, added OS detection
 32 |     version 1.2.0 - Added file trimming to fully segment selection
 33 | 
 34 |     TODO:
 35 |         - Python 3 compatibility
 36 |         - autodetect file structures
 37 |         - autobuild index file - make it a sub script as well
 38 |         - Consider using csv.DictReader() instead of wheel building
 39 |         - flesh out batch_tater and give better examples and clearer how-to
 40 |         - options to build new index of fetched fast5s
 41 | 
 42 |     -----------------------------------------------------------------------------
 43 |     MIT License
 44 | 
 45 |     Copyright (c) 2017 James Ferguson
 46 | 
 47 |     Permission is hereby granted, free of charge, to any person obtaining a copy
 48 |     of this software and associated documentation files (the "Software"), to deal
 49 |     in the Software without restriction, including without limitation the rights
 50 |     to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 51 |     copies of the Software, and to permit persons to whom the Software is
 52 |     furnished to do so, subject to the following conditions:
 53 | MyParser
 54 |     The above copyright notice and this permission notice shall be included in all
 55 |     copies or substantial portions of the Software.
 56 | 
 57 |     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 58 |     IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 59 |     FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 60 |     AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 61 |     LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 62 |     OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 63 |     SOFTWARE.
 64 | '''
 65 | 
 66 | 
 67 | class MyParser(argparse.ArgumentParser):
 68 |     def error(self, message):
 69 |         sys.stderr.write('error: %s\n' % message)
 70 |         self.print_help()
 71 |         sys.exit(2)
 72 | 
 73 | 
 74 | def main():
 75 |     '''
 76 |     do the thing
 77 |     '''
 78 |     parser = MyParser(
 79 |         description="fast_fetcher - extraction of specific nanopore fast5 files")
 80 |     group = parser.add_mutually_exclusive_group()
 81 |     group.add_argument("-q", "--fastq",
 82 |                        help="fastq.gz for read ids")
 83 |     group.add_argument("-p", "--paf",
 84 |                        help="paf alignment file for read ids")
 85 |     group.add_argument("-f", "--flat",
 86 |                        help="flat file of read ids")
 87 |     parser.add_argument("--OSystem", default=platform.system(),
 88 |                         help="running operating system - leave default unless doing odd stuff")
 89 |     parser.add_argument("-s", "--seq_sum",
 90 |                         help="sequencing_summary.txt.gz file")
 91 |     parser.add_argument("-i", "--index",
 92 |                         help="index.gz file mapping fast5 files in tar archives")
 93 |     parser.add_argument("-o", "--output",
 94 |                         help="output directory for extracted fast5s")
 95 |     parser.add_argument("-t", "--trim", action="store_true",
 96 |                         help="trim files as if standalone experiment, (fq, SS)")
 97 |     parser.add_argument("-l", "--trim_list",
 98 |                         help="list of file names to trim, comma separated. fastq only needed for -p and -f modes")
 99 |     parser.add_argument("-x", "--prefix", default="default",
100 |                         help="trim file prefix, eg: barcode_01, output: barcode_01.fastq, barcode_01_seq_sum.txt")
101 |     # parser.add_argument("-t", "--procs", type=int,
102 |     #                    help="Number of CPUs to use - TODO: NOT YET IMPLEMENTED")
103 |     parser.add_argument("-z", "--pppp", action="store_true",
104 |                         help="Print out tar commands in batches for further processing")
105 |     args = parser.parse_args()
106 | 
107 |     # print help if no arguments given
108 |     if len(sys.argv) == 1:
109 |         parser.print_help(sys.stderr)
110 |         sys.exit(1)
111 | 
112 |     print >> sys.stderr, "Starting things up!"
113 | 
114 |     p_dic = {}
115 |     if args.pppp:
116 |         print >> sys.stderr, "PPPP state! Not extracting, exporting tar commands"
117 | 
118 |     trim_pass = False
119 |     if args.trim:
120 |         SS = False
121 |         FQ = False
122 |         if args.trim_list:
123 |             A = args.trim_list.split(',')
124 |             for a in A:
125 |                 if "fastq" in a:
126 |                     FQ = a
127 |                 elif "txt" in a:
128 |                     SS = a
129 |                 else:
130 |                     print >> sys.stderr, "Unknown trim input. detects 'fastq' or 'txt' for files. Input:", a
131 |         else:
132 |             print >> sys.stderr, "No extra files given. Compatible with -q fastq input only"
133 | 
134 |         if args.fastq:
135 |             FQ = args.fastq
136 |         if args.seq_sum:
137 |             SS = args.seq_sum
138 | 
139 |         # final check
140 |         if FQ and SS:
141 |             trim_pass = True
142 |             print >> sys.stderr, "Trim setting detected. Writing to working direcory"
143 |         else:
144 |             print >> sys.stderr, "Unable to verify both fastq and sequencing_summary files. Please check filenames and try again. Exiting..."
145 |             sys.exit()
146 | 
147 |     ids = []
148 |     if args.fastq:
149 |         ids = get_fq_reads(args.fastq)
150 |         if trim_pass:
151 |             trim_SS(args, ids, SS)
152 |     elif args.paf:
153 |         ids = get_paf_reads(args.paf)
154 |         if trim_pass:
155 |             trim_both(args, ids, FQ, SS)
156 |     elif args.flat:
157 |         ids = get_flat_reads(args.flat)
158 |         if trim_pass:
159 |             trim_both(args, ids, FQ, SS)
160 |     if not ids and trim_pass:
161 |         filenames, ids = get_filenames(args.seq_sum, ids)
162 |         trim_both(args, ids, FQ, SS)
163 |     else:
164 |         filenames, ids = get_filenames(args.seq_sum, ids)
165 | 
166 |     paths = get_paths(args.index, filenames)
167 |     print >> sys.stderr, "extracting..."
168 |     # place multiprocessing pool here
169 |     for p, f in paths:
170 |         if args.pppp:
171 |             if p in p_dic:
172 |                 p_dic[p].append(f)
173 |             else:
174 |                 p_dic[p] = [f]
175 |             continue
176 |         else:
177 |             try:
178 |                 extract_file(args, p, f)
179 |             except:
180 |                 traceback.print_exc()
181 |                 print >> sys.stderr, "Failed to extract:", p, f
182 |     # For each .tar file, write a file with the tarball name as filename.tar.txt
183 |     # and contains a list of files to extract - input for batch_tater.py
184 |     if args.pppp:
185 |         with open("tater_master.txt", 'w') as m:
186 |             for i in p_dic:
187 |                 fname = "tater_" + i.split('/')[-1] + ".txt"
188 |                 m_entry = "{}\t{}".format(fname, i)
189 |                 fname = args.output + "/tater_" + i.split('/')[-1] + ".txt"
190 |                 m.write(m_entry)
191 |                 m.write('\n')
192 |                 with open(fname, 'w') as f:
193 |                     for j in p_dic[i]:
194 |                         f.write(j)
195 |                         f.write('\n')
196 | 
197 |     print >> sys.stderr, "done!"
198 | 
199 | 
200 | def dicSwitch(i):
201 |     '''
202 |     A switch to handle file opening and reduce duplicated code
203 |     '''
204 |     open_method = {
205 |         "gz": gzip.open,
206 |         "norm": open
207 |     }
208 |     return open_method[i]
209 | 
210 | 
211 | def get_fq_reads(fastq):
212 |     '''
213 |     read fastq file and extract read ids
214 |     quick and dirty to limit library requirements - still bullet fast
215 |     '''
216 |     c = 0
217 |     read_ids = set()
218 |     if fastq.endswith('.gz'):
219 |         f_read = dicSwitch('gz')
220 |     else:
221 |         f_read = dicSwitch('norm')
222 |     with f_read(fastq, 'rb') as fq:
223 |         if fastq.endswith('.gz'):
224 |             fq = io.BufferedReader(fq)
225 |         for line in fq:
226 |             c += 1
227 |             line = line.strip('\n')
228 |             if c == 1:
229 |                 idx = line.split()[0][1:]
230 |                 read_ids.add(idx)
231 |             elif c >= 4:
232 |                 c = 0
233 |     return read_ids
234 | 
235 | 
236 | def get_paf_reads(reads):
237 |     '''
238 |     Parse paf file to pull read ids (from minimap2 alignment)
239 |     '''
240 |     read_ids = set()
241 |     if reads.endswith('.gz'):
242 |         f_read = dicSwitch('gz')
243 |     else:
244 |         f_read = dicSwitch('norm')
245 |     with f_read(reads, 'rb') as fq:
246 |         if reads.endswith('.gz'):
247 |             fq = io.BufferedReader(fq)
248 |         for line in fq:
249 |             line = line.strip('\n')
250 |             line = line.split()
251 |             read_ids.add(line[0])
252 |     return read_ids
253 | 
254 | 
255 | def get_flat_reads(filename):
256 |     '''
257 |     Parse a flat file separated by line breaks \n
258 |     TODO: make @ symbol check once, as they should all be the same
259 |     '''
260 |     read_ids = set()
261 |     check = True
262 |     if filename.endswith('.gz'):
263 |         f_read = dicSwitch('gz')
264 |     else:
265 |         f_read = dicSwitch('norm')
266 |     with f_read(filename, 'rb') as fq:
267 |         if filename.endswith('.gz'):
268 |             fq = io.BufferedReader(fq)
269 |         for line in fq:
270 |             line = line.strip('\n')
271 |             if check:
272 |                 if line[0] == '@':
273 |                     x = 1
274 |                 else:
275 |                     x = 0
276 |                 check = False
277 |             idx = line[x:]
278 |             read_ids.add(idx)
279 |     return read_ids
280 | 
281 | 
282 | def trim_SS(args, ids, SS):
283 |     '''
284 |     Trims the sequencing_summary.txt file to only the input IDs
285 |     '''
286 |     if args.prefix:
287 |         pre = args.prefix + "_seq_sum.txt"
288 |     else:
289 |         pre = "trimmed_seq_sum.txt"
290 |     head = True
291 |     if SS.endswith('.gz'):
292 |         f_read = dicSwitch('gz')
293 |     else:
294 |         f_read = dicSwitch('norm')
295 |     # make this compatible with dicSwitch
296 |     with open(pre, "w") as w:
297 |         with f_read(SS, 'rb') as sz:
298 |             if SS.endswith('.gz'):
299 |                 sz = io.BufferedReader(sz)
300 |             for line in sz:
301 |                 if head:
302 |                     w.write(line)
303 |                     head = False
304 |                     continue
305 |                 l = line.split()
306 |                 if l[1] in ids:
307 |                     w.write(line)
308 | 
309 | 
310 | def trim_both(args, ids, FQ, SS):
311 |     '''
312 |     Trims the sequencing_summary.txt and fastq files to only the input IDs
313 |     '''
314 |     # trim the SS
315 |     trim_SS(args, ids, SS)
316 |     if args.prefix:
317 |         pre = args.prefix + ".fastq"
318 |     else:
319 |         pre = "trimmed.fastq"
320 | 
321 |     # trim the fastq
322 |     c = 0
323 |     P = False
324 |     if FQ.endswith('.gz'):
325 |         f_read = dicSwitch('gz')
326 |     else:
327 |         f_read = dicSwitch('norm')
328 |     with open(pre, "w") as w:
329 |         with f_read(FQ, 'rb') as fq:
330 |             if FQ.endswith('.gz'):
331 |                 fq = io.BufferedReader(fq)
332 |             for line in fq:
333 |                 c += 1
334 |                 if c == 1:
335 |                     if line.split()[0][1:] in ids:
336 |                         P = True
337 |                         w.write(line)
338 |                 elif P and c < 4:
339 |                     w.write(line)
340 |                 elif c >= 4:
341 |                     if P:
342 |                         w.write(line)
343 |                     c = 0
344 |                     P = False
345 | 
346 | 
347 | def get_filenames(seq_sum, ids):
348 |     '''
349 |     match read ids with seq_sum to pull filenames
350 |     '''
351 |     # for when using seq_sum for filtering, and not fq,paf,flat
352 |     ss_only = False
353 |     if not ids:
354 |         ss_only = True
355 |         ids = set()
356 |     head = True
357 |     files = set()
358 |     if seq_sum.endswith('.gz'):
359 |         f_read = dicSwitch('gz')
360 |     else:
361 |         f_read = dicSwitch('norm')
362 |     with f_read(seq_sum, 'rb') as sz:
363 |         if seq_sum.endswith('.gz'):
364 |             sz = io.BufferedReader(sz)
365 |         for line in sz:
366 |             if head:
367 |                 head = False
368 |                 continue
369 |             line = line.strip('\n')
370 |             line = line.split()
371 |             if ss_only:
372 |                 files.add(line[0])
373 |                 ids.add(line[1])
374 |             else:
375 |                 if line[1] in ids:
376 |                     files.add(line[0])
377 |     return files, ids
378 | 
379 | 
380 | def get_paths(index_file, filenames, f5=None):
381 |     '''
382 |     Read index and extract full paths for file extraction
383 |     '''
384 |     tar = False
385 |     paths = []
386 |     c = 0
387 |     if index_file.endswith('.gz'):
388 |         f_read = dicSwitch('gz')
389 |     else:
390 |         f_read = dicSwitch('norm')
391 |     # detect normal or tars
392 |     with f_read(index_file, 'rb') as idz:
393 |         if index_file.endswith('.gz'):
394 |             idz = io.BufferedReader(idz)
395 |         for line in idz:
396 |             line = line.strip('\n')
397 |             c += 1
398 |             if c > 10:
399 |                 break
400 |             if line.endswith('.tar'):
401 |                 tar = True
402 |                 break
403 |     # extract paths
404 |     with f_read(index_file, 'rb') as idz:
405 |         if index_file.endswith('.gz'):
406 |             idz = io.BufferedReader(idz)
407 |         for line in idz:
408 |             line = line.strip('\n')
409 |             if tar:
410 |                 if line.endswith('.tar'):
411 |                     path = line
412 |                 elif line.endswith('.fast5'):
413 |                     f = line.split('/')[-1]
414 |                     if f in filenames:
415 |                         paths.append([path, line])
416 |                 else:
417 |                     continue
418 |             else:
419 |                 if line.endswith('.fast5'):
420 |                     f = line.split('/')[-1]
421 |                     if f in filenames:
422 |                         paths.append(['', line])
423 |                 else:
424 |                     continue
425 | 
426 |     return paths
427 | 
428 | 
429 | def extract_file(args, path, filename):
430 |     '''
431 |     Do the extraction.
432 |     I was using the tarfile python lib, but honestly, it sucks and was too volatile.
433 |     if you have a better suggestion, let me know :)
434 |     That --transform hack is awesome btw. Blows away all the leading folders. use
435 |     cp for when using untarred structures. Not recommended, but here for completeness.
436 | 
437 |     --transform not working on MacOS. Need to use gtar
438 |     Thanks to Kai Martin for picking that one up!
439 | 
440 |     '''
441 |     OSystem = ""
442 |     OSystem = args.OSystem
443 |     save_path = args.output
444 |     if path.endswith('.tar'):
445 |         if OSystem in ["Linux", "Windows"]:
446 |             cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format(
447 |                 path, save_path, filename)
448 |         elif OSystem == "Darwin":
449 |             cmd = "gtar -xf {} --transform='s/.*\///' -C {} {}".format(
450 |                 path, save_path, filename)
451 |         else:
452 |             print >> sys.stderr, "Unsupported OSystem, trying Tar anyway, OS:", OSystem
453 |             cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format(
454 |                 path, save_path, filename)
455 |     else:
456 |         cmd = "cp {} {}".format(filename, save_path)
457 |     subprocess.call(cmd, shell=True, executable='/bin/bash')
458 | 
459 | 
460 | if __name__ == '__main__':
461 |     main()
462 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # fast5_fetcher
  2 | 
  3 | #### Doing the heavy lifting for you.
  4 | 
  5 | <p align="left"><img src="images/fetch.jpg" alt="fast5_fetcher" width="30%" height="30%"></p>
  6 | 
  7 | **fast5_fetcher** is a tool for fetching nanopore fast5 files to save time and simplify downstream analysis.
  8 | 
  9 | 
 10 | ## **fast5_fetcher is now part of SquiggleKit located [here](https://github.com/Psy-Fer/SquiggleKit)**
 11 | ### Please use and cite SquiggleKit as it is the most up to date
 12 | 
 13 | 
 14 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)
 15 | 
 16 | ## Contents
 17 | 
 18 | <!--ts-->
 19 | 
 20 | -   [Background](#background)
 21 | -   [Requirements](#requirements)
 22 | -   [Installation](#installation)
 23 | -   [Getting Started](<#getting started>)
 24 |     -   [File structures](<#file structures>)
 25 |         -   [1. Raw structure (not preferred)](<#1. Raw structure>)
 26 |         -   [2. Local basecalled structure](<#2. Local basecalled structure>)
 27 |         -   [3. Parallel basecalled structure](<#3. Parallel basecalled structure>)
 28 |     -   [Inputs](#inputs)
 29 | -   [Instructions for use](<#Instructions for use>)
 30 |     -   [Quick start](<#Quick start>)
 31 |     -   [fast5_fetcher.py](#fast5_fetcher.py)
 32 |         -   [Examples](#Examples)
 33 |     -   [batch_tater.py](#batch_tater.py)
 34 | -   [Acknowledgements](#acknowledgements)
 35 | -   [Cite](#cite)
 36 | -   [License](#license)
 37 |     <!--te-->
 38 | 
 39 | # Background
 40 | 
 41 | Reducing the number of fast5 files per folder in a single experiment was a welcomed addition to MinKnow. However this also made it rather useful for manual basecalling on a cluster, using array jobs, where each folder is basecalled individually, producing its own `sequencing_summary.txt`, `reads.fastq`, and reads folder containing the newly basecalled fast5s. Taring those fast5 files up into a single file was needed to keep the sys admins at bay, complaining about our millions of individual files on their drives. This meant, whenever there was a need to use the fast5 files from an experiment, or many experiments, unpacking the fast5 files was a significant hurdle both in time and disk space.
 42 | 
 43 | **fast5_fetcher** was built to  address this bottleneck. By building an index file of the tarballs, and using the `sequencing_summary.txt` file to match readIDs with fast5 filenames, only the fast5 files you need can be  extracted, either temporarily in a pipeline, or permanently, reducing space and simplifying downstream work flows.
 44 | 
 45 | # Requirements
 46 | 
 47 | Following a self imposed guideline, most things written to handle nanopore data or bioinformatics in general, will use as little 3rd party libraries as possible, aiming for only core libraries, or have all included files in the package.
 48 | 
 49 | In the case of `fast5_fetcher.py` and `batch_tater.py`, only core python libraries are used. So as long as **Python 2.7+** is present, everything should work with no extra steps. (Python 3 compatibility is coming in the next big update)
 50 | 
 51 | ##### Operating system:
 52 | 
 53 | There is one catch. Everything is written primarily for use with **Linux**. Due to **MacOS** running on Unix, so long as the GNU tools are installed (see below), there should be minimal issues running it. **Windows 10** however may require more massaging to work with the new Linux integration.
 54 | 
 55 | # Getting Started
 56 | 
 57 | Building an index of fast5 files and their paths, as well as a simple bash script to control the workflow, be it on a local machine, or HPC, will depend on the starting file structure.
 58 | 
 59 | ## File structures
 60 | 
 61 | The file structure is not overly important, however it will modify some of the commands used in the examples. I have endeavoured to include a few diverse uses, starting from different file states, but of course, I can't think of everything, so if there is something you wish to accomplished with `fast5_fetcher.py`, but can't quite get it to work for you, let me know, and perhaps I can make it easier for you.
 62 | 
 63 | #### 1. Raw structure (not preferred)
 64 | 
 65 | This is the most basic structure, where all files are present in an accessible state.
 66 | 
 67 |     ├── huntsman.fastq
 68 |     ├── sequencing_summary.txt       
 69 |     ├── huntsman_reads/              # Read folder
 70 |     │   ├── 0/                       # individual folders containing ~4000 fast5s
 71 |     |   |   ├── huntsman_read1.fast5
 72 |     |   |   └── huntsman_read2.fast5
 73 |     |   |   └── ...
 74 |     |   ├── 1/
 75 |     |   |   ├── huntsman_read#.fast5
 76 |     |   |   └── ...
 77 |     └── ├── ...
 78 | 
 79 | #### 2. Local basecalled structure
 80 | 
 81 | This structure is the typical structure post local basecalling
 82 | fastq and sequencing_summary files have been gzipped and the folders in the reads folder have been tarballed into one large file
 83 | 
 84 |     ├── huntsman.fastq.gz            # gzipped
 85 |     ├── sequencing_summary.txt.gz    # gzipped
 86 |     ├── huntsman_reads.tar           # Tarballed read folder
 87 |         |                            # Tarball expanded
 88 |         |-->│   ├── 0/               # individual folders inside tarball
 89 |             |   |   ├── huntsman_read1.fast5
 90 |             |   |   └── huntsman_read2.fast5
 91 |             |   |   └── ...
 92 |             |   ├── 1/
 93 |             |   |   ├── huntsman_read#.fast5
 94 |             |   |   └── ...
 95 |             └── ├── ...
 96 | 
 97 | #### 3. Parallel basecalled structure
 98 | 
 99 | This structure is post massively parallel basecalling, and looks like multiples of the above structure.
100 | 
101 |     ├── fastq/
102 |     |   ├── huntsman.1.fastq.gz
103 |     |   └── huntsman.2.fastq.gz
104 |     |   └── huntsman.3.fastq.gz
105 |     |   └── ...
106 |     ├── logs/
107 |     |    ├── sequencing_summary.1.txt.gz
108 |     |    └── sequencing_summary.2.txt.gz
109 |     |    └── sequencing_summary.3.txt.gz
110 |     |    └── ...
111 |     ├── fast5/
112 |     |    ├── 1.tar
113 |     |    └── 2.tar
114 |     |    └── 3.tar
115 |     |    └── ...
116 | 
117 | With this structure, combining the `.fastq` and `sequencing_summary.txt.gz` files is needed.
118 | 
119 | ##### Combine fastq.gz files
120 | 
121 | ```bash
122 | for file in fastq/*.fastq.gz; do cat $file; done >> huntsman.fastq.gz
123 | ```
124 | 
125 | ##### Combine sequencing_summary.txt.gz files
126 | 
127 | ```bash
128 | # create header
129 | zcat $(ls logs/sequencing_summary*.txt.gz | head -1) | head -1 > sequencing_summary.txt
130 | 
131 | # combine all files, skipping first line header
132 | for file in logs/sequencing_summary*.txt.gz; do zcat $file | tail -n +2; done >> sequencing_summary.txt
133 | 
134 | gzip sequencing_summary.txt
135 | ```
136 | 
137 | You should then have something like this:
138 | 
139 |     ├── huntsman.fastq.gz            # gzipped
140 |     ├── sequencing_summary.txt.gz    # gzipped
141 |     ├── fast5/                       # fast5 folder
142 |     |    ├── 1.tar                   # each tar contains ~4000 fast5 files
143 |     |    └── 2.tar
144 |     |    └── 3.tar
145 |     |    └── ...
146 | 
147 | ## Inputs
148 | 
149 | It takes 3 files as input:
150 | 
151 | 1.  fastq, paf, or flat (.gz)
152 | 2.  sequencing_summary.txt(.gz)
153 | 3.  name.index(.gz)
154 | 
155 | #### 1. fastq, paf, or flat
156 | 
157 | This is where the readIDs are collected, to be matched with their respective fast5 files for fetching. The idea being, that some form of selection has occurred to generate the files.
158 | 
159 | In the case of a **fastq**, it may be filtered for all the reads above a certain quality, or from a particular barcode after running barcode detection.
160 | 
161 | For the **paf** file, it is an alignment output of minimap2. This can be used to fetch only the fast5 files that align to some reference, or has been filtered to only contain the reads that align to a particular region of interest.
162 | 
163 | A **flat** file in this case is just a file that contains a list of readIDs, one on each line. This allows the user to generate any list of reads to fetch from any other desired method.
164 | 
165 | Each of these files can be gzipped or not.
166 | 
167 | See examples below for example test cases.
168 | 
169 | #### 2. Sequencing summary
170 | 
171 | The `sequencing_summary.txt` file is created by the basecalling software, (Albacore, Guppy), and contains information about each read, including the readID and fast5 file name, along with length, quality scores, and potentially barcode information.
172 | 
173 | There is a shortcut method in which you can use the `sequencing_summary.txt` only, without the need for a fastq, paf, or flat file. In this case, leave the `-q`, `-f`, `-r` fields empty.
174 | 
175 | This file can be gzipped or not.
176 | 
177 | #### 3. Building the index
178 | 
179 | How the index is built depends on which file structure you are using. It will work with both tarred and un-tarred file structures. Tarred is preferred.
180 | 
181 | ##### - Raw structure (not preferred)
182 | 
183 | ```bash
184 | for file in $(pwd)/reads/*/*;do echo $file; done >> name.index
185 | 
186 | gzip name.index
187 | ```
188 | 
189 | ##### - Local basecalled structure
190 | 
191 | ```bash
192 | for file in $(pwd)/reads.tar; do echo $file; tar -tf $file; done >> name.index
193 | 
194 | gzip name.index
195 | ```
196 | 
197 | ##### - Parallel basecalled structure
198 | 
199 | ```bash
200 | for file in $(pwd)/fast5/*fast5.tar; do echo $file; tar -tf $file; done >> name.index
201 | ```
202 | 
203 | If you have multiple experiments, then cat them all together and gzip.
204 | 
205 | ```bash
206 | for file in ./*.index; do cat $file; done >> ../all.name.index
207 | 
208 | gzip all.name.index
209 | ```
210 | 
211 | ## Instructions for use
212 | 
213 | Download the repository:
214 | 
215 |     git clone https://github.com/Psy-Fer/fast5_fetcher.git
216 | 
217 | If using MacOS, and NOT using homebrew, install it here:
218 | 
219 |     https://brew.sh/
220 | 
221 | then install gnu-tar with:
222 | 
223 |     brew install gnu-tar
224 | 
225 | ### Quick start
226 | 
227 | Basic use on a local computer
228 | 
229 | **fastq**
230 | 
231 | ```bash
232 | python fast5_fetcher.py -q my.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5
233 | ```
234 | 
235 | **paf**
236 | 
237 | ```bash
238 | python fast5_fetcher.py -p my.paf -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5
239 | ```
240 | 
241 | **flat**
242 | 
243 | ```bash
244 | python fast5_fetcher.py -f my_flat.txt.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5
245 | ```
246 | 
247 | **sequencing_summary.txt only**
248 | 
249 | ```bash
250 | python fast5_fetcher.py -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5
251 | ```
252 | 
253 | See examples below for use on an **HPC** using **SGE**
254 | 
255 | ## fast5_fetcher.py
256 | 
257 | #### Full usage
258 | 
259 |     usage: fast5_fetcher.py [-h] [-q FASTQ | -p PAF | -f FLAT] [--OSystem OSYSTEM]
260 |                             [-s SEQ_SUM] [-i INDEX] [-o OUTPUT] [-t]
261 |                             [-l TRIM_LIST] [-x PREFIX] [-z]
262 | 
263 |     fast_fetcher - extraction of specific nanopore fast5 files
264 | 
265 |     optional arguments:
266 |       -h, --help            show this help message and exit
267 |       -q FASTQ, --fastq FASTQ
268 |                             fastq.gz for read ids
269 |       -p PAF, --paf PAF     paf alignment file for read ids
270 |       -f FLAT, --flat FLAT  flat file of read ids
271 |       --OSystem OSYSTEM     running operating system - leave default unless doing
272 |                             odd stuff
273 |       -s SEQ_SUM, --seq_sum SEQ_SUM
274 |                             sequencing_summary.txt.gz file
275 |       -i INDEX, --index INDEX
276 |                             index.gz file mapping fast5 files in tar archives
277 |       -o OUTPUT, --output OUTPUT
278 |                             output directory for extracted fast5s
279 |       -t, --trim            trim files as if standalone experiment, (fq, SS)
280 |       -l TRIM_LIST, --trim_list TRIM_LIST
281 |                             list of file names to trim, comma separated. fastq
282 |                             only needed for -p and -f modes
283 |       -x PREFIX, --prefix PREFIX
284 |                             trim file prefix, eg: barcode_01, output:
285 |                             barcode_01.fastq, barcode_01_seq_sum.txt
286 |       -z, --pppp            Print out tar commands in batches for further
287 |                             processing
288 | 
289 | ## Examples
290 | 
291 | Fast5 Fetcher was originally built to work with **Sun Grid Engine** (SGE), exploiting the heck out of array jobs. Although it can work locally and on untarred file structures, when operating on multiple sequencing experiments, with file structures scattered across a file system, is when fast5 fetcher starts to make a difference.
292 | 
293 | ### SGE examples
294 | 
295 | After creating the fastq/paf/flat, sequencing_summary, and index files, create an SGE file.
296 | 
297 | Note the use of `${SGE_TASK_ID}` to use the array job as the pointer to a particular file
298 | 
299 | #### After barcode demultiplexing
300 | 
301 | Given a similar structure and naming convention, it is possible to group the fast5 files by barcode in the following manner.
302 | 
303 |     ├── BC_1.fastq.gz                # Barcode 1
304 |     ├── BC_2.fastq.gz                # Barcode 2
305 |     ├── BC_3.fastq.gz                # ...
306 |     ├── BC_4.fastq.gz          
307 |     ├── BC_5.fastq.gz            
308 |     ├── BC_6.fastq.gz           
309 |     ├── BC_7.fastq.gz           
310 |     ├── BC_8.fastq.gz           
311 |     ├── BC_9.fastq.gz           
312 |     ├── BC_10.fastq.gz           
313 |     ├── BC_11.fastq.gz            
314 |     ├── BC_12.fastq.gz            
315 |     ├── unclassified.fastq.gz        # unclassified reads (skipped by fast5_fetcher in this example, rename BC_13 to simple fold it into the example)        
316 |     ├── sequencing_summary.txt.gz    # gzipped
317 |     ├── barcoded.index.gz            # index file containing fast5 file paths
318 |     ├── fast5/                       # fast5 folder, unsorted
319 |     |    ├── 1.tar                   # each tar contains ~4000 fast5 files
320 |     |    └── 2.tar
321 |     |    └── 3.tar
322 |     |    └── ...
323 | 
324 | #### fetch.sge
325 | 
326 | ```bash
327 | # activate virtual python environment
328 | # most HPC will use something like "module load"
329 | source ~/work/venv2714/bin/activate
330 | 
331 | # Creaete output directory to take advantage of NVME drives on cluster local
332 | mkdir ${TMPDIR}/fast5
333 | 
334 | # Run fast_fetcher on each barcode after demultiplexing
335 | time python fast5_fetcher.py -r ./BC_${SGE_TASK_ID}.fastq.gz -s sequencing_summary.txt.gz -i barcoded.index.gz -o ${TMPDIR}/fast5/
336 | 
337 | # tarball the extracted reads into a single tar file
338 | # Can also split the reads into groups of ~4000 if needed
339 | tar -cf ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
340 | # Copy from HPC drives to working dir.
341 | cp ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar ./
342 | ```
343 | 
344 | #### Create CMD and launch
345 | 
346 | ```bash
347 | # current working dir, with 1 CPU, array jobs 1 to 12
348 | # Modify memory settings as required
349 | CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_vmem=20G,tmp_requested=500G ./fetch.sge"
350 | 
351 | echo $CMD && $CMD
352 | ```
353 | 
354 | ## Trimming fastq and sequencing_summary files
355 | 
356 | By using the `-t, --trim` option, each barcode will also have its own sequencing_summary file for downstream analysis. This is particularly useful if each barcode is a different sample or experiment, as the output is as if it was it's own individual flowcell.
357 | 
358 | This method can also trim fastq, and sequencing_summary files when using the **paf** or **flat** methods. By using the prefix option, you can label the output names, otherwise generic defaults will be used.
359 | 
360 | ## batch_tater.py
361 | 
362 | Potato scripting engaged
363 | 
364 | This is designed to run on the output files from `fast5_fetcher.py` using option `-z`. This writes out file lists for each tarball that contains reads you want to process. Then `batch_tater.py` can read those files, to open the individual tar files, and extract the files, meaning the file is only opened once.
365 | 
366 | A recent test using the -z option on ~2.2Tb of data, across ~11/27 million files took about 10min (1CPU) to write and organise the file lists with fast5_fetch.py, and about 20s per array job to extract and repackage with batch_tater.py.
367 | 
368 | This is best used when you want to do something all at once and filter your reads. Other approaches may be better when you are demultiplexing.
369 | 
370 | #### Usage:
371 | 
372 | Run on SGE using array jobs as a hacky way of doing multiprocessing.
373 | Also, helps check when things go wrong, and easy to relaunch failed jobs.
374 | 
375 | #### batch.sge
376 | 
377 | ```bash
378 | source ~/work/venv2714/bin/activate
379 | 
380 | FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p)
381 | BLAH=fast5/${FILE}
382 | 
383 | mkdir ${TMPDIR}/fast5
384 | 
385 | time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/
386 | 
387 | echo "size of files:" >&2
388 | du -shc ${TMPDIR}/fast5/ >&2
389 | echo "extraction complete!" >&2
390 | echo "Number of files:" >&2
391 | ls ${TMPDIR}/fast5/ | wc -l >&2
392 | 
393 | echo "copying data..." >&2
394 | 
395 | tar -cf ${TMPDIR}/batch.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
396 | cp ${TMPDIR}/batch.${SGE_TASK_ID}.tar ./batched_fast5/
397 | ```
398 | 
399 | #### Create CMD and launch
400 | 
401 | ```bash
402 | CMD="qsub -cwd -V -pe smp 1 -N batch -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=200G ../batch.sge"
403 | 
404 | echo $CMD && $CMD
405 | ```
406 | 
407 | ## Acknowledgements
408 | 
409 | I would like to thank the rest of my lab (Shaun Carswell, Kirston Barton, Kai Martin) in Genomic Technologies team from the [Garvan Institute](https://www.garvan.org.au/) for their feedback on the development of this tool.
410 | 
411 | ## Cite
412 | 
413 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)
414 | 
415 | James M. Ferguson, & Martin A. Smith. (2018, September 12). Psy-Fer/fast5_fetcher: Initial release of fast5_fetcher (Version v1.0). Zenodo. <http://doi.org/10.5281/zenodo.1413903>
416 | 
417 | ## License
418 | 
419 | [The MIT License](https://opensource.org/licenses/MIT)
420 | 


--------------------------------------------------------------------------------