├── images └── fetch.jpg ├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── LICENSE ├── batch_tater.py ├── fast5_fetcher.py └── README.md /images/fetch.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Psy-Fer/fast5_fetcher/HEAD/images/fetch.jpg -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | 5 | --- 6 | 7 | **Describe the bug** 8 | A clear and concise description of what the bug is. 9 | 10 | **To Reproduce** 11 | Steps to reproduce the behavior: 12 | 13 | **Expected behavior** 14 | A clear and concise description of what you expected to happen. 15 | 16 | **Screenshots** 17 | If applicable, add screenshots to help explain your problem. 18 | 19 | **Desktop (please complete the following information):** 20 | - OS: [e.g. MacOS, Ubuntu] 21 | 22 | **Additional context** 23 | Add any other context about the problem here. 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | 5 | --- 6 | 7 | **Is your feature request related to a problem? Please describe.** 8 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 9 | 10 | **Describe the solution you'd like** 11 | A clear and concise description of what you want to happen. 12 | 13 | **Describe alternatives you've considered** 14 | A clear and concise description of any alternative solutions or features you've considered. 15 | 16 | **Additional context** 17 | Add any other context or screenshots about the feature request here. 
18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 James Ferguson 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /batch_tater.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import subprocess 3 | ''' 4 | Potato scripting engaged. 5 | 6 | James M. Ferguson (j.ferguson@garvan.org.au) 7 | Genomic Technologies 8 | Garvan Institute 9 | Copyright 2017 10 | 11 | batch_tater.py takes list/s of files to extract, and speeds it up a bit, by only opening 12 | one tar file at a time and extracting what is needed. 13 | 14 | To run on sun grid engine using array jobs as a hacky way of doing multiprocessing. 
15 | Also, helps check when things go wrong, and easy to relaunch failed jobs. 16 | Some things left in from running on some tasty nanopore single cell data. 17 | 18 | sge file: 19 | 20 | source ~/work/venv2714/bin/activate 21 | 22 | FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p) 23 | BLAH=fast5/${FILE} 24 | 25 | mkdir ${TMPDIR}/fast5 26 | 27 | time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/ 28 | 29 | echo "size of files:" >&2 30 | du -shc ${TMPDIR}/fast5/ >&2 31 | echo "extraction complete!" >&2 32 | echo "Number of files:" >&2 33 | ls ${TMPDIR}/fast5/ | wc -l >&2 34 | 35 | echo "copying data..." >&2 36 | 37 | tar -cf ${TMPDIR}/f5f.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5 38 | cp ${TMPDIR}/f5f.${SGE_TASK_ID}.tar ./clean_f5s/ 39 | 40 | CMD: 41 | 42 | CMD="qsub -cwd -V -pe smp 1 -N batchCln -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=20G ../batch.sge" 43 | 44 | Launch: 45 | 46 | echo $CMD && $CMD 47 | 48 | 49 | stats: 50 | 51 | fastq: 27491304 52 | mapped: 11740093 53 | z mode time: 10min 54 | batch_tater total time: 21min 55 | per job time: ~28s 56 | number of CPUs: 100 57 | ''' 58 | 59 | # being lazy and using sys.argv...i mean, it is pretty lit 60 | master = sys.argv[1] 61 | tar_list = sys.argv[2] 62 | save_path = sys.argv[3] 63 | 64 | # this will probs need to be changed based on naming convention 65 | # I think i was a little tired when I wrote this 66 | list_name = tar_list.split('/')[-1] 67 | 68 | PATH = 0 69 | 70 | # not elegant, but gets it done 71 | with open(master, 'r') as f: 72 | for l in f: 73 | l = l.strip('\n') 74 | l = l.split('\t') 75 | if l[0] == list_name: 76 | PATH = l[1] 77 | break 78 | 79 | # for stats later and easy job relaunching 80 | print >> sys.stderr, "extracting:", tar_list 81 | # do the thing. That --transform hack is awesome. Blows away all the leading folders. 
82 | if PATH: 83 | cmd = "tar -xf {} --transform='s/.*\///' -C {} -T {}".format( 84 | PATH, save_path, tar_list) 85 | subprocess.call(cmd, shell=True, executable='/bin/bash') 86 | 87 | else: 88 | print >> sys.stderr, "PATH not found! check index nooblet" 89 | print >> sys.stderr, "inputs:", master, tar_list, save_path 90 | -------------------------------------------------------------------------------- /fast5_fetcher.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import gzip 4 | import io 5 | import subprocess 6 | import traceback 7 | import argparse 8 | import platform 9 | from functools import partial 10 | ''' 11 | 12 | James M. Ferguson (j.ferguson@garvan.org.au) 13 | Genomic Technologies 14 | Garvan Institute 15 | Copyright 2017 16 | 17 | fast5_fetcher is designed to help manage fast5 file data storage and organisation. 18 | It takes 3 files as input: fastq/paf/flat, sequencing_summary, index 19 | 20 | -------------------------------------------------------------------------------------- 21 | version 0.0 - initial 22 | version 0.2 - added argparser and buffered gz streams 23 | version 0.3 - added paf input 24 | version 0.4 - added read id flat file input 25 | version 0.5 - pppp print output instead of extracting 26 | version 0.6 - did a dumb. 
changed x in s to set/dic entries O(n) vs O(1) 27 | version 0.7 - cleaned up a bit to share and removed some hot and steamy features 28 | version 0.8 - Added functionality for un-tarred file structures and seq_sum only 29 | version 1.0 - First release 30 | version 1.1 - refactor with dicswitch and batch_tater updates 31 | version 1.1.1 - Bug fix on --transform method, added OS detection 32 | version 1.2.0 - Added file trimming to fully segment selection 33 | 34 | TODO: 35 | - Python 3 compatibility 36 | - autodetect file structures 37 | - autobuild index file - make it a sub script as well 38 | - Consider using csv.DictReader() instead of wheel building 39 | - flesh out batch_tater and give better examples and clearer how-to 40 | - options to build new index of fetched fast5s 41 | 42 | ----------------------------------------------------------------------------- 43 | MIT License 44 | 45 | Copyright (c) 2017 James Ferguson 46 | 47 | Permission is hereby granted, free of charge, to any person obtaining a copy 48 | of this software and associated documentation files (the "Software"), to deal 49 | in the Software without restriction, including without limitation the rights 50 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 51 | copies of the Software, and to permit persons to whom the Software is 52 | furnished to do so, subject to the following conditions: 53 | 54 | The above copyright notice and this permission notice shall be included in all 55 | copies or substantial portions of the Software. 56 | 57 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 58 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 59 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 60 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 61 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 62 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 63 | SOFTWARE. 64 | ''' 65 | 66 | 67 | class MyParser(argparse.ArgumentParser): 68 | def error(self, message): 69 | sys.stderr.write('error: %s\n' % message) 70 | self.print_help() 71 | sys.exit(2) 72 | 73 | 74 | def main(): 75 | ''' 76 | do the thing 77 | ''' 78 | parser = MyParser( 79 | description="fast_fetcher - extraction of specific nanopore fast5 files") 80 | group = parser.add_mutually_exclusive_group() 81 | group.add_argument("-q", "--fastq", 82 | help="fastq.gz for read ids") 83 | group.add_argument("-p", "--paf", 84 | help="paf alignment file for read ids") 85 | group.add_argument("-f", "--flat", 86 | help="flat file of read ids") 87 | parser.add_argument("--OSystem", default=platform.system(), 88 | help="running operating system - leave default unless doing odd stuff") 89 | parser.add_argument("-s", "--seq_sum", 90 | help="sequencing_summary.txt.gz file") 91 | parser.add_argument("-i", "--index", 92 | help="index.gz file mapping fast5 files in tar archives") 93 | parser.add_argument("-o", "--output", 94 | help="output directory for extracted fast5s") 95 | parser.add_argument("-t", "--trim", action="store_true", 96 | help="trim files as if standalone experiment, (fq, SS)") 97 | parser.add_argument("-l", "--trim_list", 98 | help="list of file names to trim, comma separated. 
fastq only needed for -p and -f modes") 99 | parser.add_argument("-x", "--prefix", default="default", 100 | help="trim file prefix, eg: barcode_01, output: barcode_01.fastq, barcode_01_seq_sum.txt") 101 | # parser.add_argument("-t", "--procs", type=int, 102 | # help="Number of CPUs to use - TODO: NOT YET IMPLEMENTED") 103 | parser.add_argument("-z", "--pppp", action="store_true", 104 | help="Print out tar commands in batches for further processing") 105 | args = parser.parse_args() 106 | 107 | # print help if no arguments given 108 | if len(sys.argv) == 1: 109 | parser.print_help(sys.stderr) 110 | sys.exit(1) 111 | 112 | print >> sys.stderr, "Starting things up!" 113 | 114 | p_dic = {} 115 | if args.pppp: 116 | print >> sys.stderr, "PPPP state! Not extracting, exporting tar commands" 117 | 118 | trim_pass = False 119 | if args.trim: 120 | SS = False 121 | FQ = False 122 | if args.trim_list: 123 | A = args.trim_list.split(',') 124 | for a in A: 125 | if "fastq" in a: 126 | FQ = a 127 | elif "txt" in a: 128 | SS = a 129 | else: 130 | print >> sys.stderr, "Unknown trim input. Detects 'fastq' or 'txt' for files. Input:", a 131 | else: 132 | print >> sys.stderr, "No extra files given. Compatible with -q fastq input only" 133 | 134 | if args.fastq: 135 | FQ = args.fastq 136 | if args.seq_sum: 137 | SS = args.seq_sum 138 | 139 | # final check 140 | if FQ and SS: 141 | trim_pass = True 142 | print >> sys.stderr, "Trim setting detected. Writing to working directory" 143 | else: 144 | print >> sys.stderr, "Unable to verify both fastq and sequencing_summary files. Please check filenames and try again. Exiting..." 
145 | sys.exit() 146 | 147 | ids = [] 148 | if args.fastq: 149 | ids = get_fq_reads(args.fastq) 150 | if trim_pass: 151 | trim_SS(args, ids, SS) 152 | elif args.paf: 153 | ids = get_paf_reads(args.paf) 154 | if trim_pass: 155 | trim_both(args, ids, FQ, SS) 156 | elif args.flat: 157 | ids = get_flat_reads(args.flat) 158 | if trim_pass: 159 | trim_both(args, ids, FQ, SS) 160 | if not ids and trim_pass: 161 | filenames, ids = get_filenames(args.seq_sum, ids) 162 | trim_both(args, ids, FQ, SS) 163 | else: 164 | filenames, ids = get_filenames(args.seq_sum, ids) 165 | 166 | paths = get_paths(args.index, filenames) 167 | print >> sys.stderr, "extracting..." 168 | # place multiprocessing pool here 169 | for p, f in paths: 170 | if args.pppp: 171 | if p in p_dic: 172 | p_dic[p].append(f) 173 | else: 174 | p_dic[p] = [f] 175 | continue 176 | else: 177 | try: 178 | extract_file(args, p, f) 179 | except: 180 | traceback.print_exc() 181 | print >> sys.stderr, "Failed to extract:", p, f 182 | # For each .tar file, write a file with the tarball name as filename.tar.txt 183 | # and contains a list of files to extract - input for batch_tater.py 184 | if args.pppp: 185 | with open("tater_master.txt", 'w') as m: 186 | for i in p_dic: 187 | fname = "tater_" + i.split('/')[-1] + ".txt" 188 | m_entry = "{}\t{}".format(fname, i) 189 | fname = args.output + "/tater_" + i.split('/')[-1] + ".txt" 190 | m.write(m_entry) 191 | m.write('\n') 192 | with open(fname, 'w') as f: 193 | for j in p_dic[i]: 194 | f.write(j) 195 | f.write('\n') 196 | 197 | print >> sys.stderr, "done!" 
198 | 199 | 200 | def dicSwitch(i): 201 | ''' 202 | A switch to handle file opening and reduce duplicated code 203 | ''' 204 | open_method = { 205 | "gz": gzip.open, 206 | "norm": open 207 | } 208 | return open_method[i] 209 | 210 | 211 | def get_fq_reads(fastq): 212 | ''' 213 | read fastq file and extract read ids 214 | quick and dirty to limit library requirements - still bullet fast 215 | ''' 216 | c = 0 217 | read_ids = set() 218 | if fastq.endswith('.gz'): 219 | f_read = dicSwitch('gz') 220 | else: 221 | f_read = dicSwitch('norm') 222 | with f_read(fastq, 'rb') as fq: 223 | if fastq.endswith('.gz'): 224 | fq = io.BufferedReader(fq) 225 | for line in fq: 226 | c += 1 227 | line = line.strip('\n') 228 | if c == 1: 229 | idx = line.split()[0][1:] 230 | read_ids.add(idx) 231 | elif c >= 4: 232 | c = 0 233 | return read_ids 234 | 235 | 236 | def get_paf_reads(reads): 237 | ''' 238 | Parse paf file to pull read ids (from minimap2 alignment) 239 | ''' 240 | read_ids = set() 241 | if reads.endswith('.gz'): 242 | f_read = dicSwitch('gz') 243 | else: 244 | f_read = dicSwitch('norm') 245 | with f_read(reads, 'rb') as fq: 246 | if reads.endswith('.gz'): 247 | fq = io.BufferedReader(fq) 248 | for line in fq: 249 | line = line.strip('\n') 250 | line = line.split() 251 | read_ids.add(line[0]) 252 | return read_ids 253 | 254 | 255 | def get_flat_reads(filename): 256 | ''' 257 | Parse a flat file separated by line breaks \n 258 | TODO: make @ symbol check once, as they should all be the same 259 | ''' 260 | read_ids = set() 261 | check = True 262 | if filename.endswith('.gz'): 263 | f_read = dicSwitch('gz') 264 | else: 265 | f_read = dicSwitch('norm') 266 | with f_read(filename, 'rb') as fq: 267 | if filename.endswith('.gz'): 268 | fq = io.BufferedReader(fq) 269 | for line in fq: 270 | line = line.strip('\n') 271 | if check: 272 | if line[0] == '@': 273 | x = 1 274 | else: 275 | x = 0 276 | check = False 277 | idx = line[x:] 278 | read_ids.add(idx) 279 | return read_ids 280 | 
281 | 282 | def trim_SS(args, ids, SS): 283 | ''' 284 | Trims the sequencing_summary.txt file to only the input IDs 285 | ''' 286 | if args.prefix: 287 | pre = args.prefix + "_seq_sum.txt" 288 | else: 289 | pre = "trimmed_seq_sum.txt" 290 | head = True 291 | if SS.endswith('.gz'): 292 | f_read = dicSwitch('gz') 293 | else: 294 | f_read = dicSwitch('norm') 295 | # make this compatible with dicSwitch 296 | with open(pre, "w") as w: 297 | with f_read(SS, 'rb') as sz: 298 | if SS.endswith('.gz'): 299 | sz = io.BufferedReader(sz) 300 | for line in sz: 301 | if head: 302 | w.write(line) 303 | head = False 304 | continue 305 | l = line.split() 306 | if l[1] in ids: 307 | w.write(line) 308 | 309 | 310 | def trim_both(args, ids, FQ, SS): 311 | ''' 312 | Trims the sequencing_summary.txt and fastq files to only the input IDs 313 | ''' 314 | # trim the SS 315 | trim_SS(args, ids, SS) 316 | if args.prefix: 317 | pre = args.prefix + ".fastq" 318 | else: 319 | pre = "trimmed.fastq" 320 | 321 | # trim the fastq 322 | c = 0 323 | P = False 324 | if FQ.endswith('.gz'): 325 | f_read = dicSwitch('gz') 326 | else: 327 | f_read = dicSwitch('norm') 328 | with open(pre, "w") as w: 329 | with f_read(FQ, 'rb') as fq: 330 | if FQ.endswith('.gz'): 331 | fq = io.BufferedReader(fq) 332 | for line in fq: 333 | c += 1 334 | if c == 1: 335 | if line.split()[0][1:] in ids: 336 | P = True 337 | w.write(line) 338 | elif P and c < 4: 339 | w.write(line) 340 | elif c >= 4: 341 | if P: 342 | w.write(line) 343 | c = 0 344 | P = False 345 | 346 | 347 | def get_filenames(seq_sum, ids): 348 | ''' 349 | match read ids with seq_sum to pull filenames 350 | ''' 351 | # for when using seq_sum for filtering, and not fq,paf,flat 352 | ss_only = False 353 | if not ids: 354 | ss_only = True 355 | ids = set() 356 | head = True 357 | files = set() 358 | if seq_sum.endswith('.gz'): 359 | f_read = dicSwitch('gz') 360 | else: 361 | f_read = dicSwitch('norm') 362 | with f_read(seq_sum, 'rb') as sz: 363 | if 
seq_sum.endswith('.gz'): 364 | sz = io.BufferedReader(sz) 365 | for line in sz: 366 | if head: 367 | head = False 368 | continue 369 | line = line.strip('\n') 370 | line = line.split() 371 | if ss_only: 372 | files.add(line[0]) 373 | ids.add(line[1]) 374 | else: 375 | if line[1] in ids: 376 | files.add(line[0]) 377 | return files, ids 378 | 379 | 380 | def get_paths(index_file, filenames, f5=None): 381 | ''' 382 | Read index and extract full paths for file extraction 383 | ''' 384 | tar = False 385 | paths = [] 386 | c = 0 387 | if index_file.endswith('.gz'): 388 | f_read = dicSwitch('gz') 389 | else: 390 | f_read = dicSwitch('norm') 391 | # detect normal or tars 392 | with f_read(index_file, 'rb') as idz: 393 | if index_file.endswith('.gz'): 394 | idz = io.BufferedReader(idz) 395 | for line in idz: 396 | line = line.strip('\n') 397 | c += 1 398 | if c > 10: 399 | break 400 | if line.endswith('.tar'): 401 | tar = True 402 | break 403 | # extract paths 404 | with f_read(index_file, 'rb') as idz: 405 | if index_file.endswith('.gz'): 406 | idz = io.BufferedReader(idz) 407 | for line in idz: 408 | line = line.strip('\n') 409 | if tar: 410 | if line.endswith('.tar'): 411 | path = line 412 | elif line.endswith('.fast5'): 413 | f = line.split('/')[-1] 414 | if f in filenames: 415 | paths.append([path, line]) 416 | else: 417 | continue 418 | else: 419 | if line.endswith('.fast5'): 420 | f = line.split('/')[-1] 421 | if f in filenames: 422 | paths.append(['', line]) 423 | else: 424 | continue 425 | 426 | return paths 427 | 428 | 429 | def extract_file(args, path, filename): 430 | ''' 431 | Do the extraction. 432 | I was using the tarfile python lib, but honestly, it sucks and was too volatile. 433 | if you have a better suggestion, let me know :) 434 | That --transform hack is awesome btw. Blows away all the leading folders. use 435 | cp for when using untarred structures. Not recommended, but here for completeness. 436 | 437 | --transform not working on MacOS. 
Need to use gtar 438 | Thanks to Kai Martin for picking that one up! 439 | 440 | ''' 441 | OSystem = "" 442 | OSystem = args.OSystem 443 | save_path = args.output 444 | if path.endswith('.tar'): 445 | if OSystem in ["Linux", "Windows"]: 446 | cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format( 447 | path, save_path, filename) 448 | elif OSystem == "Darwin": 449 | cmd = "gtar -xf {} --transform='s/.*\///' -C {} {}".format( 450 | path, save_path, filename) 451 | else: 452 | print >> sys.stderr, "Unsupported OSystem, trying Tar anyway, OS:", OSystem 453 | cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format( 454 | path, save_path, filename) 455 | else: 456 | cmd = "cp {} {}".format(filename, save_path) 457 | subprocess.call(cmd, shell=True, executable='/bin/bash') 458 | 459 | 460 | if __name__ == '__main__': 461 | main() 462 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # fast5_fetcher 2 | 3 | #### Doing the heavy lifting for you. 4 | 5 |

![fast5_fetcher](images/fetch.jpg)

6 | 7 | **fast5_fetcher** is a tool for fetching nanopore fast5 files to save time and simplify downstream analysis. 8 | 9 | 10 | ## **fast5_fetcher is now part of SquiggleKit located [here](https://github.com/Psy-Fer/SquiggleKit)** 11 | ### Please use and cite SquiggleKit as it is the most up to date 12 | 13 | 14 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903) 15 | 16 | ## Contents 17 | 18 | 19 | 20 | - [Background](#background) 21 | - [Requirements](#requirements) 22 | - [Installation](#installation) 23 | - [Getting Started](<#getting started>) 24 | - [File structures](<#file structures>) 25 | - [1. Raw structure (not preferred)](<#1. Raw structure>) 26 | - [2. Local basecalled structure](<#2. Local basecalled structure>) 27 | - [3. Parallel basecalled structure](<#3. Parallel basecalled structure>) 28 | - [Inputs](#inputs) 29 | - [Instructions for use](<#Instructions for use>) 30 | - [Quick start](<#Quick start>) 31 | - [fast5_fetcher.py](#fast5_fetcher.py) 32 | - [Examples](#Examples) 33 | - [batch_tater.py](#batch_tater.py) 34 | - [Acknowledgements](#acknowledgements) 35 | - [Cite](#cite) 36 | - [License](#license) 37 | 38 | 39 | # Background 40 | 41 | Reducing the number of fast5 files per folder in a single experiment was a welcome addition to MinKnow. However, this also made it rather useful for manual basecalling on a cluster, using array jobs, where each folder is basecalled individually, producing its own `sequencing_summary.txt`, `reads.fastq`, and reads folder containing the newly basecalled fast5s. Tarring those fast5 files up into a single file was needed to keep at bay the sys admins complaining about our millions of individual files on their drives. This meant that whenever there was a need to use the fast5 files from an experiment, or many experiments, unpacking the fast5 files was a significant hurdle in both time and disk space. 
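fast5_fetcher's core trick is to match read IDs to fast5 filenames via `sequencing_summary.txt`, then look up each filename's tarball in an index. A minimal Python sketch of that idea, using made-up data (this is a simplified illustration, not the tool's actual code):

```python
# Simplified illustration of the matching idea (hypothetical data, not the
# tool's real implementation).

def filenames_for_reads(seq_sum_lines, wanted_ids):
    """Map wanted read IDs to fast5 filenames (summary: filename<TAB>read_id<TAB>...)."""
    files = set()
    for line in seq_sum_lines[1:]:          # skip header line
        cols = line.split('\t')
        if cols[1] in wanted_ids:
            files.add(cols[0])
    return files

def paths_for_files(index_lines, filenames):
    """Pair each wanted fast5 with the tarball listed above it in the index."""
    paths, current_tar = [], None
    for line in index_lines:
        if line.endswith('.tar'):
            current_tar = line                # remember the enclosing tarball
        elif line.endswith('.fast5') and line.split('/')[-1] in filenames:
            paths.append((current_tar, line))
    return paths

# Hypothetical example data
seq_sum = ["filename\tread_id\tlength",
           "read_a.fast5\tid_001\t4000",
           "read_b.fast5\tid_002\t1200"]
index = ["/data/run1/1.tar", "0/read_a.fast5", "0/read_b.fast5"]

wanted = filenames_for_reads(seq_sum, {"id_001"})
print(paths_for_files(index, wanted))   # [('/data/run1/1.tar', '0/read_a.fast5')]
```

With the tarball known for each file, only that one archive member needs extracting, rather than unpacking the whole experiment.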
42 | 43 | **fast5_fetcher** was built to address this bottleneck. By building an index file of the tarballs, and using the `sequencing_summary.txt` file to match readIDs with fast5 filenames, only the fast5 files you need can be extracted, either temporarily in a pipeline, or permanently, reducing space and simplifying downstream workflows. 44 | 45 | # Requirements 46 | 47 | Following a self-imposed guideline, most things written to handle nanopore data or bioinformatics in general will use as few third-party libraries as possible, aiming for only core libraries, or have all required files included in the package. 48 | 49 | In the case of `fast5_fetcher.py` and `batch_tater.py`, only core python libraries are used. So as long as **Python 2.7+** is present, everything should work with no extra steps. (Python 3 compatibility is coming in the next big update) 50 | 51 | ##### Operating system: 52 | 53 | There is one catch. Everything is written primarily for use with **Linux**. Since **MacOS** is Unix-based, so long as the GNU tools are installed (see below), there should be minimal issues running it. **Windows 10**, however, may require more massaging to work with the new Linux integration. 54 | 55 | # Getting Started 56 | 57 | Building an index of fast5 files and their paths, as well as a simple bash script to control the workflow, be it on a local machine or HPC, will depend on the starting file structure. 58 | 59 | ## File structures 60 | 61 | The file structure is not overly important, however it will modify some of the commands used in the examples. I have endeavoured to include a few diverse uses, starting from different file states, but of course, I can't think of everything, so if there is something you wish to accomplish with `fast5_fetcher.py`, but can't quite get it to work for you, let me know, and perhaps I can make it easier for you. 62 | 63 | #### 1. 
Raw structure (not preferred) 64 | 65 | This is the most basic structure, where all files are present in an accessible state. 66 | 67 | ├── huntsman.fastq 68 | ├── sequencing_summary.txt 69 | ├── huntsman_reads/ # Read folder 70 | │ ├── 0/ # individual folders containing ~4000 fast5s 71 | | | ├── huntsman_read1.fast5 72 | | | └── huntsman_read2.fast5 73 | | | └── ... 74 | | ├── 1/ 75 | | | ├── huntsman_read#.fast5 76 | | | └── ... 77 | └── ├── ... 78 | 79 | #### 2. Local basecalled structure 80 | 81 | This is the typical structure after local basecalling. 82 | The fastq and sequencing_summary files have been gzipped, and the folders in the reads folder have been tarballed into one large file. 83 | 84 | ├── huntsman.fastq.gz # gzipped 85 | ├── sequencing_summary.txt.gz # gzipped 86 | ├── huntsman_reads.tar # Tarballed read folder 87 | | # Tarball expanded 88 | |-->│ ├── 0/ # individual folders inside tarball 89 | | | ├── huntsman_read1.fast5 90 | | | └── huntsman_read2.fast5 91 | | | └── ... 92 | | ├── 1/ 93 | | | ├── huntsman_read#.fast5 94 | | | └── ... 95 | └── ├── ... 96 | 97 | #### 3. Parallel basecalled structure 98 | 99 | This structure results from massively parallel basecalling, and looks like multiples of the above structure. 100 | 101 | ├── fastq/ 102 | | ├── huntsman.1.fastq.gz 103 | | └── huntsman.2.fastq.gz 104 | | └── huntsman.3.fastq.gz 105 | | └── ... 106 | ├── logs/ 107 | | ├── sequencing_summary.1.txt.gz 108 | | └── sequencing_summary.2.txt.gz 109 | | └── sequencing_summary.3.txt.gz 110 | | └── ... 111 | ├── fast5/ 112 | | ├── 1.tar 113 | | └── 2.tar 114 | | └── 3.tar 115 | | └── ... 116 | 117 | With this structure, combining the `.fastq` and `sequencing_summary.txt.gz` files is needed. 
118 | 119 | ##### Combine fastq.gz files 120 | 121 | ```bash 122 | for file in fastq/*.fastq.gz; do cat $file; done >> huntsman.fastq.gz 123 | ``` 124 | 125 | ##### Combine sequencing_summary.txt.gz files 126 | 127 | ```bash 128 | # create header 129 | zcat $(ls logs/sequencing_summary*.txt.gz | head -1) | head -1 > sequencing_summary.txt 130 | 131 | # combine all files, skipping first line header 132 | for file in logs/sequencing_summary*.txt.gz; do zcat $file | tail -n +2; done >> sequencing_summary.txt 133 | 134 | gzip sequencing_summary.txt 135 | ``` 136 | 137 | You should then have something like this: 138 | 139 | ├── huntsman.fastq.gz # gzipped 140 | ├── sequencing_summary.txt.gz # gzipped 141 | ├── fast5/ # fast5 folder 142 | | ├── 1.tar # each tar contains ~4000 fast5 files 143 | | └── 2.tar 144 | | └── 3.tar 145 | | └── ... 146 | 147 | ## Inputs 148 | 149 | It takes 3 files as input: 150 | 151 | 1. fastq, paf, or flat (.gz) 152 | 2. sequencing_summary.txt(.gz) 153 | 3. name.index(.gz) 154 | 155 | #### 1. fastq, paf, or flat 156 | 157 | This is where the readIDs are collected, to be matched with their respective fast5 files for fetching. The idea is that some form of selection has occurred to generate these files. 158 | 159 | In the case of a **fastq**, it may be filtered for all the reads above a certain quality, or from a particular barcode after running barcode detection. 160 | 161 | A **paf** file is an alignment output of minimap2. This can be used to fetch only the fast5 files that align to some reference, or, after filtering the paf, only the reads that align to a particular region of interest. 162 | 163 | A **flat** file in this case is just a file that contains a list of readIDs, one on each line. This allows the user to generate a list of reads to fetch by any other desired method. 164 | 165 | Each of these files can be gzipped or not. 166 | 167 | See examples below for example test cases. 168 | 169 | #### 2. 
Sequencing summary 170 | 171 | The `sequencing_summary.txt` file is created by the basecalling software (Albacore, Guppy), and contains information about each read, including the readID and fast5 file name, along with length, quality scores, and potentially barcode information. 172 | 173 | There is a shortcut method in which you can use the `sequencing_summary.txt` only, without the need for a fastq, paf, or flat file. In this case, leave the `-q`, `-p`, `-f` fields empty. 174 | 175 | This file can be gzipped or not. 176 | 177 | #### 3. Building the index 178 | 179 | How the index is built depends on which file structure you are using. It will work with both tarred and un-tarred file structures. Tarred is preferred. 180 | 181 | ##### - Raw structure (not preferred) 182 | 183 | ```bash 184 | for file in $(pwd)/reads/*/*; do echo $file; done >> name.index 185 | 186 | gzip name.index 187 | ``` 188 | 189 | ##### - Local basecalled structure 190 | 191 | ```bash 192 | for file in $(pwd)/reads.tar; do echo $file; tar -tf $file; done >> name.index 193 | 194 | gzip name.index 195 | ``` 196 | 197 | ##### - Parallel basecalled structure 198 | 199 | ```bash 200 | for file in $(pwd)/fast5/*.tar; do echo $file; tar -tf $file; done >> name.index 201 | ``` 202 | 203 | If you have multiple experiments, then cat them all together and gzip. 
204 | 205 | ```bash 206 | for file in ./*.index; do cat $file; done >> ../all.name.index 207 | 208 | gzip all.name.index 209 | ``` 210 | 211 | ## Instructions for use 212 | 213 | Download the repository: 214 | 215 | git clone https://github.com/Psy-Fer/fast5_fetcher.git 216 | 217 | If using MacOS and you do NOT already have Homebrew, install it from here: 218 | 219 | https://brew.sh/ 220 | 221 | then install gnu-tar with: 222 | 223 | brew install gnu-tar 224 | 225 | ### Quick start 226 | 227 | Basic use on a local computer: 228 | 229 | **fastq** 230 | 231 | ```bash 232 | python fast5_fetcher.py -q my.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5 233 | ``` 234 | 235 | **paf** 236 | 237 | ```bash 238 | python fast5_fetcher.py -p my.paf -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5 239 | ``` 240 | 241 | **flat** 242 | 243 | ```bash 244 | python fast5_fetcher.py -f my_flat.txt.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5 245 | ``` 246 | 247 | **sequencing_summary.txt only** 248 | 249 | ```bash 250 | python fast5_fetcher.py -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5 251 | ``` 252 | 253 | See examples below for use on an **HPC** using **SGE**. 254 | 255 | ## fast5_fetcher.py 256 | 257 | #### Full usage 258 | 259 | usage: fast5_fetcher.py [-h] [-q FASTQ | -p PAF | -f FLAT] [--OSystem OSYSTEM] 260 | [-s SEQ_SUM] [-i INDEX] [-o OUTPUT] [-t] 261 | [-l TRIM_LIST] [-x PREFIX] [-z] 262 | 263 | fast_fetcher - extraction of specific nanopore fast5 files 264 | 265 | optional arguments: 266 | -h, --help show this help message and exit 267 | -q FASTQ, --fastq FASTQ 268 | fastq.gz for read ids 269 | -p PAF, --paf PAF paf alignment file for read ids 270 | -f FLAT, --flat FLAT flat file of read ids 271 | --OSystem OSYSTEM running operating system - leave default unless doing 272 | odd stuff 273 | -s SEQ_SUM, --seq_sum SEQ_SUM 274 | sequencing_summary.txt.gz file 275 | -i INDEX, --index INDEX 276 | index.gz file mapping fast5 files in tar archives 
277 | -o OUTPUT, --output OUTPUT 278 | output directory for extracted fast5s 279 | -t, --trim trim files as if standalone experiment, (fq, SS) 280 | -l TRIM_LIST, --trim_list TRIM_LIST 281 | list of file names to trim, comma separated. fastq 282 | only needed for -p and -f modes 283 | -x PREFIX, --prefix PREFIX 284 | trim file prefix, eg: barcode_01, output: 285 | barcode_01.fastq, barcode_01_seq_sum.txt 286 | -z, --pppp Print out tar commands in batches for further 287 | processing 288 | 289 | ## Examples 290 | 291 | Fast5 Fetcher was originally built to work with **Sun Grid Engine** (SGE), exploiting the heck out of array jobs. Although it can work locally and on untarred file structures, fast5_fetcher starts to make a real difference when operating on multiple sequencing experiments with file structures scattered across a file system. 292 | 293 | ### SGE examples 294 | 295 | After creating the fastq/paf/flat, sequencing_summary, and index files, create an SGE file. 296 | 297 | Note the use of `${SGE_TASK_ID}` to use the array job as the pointer to a particular file. 298 | 299 | #### After barcode demultiplexing 300 | 301 | Given a similar structure and naming convention, it is possible to group the fast5 files by barcode in the following manner. 302 | 303 | ├── BC_1.fastq.gz # Barcode 1 304 | ├── BC_2.fastq.gz # Barcode 2 305 | ├── BC_3.fastq.gz # ... 
    ├── BC_4.fastq.gz
    ├── BC_5.fastq.gz
    ├── BC_6.fastq.gz
    ├── BC_7.fastq.gz
    ├── BC_8.fastq.gz
    ├── BC_9.fastq.gz
    ├── BC_10.fastq.gz
    ├── BC_11.fastq.gz
    ├── BC_12.fastq.gz
    ├── unclassified.fastq.gz     # unclassified reads (skipped by fast5_fetcher in this example; rename to BC_13 to simply fold it into the example)
    ├── sequencing_summary.txt.gz # gzipped
    ├── barcoded.index.gz         # index file containing fast5 file paths
    ├── fast5/                    # fast5 folder, unsorted
    │   ├── 1.tar                 # each tar contains ~4000 fast5 files
    │   ├── 2.tar
    │   ├── 3.tar
    │   └── ...

#### fetch.sge

```bash
# Activate virtual python environment
# Most HPC systems use something like "module load"
source ~/work/venv2714/bin/activate

# Create an output directory on cluster-local storage to take advantage of NVMe drives
mkdir ${TMPDIR}/fast5

# Run fast5_fetcher on each barcode after demultiplexing
time python fast5_fetcher.py -q ./BC_${SGE_TASK_ID}.fastq.gz -s sequencing_summary.txt.gz -i barcoded.index.gz -o ${TMPDIR}/fast5/

# Tarball the extracted reads into a single tar file
# Can also split the reads into groups of ~4000 if needed
tar -cf ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5

# Copy from HPC local drives to the working dir
cp ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar ./
```

#### Create CMD and launch

```bash
# Current working dir, with 1 CPU, array jobs 1 to 12
# Modify memory settings as required
CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_vmem=20G,tmp_requested=500G ./fetch.sge"

echo $CMD && $CMD
```

## Trimming fastq and sequencing_summary files

By using the `-t, --trim` option, each barcode will also have its own sequencing_summary file for downstream analysis.
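As a sketch (the file names here follow the hypothetical barcode layout above), trimming barcode 1 with a labelled prefix might look like:

```bash
# Extract the fast5s for barcode 1; with -t and -x, also write the trimmed
# barcode_01.fastq and barcode_01_seq_sum.txt for just that barcode
python fast5_fetcher.py -q BC_1.fastq.gz -s sequencing_summary.txt.gz \
    -i barcoded.index.gz -o ./fast5 -t -x barcode_01
```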
This is particularly useful if each barcode is a different sample or experiment, as the output is as if it were its own individual flowcell.

This method can also trim fastq and sequencing_summary files when using the **paf** or **flat** methods. By using the prefix option you can label the output names; otherwise generic defaults will be used.

## batch_tater.py

Potato scripting engaged

This is designed to run on the output files from `fast5_fetcher.py` using option `-z`, which writes out a file list for each tarball that contains reads you want to process. `batch_tater.py` then reads those lists to open the individual tar files and extract the reads, meaning each tar file is only opened once.

A recent test using the `-z` option on ~2.2 TB of data, selecting ~11 million of ~27 million files, took about 10 min (1 CPU) to write and organise the file lists with fast5_fetcher.py, and about 20 s per array job to extract and repackage with batch_tater.py.

This is best used when you want to process everything at once and filter your reads. Other approaches may be better when you are demultiplexing.

#### Usage:

Run on SGE using array jobs as a hacky way of doing multiprocessing.
This also helps to check when things go wrong, and makes it easy to relaunch failed jobs.

#### batch.sge

```bash
source ~/work/venv2714/bin/activate

FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p)
BLAH=fast5/${FILE}

mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
echo "extraction complete!" >&2
echo "Number of files:" >&2
ls ${TMPDIR}/fast5/ | wc -l >&2

echo "copying data..." >&2

tar -cf ${TMPDIR}/batch.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
cp ${TMPDIR}/batch.${SGE_TASK_ID}.tar ./batched_fast5/
```

#### Create CMD and launch

```bash
CMD="qsub -cwd -V -pe smp 1 -N batch -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=200G ../batch.sge"

echo $CMD && $CMD
```

## Acknowledgements

I would like to thank the rest of my lab (Shaun Carswell, Kirston Barton, Kai Martin) in the Genomic Technologies team at the [Garvan Institute](https://www.garvan.org.au/) for their feedback on the development of this tool.

## Cite

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1413903.svg)](https://doi.org/10.5281/zenodo.1413903)

James M. Ferguson, & Martin A. Smith. (2018, September 12). Psy-Fer/fast5_fetcher: Initial release of fast5_fetcher (Version v1.0). Zenodo.

## License

[The MIT License](https://opensource.org/licenses/MIT)