├── images └── fetch.jpg ├── .github └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── LICENSE ├── batch_tater.py ├── fast5_fetcher.py └── README.md /images/fetch.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Psy-Fer/fast5_fetcher/HEAD/images/fetch.jpg -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | 5 | --- 6 | 7 | **Describe the bug** 8 | A clear and concise description of what the bug is. 9 | 10 | **To Reproduce** 11 | Steps to reproduce the behavior: 12 | 13 | **Expected behavior** 14 | A clear and concise description of what you expected to happen. 15 | 16 | **Screenshots** 17 | If applicable, add screenshots to help explain your problem. 18 | 19 | **Desktop (please complete the following information):** 20 | - OS: [e.g. MacOS, Ubuntu) 21 | 22 | **Additional context** 23 | Add any other context about the problem here. 24 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | 5 | --- 6 | 7 | **Is your feature request related to a problem? Please describe.** 8 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 9 | 10 | **Describe the solution you'd like** 11 | A clear and concise description of what you want to happen. 12 | 13 | **Describe alternatives you've considered** 14 | A clear and concise description of any alternative solutions or features you've considered. 15 | 16 | **Additional context** 17 | Add any other context or screenshots about the feature request here. 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 James Ferguson 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /batch_tater.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import subprocess 3 | ''' 4 | Potato scripting engaged. 5 | 6 | James M. Ferguson (j.ferguson@garvan.org.au) 7 | Genomic Technologies 8 | Garvan Institute 9 | Copyright 2017 10 | 11 | batch_tater.py takes list/s of files to extract, and speeds it up a bit, by only opening 12 | one tar file at a time and extracting what is needed. 13 | 14 | To run on sun grid engine using array jobs as a hacky way of doing multiprocessing. 15 | Also, helps check when things go wrong, and easy to relaunch failed jobs. 16 | Some things left in from running on some tasty nanopore single cell data. 17 | 18 | sge file: 19 | 20 | source ~/work/venv2714/bin/activate 21 | 22 | FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p) 23 | BLAH=fast5/${FILE} 24 | 25 | mkdir ${TMPDIR}/fast5 26 | 27 | time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/ 28 | 29 | echo "size of files:" >&2 30 | du -shc ${TMPDIR}/fast5/ >&2 31 | echo "extraction complete!" >&2 32 | echo "Number of files:" >&2 33 | ls ${TMPDIR}/fast5/ | wc -l >&2 34 | 35 | echo "copying data..." >&2 36 | 37 | tar -cf ${TMPDIR}/f5f.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5 38 | cp ${TMPDIR}/f5f.${SGE_TASK_ID}.tar ./clean_f5s/ 39 | 40 | CMD: 41 | 42 | CMD="qsub -cwd -V -pe smp 1 -N batchCln -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=20G ../batch.sge" 43 | 44 | Launch: 45 | 46 | echo $CMD && $CMD 47 | 48 | 49 | stats: 50 | 51 | fastq: 27491304 52 | mapped: 11740093 53 | z mode time: 10min 54 | batch_tater total time: 21min 55 | per job time: ~28s 56 | number of CPUs: 100 57 | ''' 58 | 59 | # being lazy and using sys.argv...i mean, it is pretty lit 60 | master = sys.argv[1] 61 | tar_list = sys.argv[2] 62 | save_path = sys.argv[3] 63 | 64 | # this will probs need to be changed based on naming convention 65 | # I think i was a little tired when I wrote this 66 | list_name = tar_list.split('/')[-1] 67 | 68 | PATH = 0 69 | 70 | # not elegent, but gets it done 71 | with open(master, 'r') as f: 72 | for l in f: 73 | l = l.strip('\n') 74 | l = l.split('\t') 75 | if l[0] == list_name: 76 | PATH = l[1] 77 | break 78 | 79 | # for stats later and easy job relauncing 80 | print >> sys.stderr, "extracting:", tar_list 81 | # do the thing. That --transform hack is awesome. Blows away all the leading folders. 82 | if PATH: 83 | cmd = "tar -xf {} --transform='s/.*\///' -C {} -T {}".format( 84 | PATH, save_path, tar_list) 85 | subprocess.call(cmd, shell=True, executable='/bin/bash') 86 | 87 | else: 88 | print >> sys.stderr, "PATH not found! check index nooblet" 89 | print >> sys.stderr, "inputs:", master, tar_list, save_path 90 | -------------------------------------------------------------------------------- /fast5_fetcher.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import gzip 4 | import io 5 | import subprocess 6 | import traceback 7 | import argparse 8 | import platform 9 | from functools import partial 10 | ''' 11 | 12 | James M. Ferguson (j.ferguson@garvan.org.au) 13 | Genomic Technologies 14 | Garvan Institute 15 | Copyright 2017 16 | 17 | fast5_fetcher is designed to help manage fast5 file data storage and organisation. 18 | It takes 3 files as input: fastq/paf/flat, sequencing_summary, index 19 | 20 | -------------------------------------------------------------------------------------- 21 | version 0.0 - initial 22 | version 0.2 - added argparser and buffered gz streams 23 | version 0.3 - added paf input 24 | version 0.4 - added read id flat file input 25 | version 0.5 - pppp print output instead of extracting 26 | version 0.6 - did a dumb. changed x in s to set/dic entries O(n) vs O(1) 27 | version 0.7 - cleaned up a bit to share and removed some hot and steamy features 28 | version 0.8 - Added functionality for un-tarred file structures and seq_sum only 29 | version 1.0 - First release 30 | version 1.1 - refactor with dicswitch and batch_tater updates 31 | version 1.1.1 - Bug fix on --transform method, added OS detection 32 | version 1.2.0 - Added file trimming to fully segment selection 33 | 34 | TODO: 35 | - Python 3 compatibility 36 | - autodetect file structures 37 | - autobuild index file - make it a sub script as well 38 | - Consider using csv.DictReader() instead of wheel building 39 | - flesh out batch_tater and give better examples and clearer how-to 40 | - options to build new index of fetched fast5s 41 | 42 | ----------------------------------------------------------------------------- 43 | MIT License 44 | 45 | Copyright (c) 2017 James Ferguson 46 | 47 | Permission is hereby granted, free of charge, to any person obtaining a copy 48 | of this software and associated documentation files (the "Software"), to deal 49 | in the Software without restriction, including without limitation the rights 50 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 51 | copies of the Software, and to permit persons to whom the Software is 52 | furnished to do so, subject to the following conditions: 53 | MyParser 54 | The above copyright notice and this permission notice shall be included in all 55 | copies or substantial portions of the Software. 56 | 57 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 58 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 59 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 60 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 61 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 62 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 63 | SOFTWARE. 64 | ''' 65 | 66 | 67 | class MyParser(argparse.ArgumentParser): 68 | def error(self, message): 69 | sys.stderr.write('error: %s\n' % message) 70 | self.print_help() 71 | sys.exit(2) 72 | 73 | 74 | def main(): 75 | ''' 76 | do the thing 77 | ''' 78 | parser = MyParser( 79 | description="fast_fetcher - extraction of specific nanopore fast5 files") 80 | group = parser.add_mutually_exclusive_group() 81 | group.add_argument("-q", "--fastq", 82 | help="fastq.gz for read ids") 83 | group.add_argument("-p", "--paf", 84 | help="paf alignment file for read ids") 85 | group.add_argument("-f", "--flat", 86 | help="flat file of read ids") 87 | parser.add_argument("--OSystem", default=platform.system(), 88 | help="running operating system - leave default unless doing odd stuff") 89 | parser.add_argument("-s", "--seq_sum", 90 | help="sequencing_summary.txt.gz file") 91 | parser.add_argument("-i", "--index", 92 | help="index.gz file mapping fast5 files in tar archives") 93 | parser.add_argument("-o", "--output", 94 | help="output directory for extracted fast5s") 95 | parser.add_argument("-t", "--trim", action="store_true", 96 | help="trim files as if standalone experiment, (fq, SS)") 97 | parser.add_argument("-l", "--trim_list", 98 | help="list of file names to trim, comma separated. fastq only needed for -p and -f modes") 99 | parser.add_argument("-x", "--prefix", default="default", 100 | help="trim file prefix, eg: barcode_01, output: barcode_01.fastq, barcode_01_seq_sum.txt") 101 | # parser.add_argument("-t", "--procs", type=int, 102 | # help="Number of CPUs to use - TODO: NOT YET IMPLEMENTED") 103 | parser.add_argument("-z", "--pppp", action="store_true", 104 | help="Print out tar commands in batches for further processing") 105 | args = parser.parse_args() 106 | 107 | # print help if no arguments given 108 | if len(sys.argv) == 1: 109 | parser.print_help(sys.stderr) 110 | sys.exit(1) 111 | 112 | print >> sys.stderr, "Starting things up!" 113 | 114 | p_dic = {} 115 | if args.pppp: 116 | print >> sys.stderr, "PPPP state! Not extracting, exporting tar commands" 117 | 118 | trim_pass = False 119 | if args.trim: 120 | SS = False 121 | FQ = False 122 | if args.trim_list: 123 | A = args.trim_list.split(',') 124 | for a in A: 125 | if "fastq" in a: 126 | FQ = a 127 | elif "txt" in a: 128 | SS = a 129 | else: 130 | print >> sys.stderr, "Unknown trim input. detects 'fastq' or 'txt' for files. Input:", a 131 | else: 132 | print >> sys.stderr, "No extra files given. Compatible with -q fastq input only" 133 | 134 | if args.fastq: 135 | FQ = args.fastq 136 | if args.seq_sum: 137 | SS = args.seq_sum 138 | 139 | # final check 140 | if FQ and SS: 141 | trim_pass = True 142 | print >> sys.stderr, "Trim setting detected. Writing to working direcory" 143 | else: 144 | print >> sys.stderr, "Unable to verify both fastq and sequencing_summary files. Please check filenames and try again. Exiting..." 145 | sys.exit() 146 | 147 | ids = [] 148 | if args.fastq: 149 | ids = get_fq_reads(args.fastq) 150 | if trim_pass: 151 | trim_SS(args, ids, SS) 152 | elif args.paf: 153 | ids = get_paf_reads(args.paf) 154 | if trim_pass: 155 | trim_both(args, ids, FQ, SS) 156 | elif args.flat: 157 | ids = get_flat_reads(args.flat) 158 | if trim_pass: 159 | trim_both(args, ids, FQ, SS) 160 | if not ids and trim_pass: 161 | filenames, ids = get_filenames(args.seq_sum, ids) 162 | trim_both(args, ids, FQ, SS) 163 | else: 164 | filenames, ids = get_filenames(args.seq_sum, ids) 165 | 166 | paths = get_paths(args.index, filenames) 167 | print >> sys.stderr, "extracting..." 168 | # place multiprocessing pool here 169 | for p, f in paths: 170 | if args.pppp: 171 | if p in p_dic: 172 | p_dic[p].append(f) 173 | else: 174 | p_dic[p] = [f] 175 | continue 176 | else: 177 | try: 178 | extract_file(args, p, f) 179 | except: 180 | traceback.print_exc() 181 | print >> sys.stderr, "Failed to extract:", p, f 182 | # For each .tar file, write a file with the tarball name as filename.tar.txt 183 | # and contains a list of files to extract - input for batch_tater.py 184 | if args.pppp: 185 | with open("tater_master.txt", 'w') as m: 186 | for i in p_dic: 187 | fname = "tater_" + i.split('/')[-1] + ".txt" 188 | m_entry = "{}\t{}".format(fname, i) 189 | fname = args.output + "/tater_" + i.split('/')[-1] + ".txt" 190 | m.write(m_entry) 191 | m.write('\n') 192 | with open(fname, 'w') as f: 193 | for j in p_dic[i]: 194 | f.write(j) 195 | f.write('\n') 196 | 197 | print >> sys.stderr, "done!" 198 | 199 | 200 | def dicSwitch(i): 201 | ''' 202 | A switch to handle file opening and reduce duplicated code 203 | ''' 204 | open_method = { 205 | "gz": gzip.open, 206 | "norm": open 207 | } 208 | return open_method[i] 209 | 210 | 211 | def get_fq_reads(fastq): 212 | ''' 213 | read fastq file and extract read ids 214 | quick and dirty to limit library requirements - still bullet fast 215 | ''' 216 | c = 0 217 | read_ids = set() 218 | if fastq.endswith('.gz'): 219 | f_read = dicSwitch('gz') 220 | else: 221 | f_read = dicSwitch('norm') 222 | with f_read(fastq, 'rb') as fq: 223 | if fastq.endswith('.gz'): 224 | fq = io.BufferedReader(fq) 225 | for line in fq: 226 | c += 1 227 | line = line.strip('\n') 228 | if c == 1: 229 | idx = line.split()[0][1:] 230 | read_ids.add(idx) 231 | elif c >= 4: 232 | c = 0 233 | return read_ids 234 | 235 | 236 | def get_paf_reads(reads): 237 | ''' 238 | Parse paf file to pull read ids (from minimap2 alignment) 239 | ''' 240 | read_ids = set() 241 | if reads.endswith('.gz'): 242 | f_read = dicSwitch('gz') 243 | else: 244 | f_read = dicSwitch('norm') 245 | with f_read(reads, 'rb') as fq: 246 | if reads.endswith('.gz'): 247 | fq = io.BufferedReader(fq) 248 | for line in fq: 249 | line = line.strip('\n') 250 | line = line.split() 251 | read_ids.add(line[0]) 252 | return read_ids 253 | 254 | 255 | def get_flat_reads(filename): 256 | ''' 257 | Parse a flat file separated by line breaks \n 258 | TODO: make @ symbol check once, as they should all be the same 259 | ''' 260 | read_ids = set() 261 | check = True 262 | if filename.endswith('.gz'): 263 | f_read = dicSwitch('gz') 264 | else: 265 | f_read = dicSwitch('norm') 266 | with f_read(filename, 'rb') as fq: 267 | if filename.endswith('.gz'): 268 | fq = io.BufferedReader(fq) 269 | for line in fq: 270 | line = line.strip('\n') 271 | if check: 272 | if line[0] == '@': 273 | x = 1 274 | else: 275 | x = 0 276 | check = False 277 | idx = line[x:] 278 | read_ids.add(idx) 279 | return read_ids 280 | 281 | 282 | def trim_SS(args, ids, SS): 283 | ''' 284 | Trims the sequencing_summary.txt file to only the input IDs 285 | ''' 286 | if args.prefix: 287 | pre = args.prefix + "_seq_sum.txt" 288 | else: 289 | pre = "trimmed_seq_sum.txt" 290 | head = True 291 | if SS.endswith('.gz'): 292 | f_read = dicSwitch('gz') 293 | else: 294 | f_read = dicSwitch('norm') 295 | # make this compatible with dicSwitch 296 | with open(pre, "w") as w: 297 | with f_read(SS, 'rb') as sz: 298 | if SS.endswith('.gz'): 299 | sz = io.BufferedReader(sz) 300 | for line in sz: 301 | if head: 302 | w.write(line) 303 | head = False 304 | continue 305 | l = line.split() 306 | if l[1] in ids: 307 | w.write(line) 308 | 309 | 310 | def trim_both(args, ids, FQ, SS): 311 | ''' 312 | Trims the sequencing_summary.txt and fastq files to only the input IDs 313 | ''' 314 | # trim the SS 315 | trim_SS(args, ids, SS) 316 | if args.prefix: 317 | pre = args.prefix + ".fastq" 318 | else: 319 | pre = "trimmed.fastq" 320 | 321 | # trim the fastq 322 | c = 0 323 | P = False 324 | if FQ.endswith('.gz'): 325 | f_read = dicSwitch('gz') 326 | else: 327 | f_read = dicSwitch('norm') 328 | with open(pre, "w") as w: 329 | with f_read(FQ, 'rb') as fq: 330 | if FQ.endswith('.gz'): 331 | fq = io.BufferedReader(fq) 332 | for line in fq: 333 | c += 1 334 | if c == 1: 335 | if line.split()[0][1:] in ids: 336 | P = True 337 | w.write(line) 338 | elif P and c < 4: 339 | w.write(line) 340 | elif c >= 4: 341 | if P: 342 | w.write(line) 343 | c = 0 344 | P = False 345 | 346 | 347 | def get_filenames(seq_sum, ids): 348 | ''' 349 | match read ids with seq_sum to pull filenames 350 | ''' 351 | # for when using seq_sum for filtering, and not fq,paf,flat 352 | ss_only = False 353 | if not ids: 354 | ss_only = True 355 | ids = set() 356 | head = True 357 | files = set() 358 | if seq_sum.endswith('.gz'): 359 | f_read = dicSwitch('gz') 360 | else: 361 | f_read = dicSwitch('norm') 362 | with f_read(seq_sum, 'rb') as sz: 363 | if seq_sum.endswith('.gz'): 364 | sz = io.BufferedReader(sz) 365 | for line in sz: 366 | if head: 367 | head = False 368 | continue 369 | line = line.strip('\n') 370 | line = line.split() 371 | if ss_only: 372 | files.add(line[0]) 373 | ids.add(line[1]) 374 | else: 375 | if line[1] in ids: 376 | files.add(line[0]) 377 | return files, ids 378 | 379 | 380 | def get_paths(index_file, filenames, f5=None): 381 | ''' 382 | Read index and extract full paths for file extraction 383 | ''' 384 | tar = False 385 | paths = [] 386 | c = 0 387 | if index_file.endswith('.gz'): 388 | f_read = dicSwitch('gz') 389 | else: 390 | f_read = dicSwitch('norm') 391 | # detect normal or tars 392 | with f_read(index_file, 'rb') as idz: 393 | if index_file.endswith('.gz'): 394 | idz = io.BufferedReader(idz) 395 | for line in idz: 396 | line = line.strip('\n') 397 | c += 1 398 | if c > 10: 399 | break 400 | if line.endswith('.tar'): 401 | tar = True 402 | break 403 | # extract paths 404 | with f_read(index_file, 'rb') as idz: 405 | if index_file.endswith('.gz'): 406 | idz = io.BufferedReader(idz) 407 | for line in idz: 408 | line = line.strip('\n') 409 | if tar: 410 | if line.endswith('.tar'): 411 | path = line 412 | elif line.endswith('.fast5'): 413 | f = line.split('/')[-1] 414 | if f in filenames: 415 | paths.append([path, line]) 416 | else: 417 | continue 418 | else: 419 | if line.endswith('.fast5'): 420 | f = line.split('/')[-1] 421 | if f in filenames: 422 | paths.append(['', line]) 423 | else: 424 | continue 425 | 426 | return paths 427 | 428 | 429 | def extract_file(args, path, filename): 430 | ''' 431 | Do the extraction. 432 | I was using the tarfile python lib, but honestly, it sucks and was too volatile. 433 | if you have a better suggestion, let me know :) 434 | That --transform hack is awesome btw. Blows away all the leading folders. use 435 | cp for when using untarred structures. Not recommended, but here for completeness. 436 | 437 | --transform not working on MacOS. Need to use gtar 438 | Thanks to Kai Martin for picking that one up! 439 | 440 | ''' 441 | OSystem = "" 442 | OSystem = args.OSystem 443 | save_path = args.output 444 | if path.endswith('.tar'): 445 | if OSystem in ["Linux", "Windows"]: 446 | cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format( 447 | path, save_path, filename) 448 | elif OSystem == "Darwin": 449 | cmd = "gtar -xf {} --transform='s/.*\///' -C {} {}".format( 450 | path, save_path, filename) 451 | else: 452 | print >> sys.stderr, "Unsupported OSystem, trying Tar anyway, OS:", OSystem 453 | cmd = "tar -xf {} --transform='s/.*\///' -C {} {}".format( 454 | path, save_path, filename) 455 | else: 456 | cmd = "cp {} {}".format(filename, save_path) 457 | subprocess.call(cmd, shell=True, executable='/bin/bash') 458 | 459 | 460 | if __name__ == '__main__': 461 | main() 462 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # fast5_fetcher 2 | 3 | #### Doing the heavy lifting for you. 4 | 5 |
