├── README.md
├── aggregate_stats.py
├── copyunprocessedfiles.py
├── dups.py
├── go_albacore.sh
├── launchbasecalling.py
├── moveprocessedfiles.py
├── reportflowcells.py
└── stageflowcells.py

/README.md:
--------------------------------------------------------------------------------
# nanopore-basecalling-scripts

Nick Loman
19th June 2017

## Introduction

Some simple scripts to ease management and local basecalling of millions of FAST5 files.

These scripts are designed to help with the following (common) occurrences:

* Albacore crashing, the disk filling or the power going out during a basecalling run, when you wish to start back where you left off.
* Live basecalling on a server while files are being synchronised over the network in real time from a MinKNOW PC.
* Directories getting muddled with the results of multiple sequencing runs from different flowcells.

## Basic usage

The scripts work with three main directories:

* ``data`` - the directory (including subfolders) where reads are uploaded to
* ``staging`` - a directory that basecalling is run from, organised by flowcell ID
* ``basecalls`` - the final results directory with the basecalls from Albacore

To run the scripts, we suggest the following pipeline:

### Stage files

This step will make a symbolic link in ``staging`` to every file that needs to be processed. It won't stage files that have already been basecalled (as determined by their file name):

``python stageflowcells.py data basecalls staging``

### Basecall

Basecall as normal with Albacore, substituting $flowcell as appropriate:

``read_fast5_basecaller.py -r -i staging/$flowcell -s basecalls/$flowcell ...``

### Live Basecalling

If synchronising from the MinKNOW PC to a server, you can run stageflowcells.py and then Albacore in a loop, nuking the staging directory each time, i.e.:

    rm -rf staging
    python stageflowcells.py data basecalls staging
    read_fast5_basecaller.py -r -i staging/$flowcell -s basecalls/$flowcell ...

## How to sync to a server

We like to use ``rsync`` on the MinKNOW laptop. Mac and Linux machines will have ``rsync`` installed already; on Windows PCs we use Cygwin.

We typically use a recipe like this to transfer all reads matching ``*.fast5`` into the data directory, over an SSH connection:

- Start a new Cygwin window

- Change directory to c:\data\reads, e.g.

  ``cd /cygdrive/c/data/reads``

- To rsync on a loop, run the following, replacing ``USER``, ``SERVER`` and ``/REMOTE/DIRECTORY/data``:

      while true;
      do
        rsync -vr --remove-source-files --include "*.fast5" --include "*/" --exclude "*" . USER@SERVER:/REMOTE/DIRECTORY/data
        sleep 5 ;
      done

- ``--remove-source-files`` will remove the FAST5 files after they are transferred! Useful if you want to stop the local MinKNOW PC hard disk from filling up.

Don't use that flag if you want to keep a local copy. In that case, move the files elsewhere from time to time, or the directory will get very full and you will end up with a hard-to-manage mix of files from different runs.
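
If you would rather drive the transfer from Python (say, to add logging or retries), a minimal sketch of the same loop is below. It assumes Python 3.5+ for ``subprocess.run`` and uses the same hypothetical ``USER``/``SERVER`` placeholders as the recipe above:

    import subprocess
    import time

    # Placeholder destination, as in the rsync recipe above -- replace before use.
    REMOTE = "USER@SERVER:/REMOTE/DIRECTORY/data"

    while True:
        # Identical flags to the shell recipe; drop --remove-source-files
        # if you want to keep a local copy of the reads.
        result = subprocess.run(
            ["rsync", "-vr", "--remove-source-files",
             "--include", "*.fast5", "--include", "*/", "--exclude", "*",
             ".", REMOTE])
        if result.returncode != 0:
            print("rsync exited with %d; retrying" % result.returncode)
        time.sleep(5)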
### Alternative sync methods

An alternative approach suggested by Mick Watson is to copy to a remote server via a network share (e.g. a Samba share) and use Robocopy.exe on Windows.

## Related projects

* Please see Alexis Lucattini's [Poreduck](https://github.com/alexiswl/poreduck) for another take on this problem.

--------------------------------------------------------------------------------
/aggregate_stats.py:
--------------------------------------------------------------------------------
import pandas as pd
import sys
import glob

frames = []

for sample in sys.argv[1:]:
    # Collect the archived reports plus the current summary for this sample.
    reports = glob.glob("%s/reports/seque*" % (sample,))
    reports.append("%s/sequencing_summary.txt" % (sample,))
    print(reports)

    for report in reports:
        try:
            df = pd.read_csv(report, sep="\t")
        except (IOError, ValueError):
            # skip missing or unparseable reports
            continue

        df['sample'] = sample
        frames.append(df)

# One big table of all per-read summaries, tagged by sample.
master_df = pd.concat(frames)
master_df.to_csv('aggregate_stats.txt', sep="\t")

--------------------------------------------------------------------------------
/copyunprocessedfiles.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import sys
import shutil

input_dir = sys.argv[1]
albacore_dir = sys.argv[2]
process_dir = sys.argv[3]

basecalled_files = set()

# File names that Albacore has already basecalled...
for root, dirs, files in os.walk(albacore_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

# ...and files that have already been staged: don't copy either.
for root, dirs, files in os.walk(process_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

for root, dirs, files in os.walk(input_dir, topdown=False):
    for name in files:
        if name not in basecalled_files:
            albacore_root = root[len(input_dir):]
            # copy it, preserving the relative subdirectory
            checkdir = process_dir + '/' + albacore_root
            if not os.path.exists(checkdir):
                os.makedirs(checkdir)
            movefrom = input_dir + '/' + albacore_root + '/' + name
            moveto = process_dir + '/' + albacore_root + '/' + name
            print("Copy %s to %s" % (movefrom, moveto))
            shutil.copy(movefrom, moveto)

--------------------------------------------------------------------------------
/dups.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import argparse

def run(args):
    seen_files = set()

    for root, dirs, files in os.walk(args.dir, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                if name in seen_files:
                    # root already includes args.dir, so join with root only
                    fn = os.path.join(root, name)
                    print(fn)
                    os.unlink(fn)
                seen_files.add(name)

parser = argparse.ArgumentParser(description='Remove duplicate FAST5 files by name.')
parser.add_argument('dir', help='Directory to remove duplicates from')

args = parser.parse_args()
run(args)
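
A note on dups.py: it treats two FAST5 files as duplicates purely by basename and keeps whichever copy the walk reaches first. If you want reassurance before unlinking anything, a hedged variant (not part of this repo; ``data`` is a placeholder directory) could first check that the contents really are identical:

    import hashlib
    import os

    def file_md5(path, blocksize=1 << 20):
        # Chunked read keeps memory flat for large FAST5 files.
        h = hashlib.md5()
        with open(path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(blocksize), b''):
                h.update(chunk)
        return h.hexdigest()

    seen = {}  # basename -> (path, md5) of the first copy encountered

    for root, dirs, files in os.walk('data'):  # 'data' is a placeholder
        for name in files:
            if not name.endswith('.fast5'):
                continue
            fn = os.path.join(root, name)
            if name in seen:
                first_path, first_md5 = seen[name]
                if file_md5(fn) == first_md5:
                    print("duplicate of %s: %s" % (first_path, fn))
                    os.unlink(fn)
                # same name but different content is left alone
            else:
                seen[name] = (fn, file_md5(fn))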

--------------------------------------------------------------------------------
/go_albacore.sh:
--------------------------------------------------------------------------------
#!/bin/bash
tag=$1

#for var in "$@"
#do
#  python copyunprocessedfiles.py data/"$var" basecalls/"$tag" staging/"$tag"
#done

python moveprocessedfiles.py staging/"$tag" basecalls/"$tag"/workspace processed/"$tag"_processed
time read_fast5_basecaller.py --input staging/"$tag" --worker_threads 32 -c r94_450bps_linear.cfg -s basecalls/"$tag" -r -o fast5

--------------------------------------------------------------------------------
/launchbasecalling.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

import sys
from dask import compute, delayed
import dask.multiprocessing
import os
import os.path
import datetime
import shutil

staging_dir = sys.argv[1]
gridion_basecalls = sys.argv[2]

class MyTask:
    def __init__(self, staging, basecalls, dirname):
        self.staging = staging
        self.basecalls = basecalls
        self.dirname = dirname

def process(t):
    datstr = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M:%S")

    # Archive the reports from any previous Albacore run on this directory,
    # time-stamped, so a restart doesn't overwrite them.
    outdir = "%s/%s" % (t.basecalls, t.dirname)
    for fn in ['configuration.cfg', 'pipeline.log', 'sequencing_summary.txt']:
        if os.path.exists(outdir + '/' + fn):
            if not os.path.exists(outdir + '/reports'):
                os.makedirs(outdir + '/reports')
            shutil.move(outdir + '/' + fn, outdir + '/reports/' + fn + '-' + datstr)

    cmd = ("read_fast5_basecaller.py -c r95_450bps_linear.cfg -i %s/%s -s %s/%s -t 12 -r -o fast5,fastq" % (t.staging, t.dirname, t.basecalls, t.dirname))
    os.system(cmd)

dirs = os.listdir(staging_dir)
#for d in dirs:
#    process(MyTask(staging_dir, gridion_basecalls, d))

# Basecall each staged directory in parallel via dask's multiprocessing scheduler.
values = [delayed(process)(MyTask(staging_dir, gridion_basecalls, x)) for x in dirs]
results = compute(*values, get=dask.multiprocessing.get)

--------------------------------------------------------------------------------
/moveprocessedfiles.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import sys
import shutil

input_dir = sys.argv[1]
albacore_dir = sys.argv[2]
process_dir = sys.argv[3]

basecalled_files = set()

# Any file name present under the Albacore output directory has already
# been basecalled.
for root, dirs, files in os.walk(albacore_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

for root, dirs, files in os.walk(input_dir, topdown=False):
    for name in files:
        if name in basecalled_files:
            albacore_root = root[len(input_dir):]
            # move it, preserving the relative subdirectory
            checkdir = process_dir + '/' + albacore_root
            if not os.path.exists(checkdir):
                os.makedirs(checkdir)
            movefrom = input_dir + '/' + albacore_root + '/' + name
            moveto = process_dir + '/' + albacore_root + '/' + name
            print("Move %s to %s" % (movefrom, moveto))
            shutil.move(movefrom, moveto)
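
To make the path bookkeeping in copyunprocessedfiles.py and moveprocessedfiles.py concrete, here is a small worked example (hypothetical paths and file name) of how each file's relative subdirectory under the source is preserved under the destination:

    import os

    input_dir = "staging/FAH12345"        # hypothetical arguments
    process_dir = "processed/FAH12345"
    root = os.path.join(input_dir, "0")   # a subdirectory as yielded by os.walk
    name = "read_ch42_read1066.fast5"     # hypothetical file name

    # Strip the input_dir prefix to recover the relative subdirectory ("/0"),
    # then re-root it under process_dir.
    albacore_root = root[len(input_dir):]
    moveto = process_dir + '/' + albacore_root + '/' + name
    print(moveto)  # processed/FAH12345//0/read_ch42_read1066.fast5
                   # (the doubled slash is harmless on POSIX paths)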

--------------------------------------------------------------------------------
/reportflowcells.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import sys
from collections import defaultdict
import re

stats = defaultdict(int)

# Count reads per flowcell (falling back to the sample name) from a list
# of FAST5 file names supplied on stdin.
for name in sys.stdin:
    flowcell = ''
    samplename = ''
    f = re.search(r'_2\d{7}_(F.*?)_', name)
    if f:
        flowcell = f.group(1)
    f = re.search(r'_(sequencing_run|mux_scan)_(.*)_\d+_read', name)
    if f:
        samplename = f.group(2)

    if flowcell:
        directory_name = flowcell
    else:
        directory_name = samplename
    stats[directory_name] += 1

for k, v in stats.items():
    print("%s\t%s" % (k, v))

--------------------------------------------------------------------------------
/stageflowcells.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
from os.path import join, islink
import sys
import re
import argparse
import time

try:
    from os import scandir
except ImportError:
    from scandir import scandir

# Modified version of scandir.walk that doesn't stat() symlinks to check if they are directories

def walk(top, topdown=True, onerror=None, followlinks=False):
    """Like Python 3.5's implementation of os.walk() -- faster than
    the pre-Python 3.5 version as it uses scandir() internally.
    """
    dirs = []
    nondirs = []

    # We may not have read permission for top, in which case we can't
    # get a list of the files the directory contains.  os.walk
    # always suppressed the exception then, rather than blow up for a
    # minor reason when (say) a thousand readable directories are still
    # left to visit.  That logic is copied here.
    try:
        scandir_it = scandir(top)
    except OSError as error:
        if onerror is not None:
            onerror(error)
        return

    while True:
        try:
            try:
                entry = next(scandir_it)
            except StopIteration:
                break
        except OSError as error:
            if onerror is not None:
                onerror(error)
            return

        try:
            is_dir = entry.is_dir(follow_symlinks=False)
        except OSError:
            # If is_dir() raises an OSError, consider that the entry is not
            # a directory, same behaviour as os.path.isdir().
            is_dir = False

        if is_dir:
            dirs.append(entry.name)
        else:
            nondirs.append(entry.name)

        if not topdown and is_dir:
            # Bottom-up: recurse into sub-directory, but exclude symlinks to
            # directories if followlinks is False
            if followlinks:
                walk_into = True
            else:
                try:
                    is_symlink = entry.is_symlink()
                except OSError:
                    # If is_symlink() raises an OSError, consider that the
                    # entry is not a symbolic link, same behaviour as
                    # os.path.islink().
                    is_symlink = False
                walk_into = not is_symlink

            if walk_into:
                for entry in walk(entry.path, topdown, onerror, followlinks):
                    yield entry

    # Yield before recursion if going top down
    if topdown:
        yield top, dirs, nondirs

        # Recurse into sub-directories
        for name in dirs:
            new_path = join(top, name)
            # Issue #23605: os.path.islink() is used instead of caching
            # entry.is_symlink() result during the loop on os.scandir()
            # because the caller can replace the directory entry during
            # the "yield" above.
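            # Since entries were classified with is_dir(follow_symlinks=False)
            # above, symlinks to directories land in nondirs and never reach
            # this loop; the islink() check below therefore only matters if
            # the caller edited `dirs` during the yield.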
            if followlinks or not islink(new_path):
                for entry in walk(new_path, topdown, onerror, followlinks):
                    yield entry
    else:
        # Yield after recursion if going bottom up
        yield top, dirs, nondirs

def run(args):
    basecalled_files = set()

    print("Walk basecalls\n")
    for root, dirs, files in walk(args.basecalled, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                basecalled_files.add(name)

    print("Walk staging\n")
    # don't stage already staged files
    for root, dirs, files in walk(args.staging, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                basecalled_files.add(name)

    already_processed = set()
    print(len(already_processed))

    print("Walk prebasecalled\n")

    for root, dirs, files in walk(args.prebasecalled, topdown=False):
        for filename in files:
            # strip a trailing .tmp suffix left by in-flight transfers
            if filename.endswith('.tmp'):
                name = filename[:-4]
            else:
                name = filename

            if name not in basecalled_files and \
               name.endswith('.fast5') and \
               name not in already_processed:
                print("Processing %s" % (name,))

                #delta = time.time() - os.stat(root+'/'+filename).st_mtime
                #if delta < (30*60):
                #    print("Skipping as too new: %s" % (delta,))
                #    continue

                flowcell = ''
                samplename = ''

                f = re.search(r'_2\d{7}_(F.*?)_', name)
                if f:
                    flowcell = f.group(1)
                f = re.search(r'_(sequencing_run|mux_scan)_(.*_\d+)_read', name)
                if f:
                    samplename = f.group(2)
                else:
                    f = re.search(r'_(sequencing_run|mux_scan)_(.*)_ch(\d+)_read(\d+)', name)
                    if f:
                        samplename = f.group(2)

                if args.organiseby == 'flowcell' and flowcell:
                    directory_name = flowcell
                elif args.organiseby == 'sample' and samplename:
                    directory_name = samplename
                elif args.organiseby == 'nothing':
                    directory_name = ''
                else:
                    if flowcell and samplename:
                        directory_name = "%s/%s" % (flowcell, samplename)
                    elif samplename:
                        directory_name = samplename
                    elif flowcell:
                        directory_name = flowcell
                    else:
                        print("Skipping %s" % (name,), file=sys.stderr)
                        continue

                albacore_root = root[len(args.prebasecalled):]
                # link it into the staging area, preserving the relative subdirectory
                checkdir = args.staging + '/' + directory_name + '/' + albacore_root
                if not os.path.exists(checkdir):
                    os.makedirs(checkdir)
                movefrom = args.prebasecalled + '/' + albacore_root + '/' + filename
                moveto = args.staging + '/' + directory_name + '/' + albacore_root + '/' + name

                print("Copy %s to %s" % (movefrom, moveto))

                abspath = os.path.abspath(movefrom)
                os.symlink(abspath, moveto)

                already_processed.add(name)

parser = argparse.ArgumentParser(description='Stage files for processing.')
parser.add_argument('prebasecalled',
                    help='directory containing non-basecalled reads')
parser.add_argument('basecalled',
                    help='directory containing basecalled reads')
parser.add_argument('staging',
                    help='staging directory')
parser.add_argument('--organiseby', choices=('wotevs', 'flowcell', 'sample', 'nothing'), default='wotevs',
                    help='organise reads by a specific part of the read file name')

args = parser.parse_args()
run(args)

--------------------------------------------------------------------------------
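
As a footnote on the file-name parsing shared by reportflowcells.py and stageflowcells.py, the sketch below walks a hypothetical MinKNOW-style file name (real names vary by MinKNOW/Albacore version) through stageflowcells.py's three regexes:

    import re

    # Hypothetical read file name -- illustrative only.
    name = "minion1_20170619_FAH12345_sequencing_run_sample1_ch7_read123_strand.fast5"

    flowcell = samplename = ''
    f = re.search(r'_2\d{7}_(F.*?)_', name)  # date stamp, then flowcell ID
    if f:
        flowcell = f.group(1)
    f = re.search(r'_(sequencing_run|mux_scan)_(.*_\d+)_read', name)
    if f:
        samplename = f.group(2)
    else:
        # fallback for names with explicit channel/read fields
        f = re.search(r'_(sequencing_run|mux_scan)_(.*)_ch(\d+)_read(\d+)', name)
        if f:
            samplename = f.group(2)

    print(flowcell, samplename)  # FAH12345 sample1

With the default ``--organiseby wotevs``, this read would be staged under ``FAH12345/sample1/``.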