├── README.md
├── aggregate_stats.py
├── copyunprocessedfiles.py
├── dups.py
├── go_albacore.sh
├── launchbasecalling.py
├── moveprocessedfiles.py
├── reportflowcells.py
└── stageflowcells.py

/README.md:
--------------------------------------------------------------------------------
# nanopore-basecalling-scripts

Nick Loman
19th June 2017

## Introduction

Some simple scripts to ease management and local basecalling of millions of FAST5 files.

These scripts are designed to help with the following (common) occurrences:

* Albacore crashing, the disk filling or the power going out during a basecalling run, when you wish to start back where you left off.
* Live basecalling on a server while files are being synchronised over the network in real time from a MinKNOW PC.
* Directories getting muddled with the results of multiple sequencing runs from different flowcells.

## Basic usage

The scripts work with three main directories:

* ``data`` - the directory (including subfolders) where reads are uploaded to
* ``staging`` - a directory that basecalling is run from, organised by flowcell ID
* ``basecalls`` - the final results directory with the basecalls from Albacore

To run the scripts, we suggest the following pipeline:

### Stage files

This step will make a symbolic link in ``staging`` to every file that needs to be processed. It won't stage files that have already been basecalled (as determined by their file name):

``python stageflowcells.py data basecalls staging``

### Basecall

Basecall as normal with Albacore, substituting $flowcell as appropriate:

``read_fast5_basecaller.py -r -i staging/$flowcell -s basecalls/$flowcell ...``

### Live Basecalling

If synchronising from the MinKNOW PC to a server, you can run stageflowcells.py and then Albacore in a loop, nuking the staging directory each time, i.e.:

    rm -rf staging
    python stageflowcells.py data basecalls staging
    read_fast5_basecaller.py -r -i staging/$flowcell -s basecalls/$flowcell ...

## How to sync to a server

We like to use ``rsync`` on the MinKNOW laptop. Mac and Linux machines will have ``rsync`` installed already; on Windows PCs we use Cygwin.

We typically use a recipe like this to transfer all reads matching ``*.fast5`` into the data directory, over an SSH connection:

- Start a new Cygwin window

- Change directory to c:\data\reads, e.g.

  ``cd /cygdrive/c/data/reads``

- To rsync on a loop, run the following, replacing ``USER``, ``SERVER`` and ``/REMOTE/DIRECTORY/data``:

      while true;
      do
        rsync -vr --remove-source-files --include "*.fast5" --include "*/" --exclude "*" . USER@SERVER:/REMOTE/DIRECTORY/data
        sleep 5 ;
      done

- ``--remove-source-files`` will remove the FAST5 files after they are transferred! Useful if you want to stop the local MinKNOW PC hard disk from filling up.

Don't use that flag if you want to keep a local copy. In that case, move the files elsewhere from time to time, or the directory will get very full and you will end up with a hard-to-manage mix of files from different runs.
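
If you would rather drive the transfer from Python (say, to add logging or retries), a minimal sketch of the same loop is below. It assumes Python 3.5+ for ``subprocess.run`` and uses the same hypothetical ``USER``/``SERVER`` placeholders as the recipe above:

    import subprocess
    import time

    # Placeholder destination, as in the rsync recipe above -- replace before use.
    REMOTE = "USER@SERVER:/REMOTE/DIRECTORY/data"

    while True:
        # Identical flags to the shell recipe; drop --remove-source-files
        # if you want to keep a local copy of the reads.
        result = subprocess.run(
            ["rsync", "-vr", "--remove-source-files",
             "--include", "*.fast5", "--include", "*/", "--exclude", "*",
             ".", REMOTE])
        if result.returncode != 0:
            print("rsync exited with %d; retrying" % result.returncode)
        time.sleep(5)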
### Alternative sync methods

An alternative approach suggested by Mick Watson is to copy to a remote server via a network share (e.g. a Samba share) and use Robocopy.exe on Windows.

## Related projects

* Please see Alexis Lucattini's [Poreduck](https://github.com/alexiswl/poreduck) for another take on this problem.

--------------------------------------------------------------------------------
/aggregate_stats.py:
--------------------------------------------------------------------------------
import pandas as pd
import sys
import glob

frames = []

for sample in sys.argv[1:]:
    # Collect the archived reports plus the current summary for this sample.
    reports = glob.glob("%s/reports/seque*" % (sample,))
    reports.append("%s/sequencing_summary.txt" % (sample,))
    print(reports)

    for report in reports:
        try:
            df = pd.read_csv(report, sep="\t")
        except (IOError, ValueError):
            # skip missing or unparseable reports
            continue

        df['sample'] = sample
        frames.append(df)

# One big table of all per-read summaries, tagged by sample.
master_df = pd.concat(frames)
master_df.to_csv('aggregate_stats.txt', sep="\t")

--------------------------------------------------------------------------------
/copyunprocessedfiles.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import sys
import shutil

input_dir = sys.argv[1]
albacore_dir = sys.argv[2]
process_dir = sys.argv[3]

basecalled_files = set()

# File names that Albacore has already basecalled...
for root, dirs, files in os.walk(albacore_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

# ...and files that have already been staged: don't copy either.
for root, dirs, files in os.walk(process_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

for root, dirs, files in os.walk(input_dir, topdown=False):
    for name in files:
        if name not in basecalled_files:
            albacore_root = root[len(input_dir):]
            # copy it, preserving the relative subdirectory
            checkdir = process_dir + '/' + albacore_root
            if not os.path.exists(checkdir):
                os.makedirs(checkdir)
            movefrom = input_dir + '/' + albacore_root + '/' + name
            moveto = process_dir + '/' + albacore_root + '/' + name
            print("Copy %s to %s" % (movefrom, moveto))
            shutil.copy(movefrom, moveto)

--------------------------------------------------------------------------------
/dups.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import argparse

def run(args):
    seen_files = set()

    for root, dirs, files in os.walk(args.dir, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                if name in seen_files:
                    # root already includes args.dir, so join with root only
                    fn = os.path.join(root, name)
                    print(fn)
                    os.unlink(fn)
                seen_files.add(name)

parser = argparse.ArgumentParser(description='Remove duplicate FAST5 files by name.')
parser.add_argument('dir', help='Directory to remove duplicates from')

args = parser.parse_args()
run(args)
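
A note on dups.py: it treats two FAST5 files as duplicates purely by basename and keeps whichever copy the walk reaches first. If you want reassurance before unlinking anything, a hedged variant (not part of this repo; ``data`` is a placeholder directory) could first check that the contents really are identical:

    import hashlib
    import os

    def file_md5(path, blocksize=1 << 20):
        # Chunked read keeps memory flat for large FAST5 files.
        h = hashlib.md5()
        with open(path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(blocksize), b''):
                h.update(chunk)
        return h.hexdigest()

    seen = {}  # basename -> (path, md5) of the first copy encountered

    for root, dirs, files in os.walk('data'):  # 'data' is a placeholder
        for name in files:
            if not name.endswith('.fast5'):
                continue
            fn = os.path.join(root, name)
            if name in seen:
                first_path, first_md5 = seen[name]
                if file_md5(fn) == first_md5:
                    print("duplicate of %s: %s" % (first_path, fn))
                    os.unlink(fn)
                # same name but different content is left alone
            else:
                seen[name] = (fn, file_md5(fn))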

--------------------------------------------------------------------------------
/go_albacore.sh:
--------------------------------------------------------------------------------
#!/bin/bash
tag=$1

#for var in "$@"
#do
#  python copyunprocessedfiles.py data/"$var" basecalls/"$tag" staging/"$tag"
#done

python moveprocessedfiles.py staging/"$tag" basecalls/"$tag"/workspace processed/"$tag"_processed
time read_fast5_basecaller.py --input staging/"$tag" --worker_threads 32 -c r94_450bps_linear.cfg -s basecalls/"$tag" -r -o fast5

--------------------------------------------------------------------------------
/launchbasecalling.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

import sys
from dask import compute, delayed
import dask.multiprocessing
import os
import os.path
import datetime
import shutil

staging_dir = sys.argv[1]
gridion_basecalls = sys.argv[2]

class MyTask:
    def __init__(self, staging, basecalls, dirname):
        self.staging = staging
        self.basecalls = basecalls
        self.dirname = dirname

def process(t):
    datstr = datetime.datetime.now().strftime("%Y-%m-%d-%H:%M:%S")

    # Archive the reports from any previous Albacore run on this directory,
    # time-stamped, so a restart doesn't overwrite them.
    outdir = "%s/%s" % (t.basecalls, t.dirname)
    for fn in ['configuration.cfg', 'pipeline.log', 'sequencing_summary.txt']:
        if os.path.exists(outdir + '/' + fn):
            if not os.path.exists(outdir + '/reports'):
                os.makedirs(outdir + '/reports')
            shutil.move(outdir + '/' + fn, outdir + '/reports/' + fn + '-' + datstr)

    cmd = ("read_fast5_basecaller.py -c r95_450bps_linear.cfg -i %s/%s -s %s/%s -t 12 -r -o fast5,fastq" % (t.staging, t.dirname, t.basecalls, t.dirname))
    os.system(cmd)

dirs = os.listdir(staging_dir)
#for d in dirs:
#    process(MyTask(staging_dir, gridion_basecalls, d))

# Basecall each staged directory in parallel via dask's multiprocessing scheduler.
values = [delayed(process)(MyTask(staging_dir, gridion_basecalls, x)) for x in dirs]
results = compute(*values, get=dask.multiprocessing.get)

--------------------------------------------------------------------------------
/moveprocessedfiles.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
import sys
import shutil

input_dir = sys.argv[1]
albacore_dir = sys.argv[2]
process_dir = sys.argv[3]

basecalled_files = set()

# Any file name present under the Albacore output directory has already
# been basecalled.
for root, dirs, files in os.walk(albacore_dir, topdown=False):
    for name in files:
        basecalled_files.add(name)

for root, dirs, files in os.walk(input_dir, topdown=False):
    for name in files:
        if name in basecalled_files:
            albacore_root = root[len(input_dir):]
            # move it, preserving the relative subdirectory
            checkdir = process_dir + '/' + albacore_root
            if not os.path.exists(checkdir):
                os.makedirs(checkdir)
            movefrom = input_dir + '/' + albacore_root + '/' + name
            moveto = process_dir + '/' + albacore_root + '/' + name
            print("Move %s to %s" % (movefrom, moveto))
            shutil.move(movefrom, moveto)
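
To make the path bookkeeping in copyunprocessedfiles.py and moveprocessedfiles.py concrete, here is a small worked example (hypothetical paths and file name) of how each file's relative subdirectory under the source is preserved under the destination:

    import os

    input_dir = "staging/FAH12345"        # hypothetical arguments
    process_dir = "processed/FAH12345"
    root = os.path.join(input_dir, "0")   # a subdirectory as yielded by os.walk
    name = "read_ch42_read1066.fast5"     # hypothetical file name

    # Strip the input_dir prefix to recover the relative subdirectory ("/0"),
    # then re-root it under process_dir.
    albacore_root = root[len(input_dir):]
    moveto = process_dir + '/' + albacore_root + '/' + name
    print(moveto)  # processed/FAH12345//0/read_ch42_read1066.fast5
                   # (the doubled slash is harmless on POSIX paths)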

--------------------------------------------------------------------------------
/reportflowcells.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import sys
from collections import defaultdict
import re

stats = defaultdict(int)

# Count reads per flowcell (falling back to the sample name) from a list
# of FAST5 file names supplied on stdin.
for name in sys.stdin:
    flowcell = ''
    samplename = ''
    f = re.search(r'_2\d{7}_(F.*?)_', name)
    if f:
        flowcell = f.group(1)
    f = re.search(r'_(sequencing_run|mux_scan)_(.*)_\d+_read', name)
    if f:
        samplename = f.group(2)

    if flowcell:
        directory_name = flowcell
    else:
        directory_name = samplename
    stats[directory_name] += 1

for k, v in stats.items():
    print("%s\t%s" % (k, v))

--------------------------------------------------------------------------------
/stageflowcells.py:
--------------------------------------------------------------------------------
# Written by Nick Loman @pathogenomenick

from __future__ import print_function

import os
import os.path
from os.path import join, islink
import sys
import re
import argparse
import time

try:
    from os import scandir
except ImportError:
    from scandir import scandir

# Modified version of scandir.walk that doesn't stat() symlinks to check if they are directories

def walk(top, topdown=True, onerror=None, followlinks=False):
    """Like Python 3.5's implementation of os.walk() -- faster than
    the pre-Python 3.5 version as it uses scandir() internally.
    """
    dirs = []
    nondirs = []

    # We may not have read permission for top, in which case we can't
    # get a list of the files the directory contains.  os.walk
    # always suppressed the exception then, rather than blow up for a
    # minor reason when (say) a thousand readable directories are still
    # left to visit.  That logic is copied here.
    try:
        scandir_it = scandir(top)
    except OSError as error:
        if onerror is not None:
            onerror(error)
        return

    while True:
        try:
            try:
                entry = next(scandir_it)
            except StopIteration:
                break
        except OSError as error:
            if onerror is not None:
                onerror(error)
            return

        try:
            is_dir = entry.is_dir(follow_symlinks=False)
        except OSError:
            # If is_dir() raises an OSError, consider that the entry is not
            # a directory, same behaviour as os.path.isdir().
            is_dir = False

        if is_dir:
            dirs.append(entry.name)
        else:
            nondirs.append(entry.name)

        if not topdown and is_dir:
            # Bottom-up: recurse into sub-directory, but exclude symlinks to
            # directories if followlinks is False
            if followlinks:
                walk_into = True
            else:
                try:
                    is_symlink = entry.is_symlink()
                except OSError:
                    # If is_symlink() raises an OSError, consider that the
                    # entry is not a symbolic link, same behaviour as
                    # os.path.islink().
                    is_symlink = False
                walk_into = not is_symlink

            if walk_into:
                for entry in walk(entry.path, topdown, onerror, followlinks):
                    yield entry

    # Yield before recursion if going top down
    if topdown:
        yield top, dirs, nondirs

        # Recurse into sub-directories
        for name in dirs:
            new_path = join(top, name)
            # Issue #23605: os.path.islink() is used instead of caching
            # entry.is_symlink() result during the loop on os.scandir()
            # because the caller can replace the directory entry during
            # the "yield" above.
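            # Since entries were classified with is_dir(follow_symlinks=False)
            # above, symlinks to directories land in nondirs and never reach
            # this loop; the islink() check below therefore only matters if
            # the caller edited `dirs` during the yield.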
            if followlinks or not islink(new_path):
                for entry in walk(new_path, topdown, onerror, followlinks):
                    yield entry
    else:
        # Yield after recursion if going bottom up
        yield top, dirs, nondirs

def run(args):
    basecalled_files = set()

    print("Walk basecalls\n")
    for root, dirs, files in walk(args.basecalled, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                basecalled_files.add(name)

    print("Walk staging\n")
    # don't stage already staged files
    for root, dirs, files in walk(args.staging, topdown=False):
        for name in files:
            if name.endswith('.fast5'):
                basecalled_files.add(name)

    already_processed = set()
    print(len(already_processed))

    print("Walk prebasecalled\n")

    for root, dirs, files in walk(args.prebasecalled, topdown=False):
        for filename in files:
            # strip a trailing .tmp suffix left by in-flight transfers
            if filename.endswith('.tmp'):
                name = filename[:-4]
            else:
                name = filename

            if name not in basecalled_files and \
               name.endswith('.fast5') and \
               name not in already_processed:
                print("Processing %s" % (name,))

                #delta = time.time() - os.stat(root+'/'+filename).st_mtime
                #if delta < (30*60):
                #    print("Skipping as too new: %s" % (delta,))
                #    continue

                flowcell = ''
                samplename = ''

                f = re.search(r'_2\d{7}_(F.*?)_', name)
                if f:
                    flowcell = f.group(1)
                f = re.search(r'_(sequencing_run|mux_scan)_(.*_\d+)_read', name)
                if f:
                    samplename = f.group(2)
                else:
                    f = re.search(r'_(sequencing_run|mux_scan)_(.*)_ch(\d+)_read(\d+)', name)
                    if f:
                        samplename = f.group(2)

                if args.organiseby == 'flowcell' and flowcell:
                    directory_name = flowcell
                elif args.organiseby == 'sample' and samplename:
                    directory_name = samplename
                elif args.organiseby == 'nothing':
                    directory_name = ''
                else:
                    if flowcell and samplename:
                        directory_name = "%s/%s" % (flowcell, samplename)
                    elif samplename:
                        directory_name = samplename
                    elif flowcell:
                        directory_name = flowcell
                    else:
                        print("Skipping %s" % (name,), file=sys.stderr)
                        continue

                albacore_root = root[len(args.prebasecalled):]
                # link it into the staging area, preserving the relative subdirectory
                checkdir = args.staging + '/' + directory_name + '/' + albacore_root
                if not os.path.exists(checkdir):
                    os.makedirs(checkdir)
                movefrom = args.prebasecalled + '/' + albacore_root + '/' + filename
                moveto = args.staging + '/' + directory_name + '/' + albacore_root + '/' + name

                print("Copy %s to %s" % (movefrom, moveto))

                abspath = os.path.abspath(movefrom)
                os.symlink(abspath, moveto)

                already_processed.add(name)

parser = argparse.ArgumentParser(description='Stage files for processing.')
parser.add_argument('prebasecalled',
                    help='directory containing non-basecalled reads')
parser.add_argument('basecalled',
                    help='directory containing basecalled reads')
parser.add_argument('staging',
                    help='staging directory')
parser.add_argument('--organiseby', choices=('wotevs', 'flowcell', 'sample', 'nothing'), default='wotevs',
                    help='organise reads by a specific part of the read file name')

args = parser.parse_args()
run(args)

--------------------------------------------------------------------------------
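
As a footnote on the file-name parsing shared by reportflowcells.py and stageflowcells.py, the sketch below walks a hypothetical MinKNOW-style file name (real names vary by MinKNOW/Albacore version) through stageflowcells.py's three regexes:

    import re

    # Hypothetical read file name -- illustrative only.
    name = "minion1_20170619_FAH12345_sequencing_run_sample1_ch7_read123_strand.fast5"

    flowcell = samplename = ''
    f = re.search(r'_2\d{7}_(F.*?)_', name)  # date stamp, then flowcell ID
    if f:
        flowcell = f.group(1)
    f = re.search(r'_(sequencing_run|mux_scan)_(.*_\d+)_read', name)
    if f:
        samplename = f.group(2)
    else:
        # fallback for names with explicit channel/read fields
        f = re.search(r'_(sequencing_run|mux_scan)_(.*)_ch(\d+)_read(\d+)', name)
        if f:
            samplename = f.group(2)

    print(flowcell, samplename)  # FAH12345 sample1

With the default ``--organiseby wotevs``, this read would be staged under ``FAH12345/sample1/``.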