├── LICENSE.txt
├── README.md
├── align.sh
├── align_transcription.py
├── chime5.json
├── estimate_alignment.py
├── plots
│   ├── align_S03_P.pdf
│   └── align_S03_U.pdf
├── requirements.txt
├── transcript_utils.py
└── view_alignments.py

/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 University of Sheffield (Jon Barker)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# CHiME-5 Array Synchronisation Baseline

## About

This repository contains the code used to obtain the baseline device synchronisation for the CHiME-5 challenge. See the [challenge website](http://spandh.dcs.shef.ac.uk/chime_challenge/) for further details.

CHiME-5 employs audio recordings made simultaneously by a number of different devices. The start of each device recording has been synchronised by aligning the onset of a synchronisation tone played at the beginning of each recording session. However, the signals become progressively out of synch due to a combination of clock drift (on all devices) and occasional frame-dropping on the Kinects.

This misalignment problem is solved by performing an analysis to align the signals and then using this alignment to produce the device-dependent utterance start and end times that appear in the transcripts.

This is performed in two steps:

`estimate_alignment.py`: Estimation of a 'recording time' to 'signal delay' mapping between a reference binaural recorder and all other devices. The delay between a pair of channels is estimated at regular intervals throughout the party by locating the peak in a cross-correlation between windows of each signal. Estimation is performed in two passes: first at 10 second intervals, and then at 1 second intervals during periods of rapid change. Estimates are fitted with a linear regression when comparing binaural mics (no frame dropping) and are smoothed with a median filter when comparing the binaural mic to Kinect recordings.
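In essence, each delay estimate is a normalised cross-correlation peak search: a short window from the device being aligned is slid over a slightly wider window of the reference recording. The sketch below is a NumPy-only illustration of the idea, with an invented function name and arguments — the actual implementation reads the windows straight from the wav files and uses `cv2.matchTemplate` with `TM_CCOEFF_NORMED`:

```
import numpy as np

def estimate_lag(ref, other, sample_rate, t, search=5.0, template=20.0):
    """Toy lag estimate at reference time t (seconds, assumes t >= search).

    Slides a `template`-second window of `other` over a wider window of
    `ref` and returns the offset (in seconds) with the highest normalised
    correlation. O(N*M), so for illustration only.
    """
    tmpl = other[int(t * sample_rate):int((t + template) * sample_rate)]
    win = ref[int((t - search) * sample_rate):
              int((t + template + search) * sample_rate)]
    tmpl = (tmpl - tmpl.mean()) / (tmpl.std() + 1e-9)
    scores = [np.dot(tmpl, (seg - seg.mean()) / (seg.std() + 1e-9))
              for seg in (win[i:i + tmpl.size]
                          for i in range(win.size - tmpl.size + 1))]
    return np.argmax(scores) / sample_rate - search  # 0.0 == already aligned
```

In the real code the lag is estimated against both the left and right channels of the binaural reference, and whichever channel gives the higher correlation score is kept.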
`align_transcription.py`: The transcript files are augmented with the device-dependent utterance timings. The original binaural recorder transcription times are first mapped onto the reference binaural recorder, and then from the reference recorder onto each of the Kinects.

## Installation

1. Edit `align.sh` to set the variable `chime5_corpus` to point to your CHiME-5 installation.

2. Install the Python module requirements using

```
pip install -r requirements.txt
```

## Usage

A script `align.sh` has been provided to show how the tools were used to compute the device-dependent times appearing in the CHiME-5 challenge transcription files. The script processes session S03 but can easily be edited to process other sessions.

The script runs both alignment passes (the initial alignment and the refinement). It then applies the alignment to the transcription file to generate the corrected device-dependent timings (which should match the timings already in the distributed files). It finishes by displaying plots of the alignments for each channel. These should look like the following:

Binaural recorders: [plots/align_S03_P.pdf](plots/align_S03_P.pdf)

Kinect devices: [plots/align_S03_U.pdf](plots/align_S03_U.pdf)

#### Notes

- Python 3.6 or later is required.

- The output transcriptions appear in the directory `transcriptions_aligned`.

- The script was initially run on the original CHiME-5 audio recordings before any redactions were made. For this reason, when it is re-run on the distributed version of the CHiME-5 data it may produce some utterance timings that differ from those appearing in the distributed transcripts.

- It is necessary to run both passes of the alignment process before running `align_transcription.py`. Skipping the refinement stage will lead to an error.
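#### Output format

For orientation, each utterance in the aligned transcripts ends up with one start and end time per device alongside the original annotation times. The entry below is purely illustrative (invented times, only one Kinect shown, and other fields such as the transcribed words omitted):

```
{
    "speaker": "P09",
    "start_time": {
        "original": "0:32:10.22",
        "P09": "0:32:10.22",
        "U01": "0:32:10.47"
    },
    "end_time": {
        "original": "0:32:12.85",
        "P09": "0:32:12.85",
        "U01": "0:32:13.10"
    }
}
```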
-d "$aligned_dir" ] && mkdir "$aligned_dir" 29 | 30 | # Perform the course alignment pass (slowest step) 31 | python3 estimate_alignment.py --sessions "$session" "$audio_dir"/"$dataset" "$align_data"/first_pass 32 | 33 | # Peform the alignment refinement pass 34 | python3 estimate_alignment.py --refine "$align_data"/first_pass --sessions "$session" "$audio_dir"/"$dataset" "$align_data"/refined 35 | 36 | # Apply the alignment to the transcript file 37 | python3 align_transcription.py --sessions "$session" "$align_data"/refined "$chime5_corpus"/transcriptions/"$dataset" "$aligned_dir" 38 | 39 | # View the alignment 40 | python3 view_alignments.py --sessions "$session" "$align_data"/refined 41 | -------------------------------------------------------------------------------- /align_transcription.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2018 University of Sheffield (Jon Barker) 4 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 5 | 6 | """ 7 | align_transcription.py 8 | 9 | Apply the alignments to the transcription file 10 | """ 11 | 12 | import pickle 13 | import argparse 14 | import traceback 15 | import numpy as np 16 | 17 | import transcript_utils as tu 18 | 19 | CHIME_DATA = tu.chime_data() 20 | 21 | 22 | def correct_time_linear(time_to_correct, linear_fit): 23 | """Adjust the time using a linear fit of time to lag.""" 24 | corrected = time_to_correct - linear_fit * time_to_correct 25 | return corrected 26 | 27 | 28 | def correct_time_mapping(time_to_correct, linear_fit, times, lags): 29 | """Adjust the time using a linear fit + a mapping from time to lag.""" 30 | corrected = np.interp(time_to_correct + linear_fit * time_to_correct, 31 | np.array(times) + lags, np.array(times)) 32 | return corrected 33 | 34 | 35 | def align_kinect(kinect, transcription, align_data): 36 | """Add alignments for a given kinect.""" 37 | last_lag = align_data[kinect]['lag'][-1] 38 | last_time = align_data[kinect]['times'][-1] 39 | # Plus 100 in next line just to ensure that there is a lag 40 | # recorded beyond end of party duration. This is needed to 41 | # make sure that offset is correctly interpolated for 42 | # utterances in last few seconds of party given that 43 | # lags are only estimated every 10 seconds. 
--------------------------------------------------------------------------------
/align_transcription.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

"""
align_transcription.py

Apply the alignments to the transcription files.
"""

import argparse
import pickle
import traceback

import numpy as np

import transcript_utils as tu

CHIME_DATA = tu.chime_data()


def correct_time_linear(time_to_correct, linear_fit):
    """Adjust the time using a linear fit of time to lag."""
    corrected = time_to_correct - linear_fit * time_to_correct
    return corrected


def correct_time_mapping(time_to_correct, linear_fit, times, lags):
    """Adjust the time using a linear fit plus a mapping from time to lag."""
    corrected = np.interp(time_to_correct + linear_fit * time_to_correct,
                          np.array(times) + lags, np.array(times))
    return corrected


def align_kinect(kinect, transcription, align_data):
    """Add alignments for a given kinect."""
    last_lag = align_data[kinect]['lag'][-1]
    last_time = align_data[kinect]['times'][-1]
    # The 100 in the next line just ensures that there is a lag
    # recorded beyond the end of the party. This is needed to
    # make sure that the offset is correctly interpolated for
    # utterances in the last few seconds of the party, given that
    # lags are only estimated every 10 seconds.
    times = [0] + list(align_data[kinect]['times']) + [last_time + 100]
    lags = [0] + list(align_data[kinect]['lag']) + [last_lag]
    for utterance in transcription:
        if 'speaker' not in utterance:  # Skip redacted segments
            continue
        linear_fit = 0
        pid = utterance['speaker']
        if pid in align_data and 'linear_fit' in align_data[pid]:
            linear_fit = align_data[pid]['linear_fit'][0]
        utterance['start_time'][kinect] = correct_time_mapping(utterance['start_time']['original'],
                                                               linear_fit, times, lags)
        utterance['end_time'][kinect] = correct_time_mapping(utterance['end_time']['original'],
                                                             linear_fit, times, lags)


def align_participant(pid, transcription, align_data):
    """Add alignments for a given binaural participant recording."""
    for utterance in transcription:
        if 'speaker' not in utterance:  # Skip redacted segments
            continue
        utterance_pid = utterance['speaker']

        to_ref_linear = 0
        from_ref_linear = 0
        if utterance_pid in align_data:
            to_ref_linear = -align_data[utterance_pid]['linear_fit'][0]
        if pid in align_data:
            from_ref_linear = align_data[pid]['linear_fit'][0]
        linear_fit = to_ref_linear + from_ref_linear

        utterance['start_time'][pid] = correct_time_linear(
            utterance['start_time']['original'], linear_fit)
        utterance['end_time'][pid] = correct_time_linear(
            utterance['end_time']['original'], linear_fit)


def align_transcription(session, align_data_path, in_path, out_path):
    """Apply the kinect and binaural alignments to a session's transcription file."""
    align_data = pickle.load(open(f'{align_data_path}/align.{session}.p', 'rb'))
    transcription = tu.load_transcript(session, in_path, convert=True)

    # Compute alignments for the kinects
    kinects = CHIME_DATA[session]['kinects']
    for kinect in kinects:
        print(kinect)
        align_kinect(kinect, transcription, align_data)

    # Compute alignments for the TASCAM binaural mics
    if CHIME_DATA[session]['dataset'] in ['train', 'dev']:
        pids = CHIME_DATA[session]['pids']
        for pid in pids:
            print(pid)
            align_participant(pid, transcription, align_data)

    tu.save_transcript(transcription, session, out_path, convert=True)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions",
                        help="list of sessions to process (defaults to all)")
    parser.add_argument("align_path", help="path for the alignment pickle files")
    parser.add_argument("in_path", help="path for the input transcription files")
    parser.add_argument("out_path", help="path for the output transcription files")
    args = parser.parse_args()
    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    for session in sessions:
        try:
            print(session)
            align_transcription(session, args.align_path, args.in_path, args.out_path)
        except Exception:
            traceback.print_exc()


if __name__ == '__main__':
    main()
"P09", 23 | "P10", 24 | "P11", 25 | "P12" 26 | ], 27 | "kinects": [ 28 | "U01", 29 | "U02", 30 | "U03", 31 | "U04", 32 | "U05", 33 | "U06" 34 | ] 35 | }, 36 | "S04": { 37 | "dataset": "train", 38 | "pids": [ 39 | "P09", 40 | "P10", 41 | "P11", 42 | "P12" 43 | ], 44 | "kinects": [ 45 | "U01", 46 | "U02", 47 | "U03", 48 | "U04", 49 | "U05", 50 | "U06" 51 | ] 52 | }, 53 | "S05": { 54 | "dataset": "train", 55 | "pids": [ 56 | "P13", 57 | "P14", 58 | "P15", 59 | "P16" 60 | ], 61 | "kinects": [ 62 | "U01", 63 | "U02", 64 | "U04", 65 | "U05", 66 | "U06" 67 | ], 68 | "missing": { 69 | "U04": [ 70 | 4257.78, 71 | 282.50 72 | ] 73 | } 74 | }, 75 | "S06": { 76 | "dataset": "train", 77 | "pids": [ 78 | "P13", 79 | "P14", 80 | "P15", 81 | "P16" 82 | ], 83 | "kinects": [ 84 | "U01", 85 | "U02", 86 | "U03", 87 | "U04", 88 | "U05", 89 | "U06" 90 | ] 91 | }, 92 | "S07": { 93 | "dataset": "train", 94 | "pids": [ 95 | "P17", 96 | "P18", 97 | "P19", 98 | "P20" 99 | ], 100 | "kinects": [ 101 | "U01", 102 | "U02", 103 | "U03", 104 | "U04", 105 | "U05", 106 | "U06" 107 | ] 108 | }, 109 | "S08": { 110 | "dataset": "train", 111 | "pids": [ 112 | "P21", 113 | "P22", 114 | "P23", 115 | "P24" 116 | ], 117 | "kinects": [ 118 | "U01", 119 | "U02", 120 | "U03", 121 | "U04", 122 | "U05", 123 | "U06" 124 | ] 125 | }, 126 | "S09": { 127 | "dataset": "dev", 128 | "pids": [ 129 | "P25", 130 | "P26", 131 | "P27", 132 | "P28" 133 | ], 134 | "kinects": [ 135 | "U01", 136 | "U02", 137 | "U03", 138 | "U04", 139 | "U06" 140 | ] 141 | }, 142 | "S12": { 143 | "dataset": "train", 144 | "pids": [ 145 | "P33", 146 | "P34", 147 | "P35", 148 | "P36" 149 | ], 150 | "kinects": [ 151 | "U01", 152 | "U02", 153 | "U03", 154 | "U04", 155 | "U05", 156 | "U06" 157 | ] 158 | }, 159 | "S13": { 160 | "dataset": "train", 161 | "pids": [ 162 | "P33", 163 | "P34", 164 | "P35", 165 | "P36" 166 | ], 167 | "kinects": [ 168 | "U01", 169 | "U02", 170 | "U03", 171 | "U04", 172 | "U05", 173 | "U06" 174 | ] 175 | }, 176 | "S16": { 177 | "dataset": "train", 178 | "pids": [ 179 | "P21", 180 | "P22", 181 | "P23", 182 | "P24" 183 | ], 184 | "kinects": [ 185 | "U01", 186 | "U02", 187 | "U03", 188 | "U04", 189 | "U05", 190 | "U06" 191 | ] 192 | }, 193 | "S17": { 194 | "dataset": "train", 195 | "pids": [ 196 | "P17", 197 | "P18", 198 | "P19", 199 | "P20" 200 | ], 201 | "kinects": [ 202 | "U01", 203 | "U02", 204 | "U03", 205 | "U04", 206 | "U05", 207 | "U06" 208 | ] 209 | }, 210 | "S18": { 211 | "dataset": "train", 212 | "pids": [ 213 | "P41", 214 | "P42", 215 | "P43", 216 | "P44" 217 | ], 218 | "kinects": [ 219 | "U01", 220 | "U02", 221 | "U03", 222 | "U04", 223 | "U05", 224 | "U06" 225 | ], 226 | "missing": { 227 | "U06": [2659.92, 66.71] 228 | } 229 | }, 230 | "S19": { 231 | "dataset": "train", 232 | "pids": [ 233 | "P49", 234 | "P50", 235 | "P51", 236 | "P52" 237 | ], 238 | "kinects": [ 239 | "U01", 240 | "U02", 241 | "U03", 242 | "U04", 243 | "U05", 244 | "U06" 245 | ] 246 | }, 247 | "S20": { 248 | "dataset": "train", 249 | "pids": [ 250 | "P49", 251 | "P50", 252 | "P51", 253 | "P52" 254 | ], 255 | "kinects": [ 256 | "U01", 257 | "U02", 258 | "U03", 259 | "U04", 260 | "U05", 261 | "U06" 262 | ] 263 | }, 264 | "S21": { 265 | "dataset": "eval", 266 | "pids": [ 267 | "P45", 268 | "P46", 269 | "P47", 270 | "P48" 271 | ], 272 | "kinects": [ 273 | "U01", 274 | "U02", 275 | "U03", 276 | "U04", 277 | "U05", 278 | "U06" 279 | ] 280 | }, 281 | "S22": { 282 | "dataset": "train", 283 | "pids": [ 284 | "P41", 285 | "P42", 286 | "P43", 287 | "P44" 288 | ], 289 | "kinects": [ 
290 | "U01", 291 | "U02", 292 | "U04", 293 | "U05", 294 | "U06" 295 | ] 296 | }, 297 | "S23": { 298 | "dataset": "train", 299 | "pids": [ 300 | "P53", 301 | "P54", 302 | "P55", 303 | "P56" 304 | ], 305 | "kinects": [ 306 | "U01", 307 | "U02", 308 | "U03", 309 | "U04", 310 | "U05", 311 | "U06" 312 | ] 313 | }, 314 | "S24": { 315 | "dataset": "train", 316 | "pids": [ 317 | "P53", 318 | "P54", 319 | "P55", 320 | "P56" 321 | ], 322 | "kinects": [ 323 | "U01", 324 | "U02", 325 | "U03", 326 | "U04", 327 | "U05", 328 | "U06" 329 | ], 330 | "missing": { 331 | "U06": [ 332 | 5527.40, 333 | 8.60 334 | ] 335 | } 336 | } 337 | } 338 | -------------------------------------------------------------------------------- /estimate_alignment.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2018 University of Sheffield (Jon Barker) 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | """ 7 | python3 estimate_alignment.py 8 | 9 | Estimate an alignment between the reference binaural recording channel and all other audio 10 | channels in a session. An alignment is a series of (time, lag) pairs where the 'lag' between 11 | the two channels is measured at a given 'time' point within the reference signal. Alignment 12 | are initialially computed at 10 second intervals. In periods where the lag changes rapidly 13 | the alignment is recomputed at 1 second intervals. 14 | 15 | The alignment at a time point is estimated by looking for the offset that maximises a 16 | cross correlation between the two signals being aligned. 17 | 18 | The alignments are stored in pickle files called align.{session}.p. They are then read 19 | and applied to the JSON transcript files by align_transcription.py 20 | """ 21 | 22 | import argparse 23 | import cv2 24 | import numpy as np 25 | import pickle 26 | import scipy.signal 27 | import struct 28 | import sys 29 | import traceback 30 | import wave 31 | 32 | import transcript_utils as tu 33 | 34 | # For Bin-to-bin 35 | BINAURAL_RESOLUTION = 100 # analysis resolution (in seconds) 36 | BINAURAL_SEARCH_DURATION = 0.5 # Max misalignment (in seconds) 37 | BINAURAL_TEMPLATE_DURATION = 20 # Duration of segment that is matched (in seconds) 38 | 39 | # # For Bin-to-Kinect 40 | KINECT_RESOLUTION = 10 # analysis resolution (in seconds) 41 | # KINECT_SEARCH_DURATION = 3 # Max misalignment (in seconds) 42 | KINECT_SEARCH_DURATION = 5 # Max misalignment (in seconds) 43 | KINECT_TEMPLATE_DURATION = 20 # Duration of segment that is matched (in seconds) 44 | 45 | 46 | def wavfile_duration(wavfile): 47 | """Return the duration of a wav file in seconds.""" 48 | wave_fp = wave.open(wavfile, 'rb') 49 | sample_rate = wave_fp.getframerate() 50 | nframes = wave_fp.getnframes() 51 | return nframes/sample_rate 52 | 53 | 54 | def readwav(wavfile, duration, start_time=0, channel=0): 55 | """Read a segment of a wav file. 56 | 57 | If stereo then can select channel 0 or channel 1. 
58 | """ 59 | wave_fp = wave.open(wavfile, 'rb') 60 | sample_rate = wave_fp.getframerate() 61 | nchannels = wave_fp.getnchannels() 62 | nsamples = int(duration * sample_rate) 63 | wave_fp.setpos(int(start_time * sample_rate)) 64 | wavbytes = wave_fp.readframes(nsamples) 65 | signal = struct.unpack(f'{nsamples * nchannels}h', wavbytes) 66 | if nchannels == 2: 67 | signal = signal[channel::2] 68 | return signal, sample_rate 69 | 70 | 71 | def find_align(target_wavfile, template_wavfile, 72 | align_time, search_duration, template_duration, channel=0, missing=None): 73 | """Find lag between a pair of signals by correlating a short segment of the template 74 | signal (from one of the Kinects) against the target signal (the reference binaural recorder) 75 | 76 | This is the low level signal matching code called by the align_channel() function. It uses 77 | cv2.matchTemplate to do the work 78 | 79 | Arguments: 80 | target_wavfile -- name of the target signal wav file 81 | template_wavfile -- name of the template signal wav file 82 | align_time -- the time within the signals at which to measure the lag 83 | search_duration -- the max delay to consider 84 | template_duration -- length of window to use when computing correlations 85 | channel -- either use left (0) or right (1) stereo channel of the binaural recorder 86 | missing -- for dealing with Kinect recordings where a large chunk of data has gone missing. 87 | missing = (missing_time, missing_duration) 88 | missing_time - the time at which the missing segment starts 89 | missing_duration - the number of seconds of audio missing 90 | """ 91 | 92 | offset = 0 93 | if missing is not None: 94 | missing_time, missing_duration = missing 95 | if align_time > missing_time: 96 | offset = missing_duration 97 | 98 | target, _ = readwav(target_wavfile, 99 | 2.0 * search_duration + template_duration, 100 | align_time - search_duration + offset, 101 | channel=channel) 102 | template, sample_rate = readwav(template_wavfile, 103 | template_duration, 104 | align_time) 105 | result = cv2.matchTemplate(np.array(target, dtype=np.float32), 106 | np.array(template, dtype=np.float32), 107 | cv2.TM_CCOEFF_NORMED) 108 | lag = np.argmax(result)/sample_rate - search_duration + offset 109 | return lag, np.max(result) 110 | 111 | 112 | def align_channels(ref_fn, target_fn, analysis_times, 113 | search_duration, template_duration, missing): 114 | """Compute the alignment between a pair of channel. 115 | 116 | Arguments: 117 | ref_fn -- name of the reference binaural wav file 118 | target_fn -- name of the kinect channel wav file 119 | analysis_times -- a list of time points at which to estimate the delay 120 | search_duration -- the max delay to consider 121 | template_duration -- the length of the window over which to compute correlation 122 | missing -- for dealing with Kinect recordings where a large chunk of data has gone missing. 
def down_mix_lags(results):
    """Produce a single sequence of lag estimates from the L and R channel lag estimates."""
    # Choose the lag from the left or right channel depending on which has
    # the better correlation score.
    lagL = np.array(results['lagL'])
    lagR = np.array(results['lagR'])
    best = np.array(results['scoreL']) > np.array(results['scoreR'])
    return best * lagL + ~best * lagR


def clock_drift_linear_fit(results):
    """Compute the best linear time-lag fit for clock drift."""
    times = np.array(results['times'])
    lag = np.array(results['lag'])
    a, _, _, _ = np.linalg.lstsq(times[:, np.newaxis], lag)
    return a


def merge_results(results, new_results):
    """Merge two sets of time-lag results."""
    times = list(results['times']) + new_results['times']
    lags = list(results['lag']) + list(new_results['lag'])
    # Sort lags into time order
    results['times'], results['lag'] = zip(*sorted(zip(times, lags)))


def refine_kinect_lags(results, audiopath, session, target_chan, ref_chan):
    """Refine the alignment around big jumps in lag.

    The initial alignment is computed at 10 second intervals. If the alignment
    changes by a large amount (>50 ms) during a single 10 second step then the
    alignment is recomputed at a resolution of 1 second intervals.

    Arguments:
    results -- the alignment returned by align_channels()
    audiopath -- the directory containing the audio data
    session -- the name of the session to process (e.g. 'S10')
    target_chan -- the name of the kinect channel to process (e.g. 'U01')
    ref_chan -- the name of the reference binaural recorder (e.g. 'P34')

    Note: the function updates the contents of results in place rather than
    returning them explicitly.
    """
    threshold = 0.05
    search_duration = KINECT_SEARCH_DURATION
    template_duration = KINECT_TEMPLATE_DURATION
    chime_data = tu.chime_data()

    times = np.array(results['times'])
    lag = np.array(results['lag'])
    if len(times) != len(lag):
        # This happens for the one case where a kinect was turned off early
        # and 15 minutes of audio got lost
        print('WARNING: missing lags')
        times = times[:len(lag)]
    dlag = np.diff(lag)
    jump_times = times[1:][dlag > threshold]
    analysis_times = set()

    for time in jump_times:
        analysis_times |= set(range(time - 10, time + 10))
    analysis_times = list(analysis_times)
    print(len(analysis_times))

    if len(analysis_times) > 0:
        missing = None
        if ('missing' in chime_data[session] and
                target_chan in chime_data[session]['missing']):
            missing = chime_data[session]['missing'][target_chan]

        ref_fn = f'{audiopath}/{session}_{ref_chan}.wav'
        target_fn = f'{audiopath}/{session}_{target_chan}.CH1.wav'

        new_results = align_channels(ref_fn, target_fn, analysis_times,
                                     search_duration, template_duration, missing=missing)
        new_results['lag'] = down_mix_lags(new_results)
        merge_results(results, new_results)
def align_session(session, audiopath, outpath, chans=None):
    """Align all channels within a given session."""
    chime_data = tu.chime_data()

    # The first binaural recorder is taken as the reference
    ref_chan = chime_data[session]['pids'][0]

    # If chans is not specified then use all available channels
    if chans is None:
        pids = chime_data[session]['pids']
        kinects = chime_data[session]['kinects']
        chans = pids[1:] + kinects

    all_results = dict()  # Empty dictionary for storing results

    for target_chan in chans:
        print(target_chan)

        # For dealing with channels with big missing audio segments
        missing = None
        if ('missing' in chime_data[session] and
                target_chan in chime_data[session]['missing']):
            missing = chime_data[session]['missing'][target_chan]

        # Parameters for alignment depend on whether the target is
        # a binaural mic ('P') or a kinect mic
        if target_chan[0] == 'P':
            search_duration = BINAURAL_SEARCH_DURATION
            template_duration = BINAURAL_TEMPLATE_DURATION
            alignment_resolution = BINAURAL_RESOLUTION
            target_chan_name = target_chan
        else:
            search_duration = KINECT_SEARCH_DURATION
            template_duration = KINECT_TEMPLATE_DURATION
            alignment_resolution = KINECT_RESOLUTION
            target_chan_name = target_chan + '.CH1'

        # Placed in a try-except block so that processing can continue
        # if a channel fails. This shouldn't happen unless
        # there is some problem reading the audio data.
        try:
            offset = 0
            if missing is not None:
                _, offset = missing

            ref_fn = f'{audiopath}/{session}_{ref_chan}.wav'
            target_fn = f'{audiopath}/{session}_{target_chan_name}.wav'

            # The alignment offset will be analysed at regular intervals
            session_duration = int(min(wavfile_duration(ref_fn) - offset,
                                       wavfile_duration(target_fn))
                                   - template_duration - search_duration)
            analysis_times = range(alignment_resolution, session_duration, alignment_resolution)

            # Run the alignment code and store results in the dictionary
            all_results[target_chan] = \
                align_channels(ref_fn,
                               target_fn,
                               analysis_times,
                               search_duration,
                               template_duration,
                               missing=missing)
        except Exception:
            traceback.print_exc()

    pickle.dump(all_results, open(f'{outpath}/align.{session}.p', "wb"))
def refine_session(session, audiopath, inpath, outpath):
    """Refine the alignment of all channels within a given session."""
    chime_data = tu.chime_data()

    ref = chime_data[session]['pids'][0]
    pids = chime_data[session]['pids'][1:]
    kinects = chime_data[session]['kinects']
    all_results = pickle.load(open(f'{inpath}/align.{session}.p', "rb"))
    kinects = sorted(list(set(kinects).intersection(all_results.keys())))
    print(session)

    # Merge the results of the left and right channel alignments
    for channel in pids + kinects:
        results = all_results[channel]
        lag = down_mix_lags(results)
        results['lag'] = scipy.signal.medfilt(lag, 9)

    # Compute the linear fit for modelling clock drift
    for channel in pids:
        results = all_results[channel]
        results['linear_fit'] = clock_drift_linear_fit(results)

    # Refine kinect alignments - i.e. reanalyse on a finer time
    # scale in regions where big jumps in offset occur and
    # apply a bit of smoothing to remove spurious estimates
    for channel in kinects:
        results = all_results[channel]
        refine_kinect_lags(results, audiopath, session=session, target_chan=channel, ref_chan=ref)
        results['lag'] = scipy.signal.medfilt(results['lag'], 7)

    pickle.dump(all_results, open(f'{outpath}/align.{session}.p', "wb"))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions",
                        help="list of sessions to process (defaults to all)")
    parser.add_argument("--chans", help="list of channels to process (defaults to all)")
    parser.add_argument("--refine", help="path to output of 1st pass (runs a refinement pass if provided)")

    parser.add_argument("audiopath", help="path to audio data")
    parser.add_argument("outpath", help="path to output alignment data")
    args = parser.parse_args()

    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    if args.chans is None:
        chans = None
    else:
        chans = args.chans.split()

    if args.refine:
        # The alignment refinement pass.
        for session in sessions:
            refine_session(session, args.audiopath, args.refine, args.outpath)
    else:
        # The initial alignment pass.
        for session in sessions:
            print(session, chans)
            align_session(session, args.audiopath, args.outpath, chans=chans)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/plots/align_S03_P.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chimechallenge/chime5-synchronisation/4f52fbfacbd7918eea3a95b9eb7bda12bd955e6f/plots/align_S03_P.pdf

--------------------------------------------------------------------------------
/plots/align_S03_U.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chimechallenge/chime5-synchronisation/4f52fbfacbd7918eea3a95b9eb7bda12bd955e6f/plots/align_S03_U.pdf

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.13.3
opencv-python==3.3.0.10
scipy==1.0.1
matplotlib==2.2.2

--------------------------------------------------------------------------------
/transcript_utils.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# MIT License (https://opensource.org/licenses/MIT)

import json

CHIME5_JSON = 'chime5.json'  # Name of the CHiME5 json metadata file


def chime_data(sets_to_load=None):
    """Load CHiME corpus data for the specified sets, e.g. sets_to_load=['train', 'eval'].

    Defaults to ['train', 'dev'].
    """
    if sets_to_load is None:
        sets_to_load = ['train', 'dev']

    with open(CHIME5_JSON) as fh:
        data = json.load(fh)

    data = {k: v for (k, v) in data.items() if v['dataset'] in sets_to_load}

    return data


def time_text_to_float(time_string):
    """Convert a transcript time from text to float format."""
    hours, minutes, seconds = time_string.split(':')
    seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return seconds


def time_float_to_text(time_float):
    """Convert a transcript time from float to text format, e.g. 3725.5 -> '1:02:05.50'."""
    hours = int(time_float / 3600)
    time_float %= 3600
    minutes = int(time_float / 60)
    seconds = time_float % 60
    return f'{hours}:{minutes:02d}:{seconds:05.2f}'


def load_transcript(session, root, convert=False):
    """Load final merged transcripts.

    session: recording session name, e.g. 'S12'
    """
    with open(f'{root}/{session}.json') as f:
        transcript = json.load(f)
    if convert:
        for item in transcript:
            for key in item['start_time']:
                item['start_time'][key] = time_text_to_float(item['start_time'][key])
            for key in item['end_time']:
                item['end_time'][key] = time_text_to_float(item['end_time'][key])
    return transcript


def save_transcript(transcript, session, root, convert=False):
    """Save transcripts to a json file."""

    # Need to make a deep copy so the time-to-string conversions only happen locally
    transcript_copy = [element.copy() for element in transcript]

    if convert:
        for item in transcript_copy:
            for key in item['start_time']:
                item['start_time'][key] = time_float_to_text(
                    item['start_time'][key])
            for key in item['end_time']:
                item['end_time'][key] = time_float_to_text(
                    item['end_time'][key])

    with open(f'{root}/{session}.json', 'w') as outfile:
        json.dump(transcript_copy, outfile, indent=4)
--------------------------------------------------------------------------------
/view_alignments.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# MIT License (https://opensource.org/licenses/MIT)

import argparse
import pickle
import traceback

import matplotlib.pyplot as plt

import transcript_utils as tu


def make_plots(data, device_type, layout, ylim):
    """Plot the lag-vs-time curves for all channels of a given device type ('P' or 'U')."""
    plt.figure()
    plot_keys = [key for key in data.keys() if device_type in key]
    for index, key in enumerate(plot_keys):
        chan_data = data[key]
        times = chan_data['times']
        plt.subplot(*layout, index + 1)
        if 'lag' in chan_data:
            plt.plot(times, chan_data['lag'], '-')
        else:
            plt.plot(times, chan_data['lagL'])
            plt.plot(times, chan_data['lagR'])
        plt.ylim(ylim)
        plt.title(key)
    plt.gcf().tight_layout()


def plot_session(session, path, show_plot=True, save_dir=None):
    """Plot figures for a single session."""
    name = f'{path}/align.{session}.p'
    data = pickle.load(open(name, 'rb'))

    device_type = 'U'
    make_plots(data, device_type, (2, 3), ylim=(0, 1))
    if save_dir is not None:
        plt.savefig(f'{save_dir}/align_{session}_{device_type}.pdf')

    device_type = 'P'
    make_plots(data, device_type, (1, 3), ylim=(-0.1, 0.1))
    if save_dir is not None:
        plt.savefig(f'{save_dir}/align_{session}_{device_type}.pdf')

    if show_plot:
        plt.show()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions", help="list of sessions to process (defaults to all)")
    parser.add_argument("--save", help="path of directory in which to save plots")
    parser.add_argument("--no_plot", action='store_true', help="suppress display of plot (defaults to false)")
    parser.add_argument("path", help="path to alignment data")
    args = parser.parse_args()
    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    for session in sessions:
        print(session)
        try:
            plot_session(session, args.path, not args.no_plot, args.save)
        except Exception:
            traceback.print_exc()


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------