├── LICENSE.txt
├── README.md
├── align.sh
├── align_transcription.py
├── chime5.json
├── estimate_alignment.py
├── plots
│   ├── align_S03_P.pdf
│   └── align_S03_U.pdf
├── requirements.txt
├── transcript_utils.py
└── view_alignments.py

/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 University of Sheffield (Jon Barker)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# CHiME-5 Array Synchronisation Baseline

## About

This repository contains the code used to obtain the baseline device synchronisation for the CHiME-5 challenge. See the [challenge website](http://spandh.dcs.shef.ac.uk/chime_challenge/) for further details.

CHiME-5 employs audio recordings made simultaneously by a number of different devices. The start of each device recording has been synchronised by aligning the onset of a synchronisation tone played at the beginning of each recording session. However, the signals become progressively out of synch due to a combination of clock drift (on all devices) and occasional frame-dropping on the Kinects.

This misalignment problem is solved by performing an analysis to align the signals and then using this alignment to produce the device-dependent utterance start and end times that appear in the transcripts.

This is performed in two steps:

`estimate_alignment.py`: Estimation of a 'recording time' to 'signal delay' mapping between a reference binaural recorder and all other devices. The delay between a pair of channels is estimated at regular intervals throughout the party by locating the peak in a cross-correlation between windows of each signal. Estimation is performed in two passes: first at 10 second intervals, and then at 1 second intervals during periods of rapid change. Estimates are fitted with a linear regression when comparing binaural mics (no frame dropping) and are smoothed with a median filter when comparing the binaural mic to Kinect recordings.
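In essence, each delay estimate is a normalised cross-correlation peak search: a short window from the device being aligned is slid over a slightly wider window of the reference recording. The sketch below is a NumPy-only illustration of the idea, with an invented function name and arguments — the actual implementation reads the windows straight from the wav files and uses `cv2.matchTemplate` with `TM_CCOEFF_NORMED`:

```
import numpy as np

def estimate_lag(ref, other, sample_rate, t, search=5.0, template=20.0):
    """Toy lag estimate at reference time t (seconds, assumes t >= search).

    Slides a `template`-second window of `other` over a wider window of
    `ref` and returns the offset (in seconds) with the highest normalised
    correlation. O(N*M), so for illustration only.
    """
    tmpl = other[int(t * sample_rate):int((t + template) * sample_rate)]
    win = ref[int((t - search) * sample_rate):
              int((t + template + search) * sample_rate)]
    tmpl = (tmpl - tmpl.mean()) / (tmpl.std() + 1e-9)
    scores = [np.dot(tmpl, (seg - seg.mean()) / (seg.std() + 1e-9))
              for seg in (win[i:i + tmpl.size]
                          for i in range(win.size - tmpl.size + 1))]
    return np.argmax(scores) / sample_rate - search  # 0.0 == already aligned
```

In the real code the lag is estimated against both the left and right channels of the binaural reference, and whichever channel gives the higher correlation score is kept.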
`align_transcription.py`: The transcript files are augmented with the device-dependent utterance timings. The original binaural recorder transcription times are first mapped onto the reference binaural recorder, and then from the reference recorder onto each of the Kinects.

## Installation

1. Edit `align.sh` to set the variable `chime5_corpus` to point to your CHiME-5 installation.

2. Install the Python module requirements using

```
pip install -r requirements.txt
```

## Usage

A script `align.sh` has been provided to show how the tools were used to compute the device-dependent times appearing in the CHiME-5 challenge transcription files. The script processes session S03 but can easily be edited to process other sessions.

The script runs both alignment passes (the initial alignment and the refinement). It then applies the alignment to the transcription file to generate the corrected device-dependent timings (which should match the timings already in the distributed files). It finishes by displaying plots of the alignments for each channel. These should look like the following:

Binaural recorders: [plots/align_S03_P.pdf](plots/align_S03_P.pdf)

Kinect devices: [plots/align_S03_U.pdf](plots/align_S03_U.pdf)

#### Notes

- Python 3.6 or later is required.

- The output transcriptions appear in the directory `transcriptions_aligned`.

- The script was initially run on the original CHiME-5 audio recordings before any redactions were made. For this reason, when it is re-run on the distributed version of the CHiME-5 data it may produce some utterance timings that differ from those appearing in the distributed transcripts.

- It is necessary to run both passes of the alignment process before running `align_transcription.py`. Skipping the refinement stage will lead to an error.
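#### Output format

For orientation, each utterance in the aligned transcripts ends up with one start and end time per device alongside the original annotation times. The entry below is purely illustrative (invented times, only one Kinect shown, and other fields such as the transcribed words omitted):

```
{
    "speaker": "P09",
    "start_time": {
        "original": "0:32:10.22",
        "P09": "0:32:10.22",
        "U01": "0:32:10.47"
    },
    "end_time": {
        "original": "0:32:12.85",
        "P09": "0:32:12.85",
        "U01": "0:32:13.10"
    }
}
```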
-d "$aligned_dir" ] && mkdir "$aligned_dir" 29 | 30 | # Perform the course alignment pass (slowest step) 31 | python3 estimate_alignment.py --sessions "$session" "$audio_dir"/"$dataset" "$align_data"/first_pass 32 | 33 | # Peform the alignment refinement pass 34 | python3 estimate_alignment.py --refine "$align_data"/first_pass --sessions "$session" "$audio_dir"/"$dataset" "$align_data"/refined 35 | 36 | # Apply the alignment to the transcript file 37 | python3 align_transcription.py --sessions "$session" "$align_data"/refined "$chime5_corpus"/transcriptions/"$dataset" "$aligned_dir" 38 | 39 | # View the alignment 40 | python3 view_alignments.py --sessions "$session" "$align_data"/refined 41 | -------------------------------------------------------------------------------- /align_transcription.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2018 University of Sheffield (Jon Barker) 4 | # Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) 5 | 6 | """ 7 | align_transcription.py 8 | 9 | Apply the alignments to the transcription file 10 | """ 11 | 12 | import pickle 13 | import argparse 14 | import traceback 15 | import numpy as np 16 | 17 | import transcript_utils as tu 18 | 19 | CHIME_DATA = tu.chime_data() 20 | 21 | 22 | def correct_time_linear(time_to_correct, linear_fit): 23 | """Adjust the time using a linear fit of time to lag.""" 24 | corrected = time_to_correct - linear_fit * time_to_correct 25 | return corrected 26 | 27 | 28 | def correct_time_mapping(time_to_correct, linear_fit, times, lags): 29 | """Adjust the time using a linear fit + a mapping from time to lag.""" 30 | corrected = np.interp(time_to_correct + linear_fit * time_to_correct, 31 | np.array(times) + lags, np.array(times)) 32 | return corrected 33 | 34 | 35 | def align_kinect(kinect, transcription, align_data): 36 | """Add alignments for a given kinect.""" 37 | last_lag = align_data[kinect]['lag'][-1] 38 | last_time = align_data[kinect]['times'][-1] 39 | # Plus 100 in next line just to ensure that there is a lag 40 | # recorded beyond end of party duration. This is needed to 41 | # make sure that offset is correctly interpolated for 42 | # utterances in last few seconds of party given that 43 | # lags are only estimated every 10 seconds. 
--------------------------------------------------------------------------------
/align_transcription.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

"""
align_transcription.py

Apply the alignments to the transcription files.
"""

import argparse
import pickle
import traceback

import numpy as np

import transcript_utils as tu

CHIME_DATA = tu.chime_data()


def correct_time_linear(time_to_correct, linear_fit):
    """Adjust the time using a linear fit of time to lag."""
    corrected = time_to_correct - linear_fit * time_to_correct
    return corrected


def correct_time_mapping(time_to_correct, linear_fit, times, lags):
    """Adjust the time using a linear fit plus a mapping from time to lag."""
    corrected = np.interp(time_to_correct + linear_fit * time_to_correct,
                          np.array(times) + lags, np.array(times))
    return corrected


def align_kinect(kinect, transcription, align_data):
    """Add alignments for a given kinect."""
    last_lag = align_data[kinect]['lag'][-1]
    last_time = align_data[kinect]['times'][-1]
    # The 100 in the next line just ensures that there is a lag
    # recorded beyond the end of the party. This is needed to
    # make sure that the offset is correctly interpolated for
    # utterances in the last few seconds of the party, given that
    # lags are only estimated every 10 seconds.
    times = [0] + list(align_data[kinect]['times']) + [last_time + 100]
    lags = [0] + list(align_data[kinect]['lag']) + [last_lag]
    for utterance in transcription:
        if 'speaker' not in utterance:  # Skip redacted segments
            continue
        linear_fit = 0
        pid = utterance['speaker']
        if pid in align_data and 'linear_fit' in align_data[pid]:
            linear_fit = align_data[pid]['linear_fit'][0]
        utterance['start_time'][kinect] = correct_time_mapping(utterance['start_time']['original'],
                                                               linear_fit, times, lags)
        utterance['end_time'][kinect] = correct_time_mapping(utterance['end_time']['original'],
                                                             linear_fit, times, lags)


def align_participant(pid, transcription, align_data):
    """Add alignments for a given binaural participant recording."""
    for utterance in transcription:
        if 'speaker' not in utterance:  # Skip redacted segments
            continue
        utterance_pid = utterance['speaker']

        to_ref_linear = 0
        from_ref_linear = 0
        if utterance_pid in align_data:
            to_ref_linear = -align_data[utterance_pid]['linear_fit'][0]
        if pid in align_data:
            from_ref_linear = align_data[pid]['linear_fit'][0]
        linear_fit = to_ref_linear + from_ref_linear

        utterance['start_time'][pid] = correct_time_linear(
            utterance['start_time']['original'], linear_fit)
        utterance['end_time'][pid] = correct_time_linear(
            utterance['end_time']['original'], linear_fit)


def align_transcription(session, align_data_path, in_path, out_path):
    """Apply the kinect and binaural alignments to a session's transcription file."""
    align_data = pickle.load(open(f'{align_data_path}/align.{session}.p', 'rb'))
    transcription = tu.load_transcript(session, in_path, convert=True)

    # Compute alignments for the kinects
    kinects = CHIME_DATA[session]['kinects']
    for kinect in kinects:
        print(kinect)
        align_kinect(kinect, transcription, align_data)

    # Compute alignments for the TASCAM binaural mics
    if CHIME_DATA[session]['dataset'] in ['train', 'dev']:
        pids = CHIME_DATA[session]['pids']
        for pid in pids:
            print(pid)
            align_participant(pid, transcription, align_data)

    tu.save_transcript(transcription, session, out_path, convert=True)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions",
                        help="list of sessions to process (defaults to all)")
    parser.add_argument("align_path", help="path for the alignment pickle files")
    parser.add_argument("in_path", help="path for the input transcription files")
    parser.add_argument("out_path", help="path for the output transcription files")
    args = parser.parse_args()
    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    for session in sessions:
        try:
            print(session)
            align_transcription(session, args.align_path, args.in_path, args.out_path)
        except Exception:
            traceback.print_exc()


if __name__ == '__main__':
    main()
"P09", 23 | "P10", 24 | "P11", 25 | "P12" 26 | ], 27 | "kinects": [ 28 | "U01", 29 | "U02", 30 | "U03", 31 | "U04", 32 | "U05", 33 | "U06" 34 | ] 35 | }, 36 | "S04": { 37 | "dataset": "train", 38 | "pids": [ 39 | "P09", 40 | "P10", 41 | "P11", 42 | "P12" 43 | ], 44 | "kinects": [ 45 | "U01", 46 | "U02", 47 | "U03", 48 | "U04", 49 | "U05", 50 | "U06" 51 | ] 52 | }, 53 | "S05": { 54 | "dataset": "train", 55 | "pids": [ 56 | "P13", 57 | "P14", 58 | "P15", 59 | "P16" 60 | ], 61 | "kinects": [ 62 | "U01", 63 | "U02", 64 | "U04", 65 | "U05", 66 | "U06" 67 | ], 68 | "missing": { 69 | "U04": [ 70 | 4257.78, 71 | 282.50 72 | ] 73 | } 74 | }, 75 | "S06": { 76 | "dataset": "train", 77 | "pids": [ 78 | "P13", 79 | "P14", 80 | "P15", 81 | "P16" 82 | ], 83 | "kinects": [ 84 | "U01", 85 | "U02", 86 | "U03", 87 | "U04", 88 | "U05", 89 | "U06" 90 | ] 91 | }, 92 | "S07": { 93 | "dataset": "train", 94 | "pids": [ 95 | "P17", 96 | "P18", 97 | "P19", 98 | "P20" 99 | ], 100 | "kinects": [ 101 | "U01", 102 | "U02", 103 | "U03", 104 | "U04", 105 | "U05", 106 | "U06" 107 | ] 108 | }, 109 | "S08": { 110 | "dataset": "train", 111 | "pids": [ 112 | "P21", 113 | "P22", 114 | "P23", 115 | "P24" 116 | ], 117 | "kinects": [ 118 | "U01", 119 | "U02", 120 | "U03", 121 | "U04", 122 | "U05", 123 | "U06" 124 | ] 125 | }, 126 | "S09": { 127 | "dataset": "dev", 128 | "pids": [ 129 | "P25", 130 | "P26", 131 | "P27", 132 | "P28" 133 | ], 134 | "kinects": [ 135 | "U01", 136 | "U02", 137 | "U03", 138 | "U04", 139 | "U06" 140 | ] 141 | }, 142 | "S12": { 143 | "dataset": "train", 144 | "pids": [ 145 | "P33", 146 | "P34", 147 | "P35", 148 | "P36" 149 | ], 150 | "kinects": [ 151 | "U01", 152 | "U02", 153 | "U03", 154 | "U04", 155 | "U05", 156 | "U06" 157 | ] 158 | }, 159 | "S13": { 160 | "dataset": "train", 161 | "pids": [ 162 | "P33", 163 | "P34", 164 | "P35", 165 | "P36" 166 | ], 167 | "kinects": [ 168 | "U01", 169 | "U02", 170 | "U03", 171 | "U04", 172 | "U05", 173 | "U06" 174 | ] 175 | }, 176 | "S16": { 177 | "dataset": "train", 178 | "pids": [ 179 | "P21", 180 | "P22", 181 | "P23", 182 | "P24" 183 | ], 184 | "kinects": [ 185 | "U01", 186 | "U02", 187 | "U03", 188 | "U04", 189 | "U05", 190 | "U06" 191 | ] 192 | }, 193 | "S17": { 194 | "dataset": "train", 195 | "pids": [ 196 | "P17", 197 | "P18", 198 | "P19", 199 | "P20" 200 | ], 201 | "kinects": [ 202 | "U01", 203 | "U02", 204 | "U03", 205 | "U04", 206 | "U05", 207 | "U06" 208 | ] 209 | }, 210 | "S18": { 211 | "dataset": "train", 212 | "pids": [ 213 | "P41", 214 | "P42", 215 | "P43", 216 | "P44" 217 | ], 218 | "kinects": [ 219 | "U01", 220 | "U02", 221 | "U03", 222 | "U04", 223 | "U05", 224 | "U06" 225 | ], 226 | "missing": { 227 | "U06": [2659.92, 66.71] 228 | } 229 | }, 230 | "S19": { 231 | "dataset": "train", 232 | "pids": [ 233 | "P49", 234 | "P50", 235 | "P51", 236 | "P52" 237 | ], 238 | "kinects": [ 239 | "U01", 240 | "U02", 241 | "U03", 242 | "U04", 243 | "U05", 244 | "U06" 245 | ] 246 | }, 247 | "S20": { 248 | "dataset": "train", 249 | "pids": [ 250 | "P49", 251 | "P50", 252 | "P51", 253 | "P52" 254 | ], 255 | "kinects": [ 256 | "U01", 257 | "U02", 258 | "U03", 259 | "U04", 260 | "U05", 261 | "U06" 262 | ] 263 | }, 264 | "S21": { 265 | "dataset": "eval", 266 | "pids": [ 267 | "P45", 268 | "P46", 269 | "P47", 270 | "P48" 271 | ], 272 | "kinects": [ 273 | "U01", 274 | "U02", 275 | "U03", 276 | "U04", 277 | "U05", 278 | "U06" 279 | ] 280 | }, 281 | "S22": { 282 | "dataset": "train", 283 | "pids": [ 284 | "P41", 285 | "P42", 286 | "P43", 287 | "P44" 288 | ], 289 | "kinects": [ 
290 | "U01", 291 | "U02", 292 | "U04", 293 | "U05", 294 | "U06" 295 | ] 296 | }, 297 | "S23": { 298 | "dataset": "train", 299 | "pids": [ 300 | "P53", 301 | "P54", 302 | "P55", 303 | "P56" 304 | ], 305 | "kinects": [ 306 | "U01", 307 | "U02", 308 | "U03", 309 | "U04", 310 | "U05", 311 | "U06" 312 | ] 313 | }, 314 | "S24": { 315 | "dataset": "train", 316 | "pids": [ 317 | "P53", 318 | "P54", 319 | "P55", 320 | "P56" 321 | ], 322 | "kinects": [ 323 | "U01", 324 | "U02", 325 | "U03", 326 | "U04", 327 | "U05", 328 | "U06" 329 | ], 330 | "missing": { 331 | "U06": [ 332 | 5527.40, 333 | 8.60 334 | ] 335 | } 336 | } 337 | } 338 | -------------------------------------------------------------------------------- /estimate_alignment.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright 2018 University of Sheffield (Jon Barker) 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | """ 7 | python3 estimate_alignment.py 8 | 9 | Estimate an alignment between the reference binaural recording channel and all other audio 10 | channels in a session. An alignment is a series of (time, lag) pairs where the 'lag' between 11 | the two channels is measured at a given 'time' point within the reference signal. Alignment 12 | are initialially computed at 10 second intervals. In periods where the lag changes rapidly 13 | the alignment is recomputed at 1 second intervals. 14 | 15 | The alignment at a time point is estimated by looking for the offset that maximises a 16 | cross correlation between the two signals being aligned. 17 | 18 | The alignments are stored in pickle files called align.{session}.p. They are then read 19 | and applied to the JSON transcript files by align_transcription.py 20 | """ 21 | 22 | import argparse 23 | import cv2 24 | import numpy as np 25 | import pickle 26 | import scipy.signal 27 | import struct 28 | import sys 29 | import traceback 30 | import wave 31 | 32 | import transcript_utils as tu 33 | 34 | # For Bin-to-bin 35 | BINAURAL_RESOLUTION = 100 # analysis resolution (in seconds) 36 | BINAURAL_SEARCH_DURATION = 0.5 # Max misalignment (in seconds) 37 | BINAURAL_TEMPLATE_DURATION = 20 # Duration of segment that is matched (in seconds) 38 | 39 | # # For Bin-to-Kinect 40 | KINECT_RESOLUTION = 10 # analysis resolution (in seconds) 41 | # KINECT_SEARCH_DURATION = 3 # Max misalignment (in seconds) 42 | KINECT_SEARCH_DURATION = 5 # Max misalignment (in seconds) 43 | KINECT_TEMPLATE_DURATION = 20 # Duration of segment that is matched (in seconds) 44 | 45 | 46 | def wavfile_duration(wavfile): 47 | """Return the duration of a wav file in seconds.""" 48 | wave_fp = wave.open(wavfile, 'rb') 49 | sample_rate = wave_fp.getframerate() 50 | nframes = wave_fp.getnframes() 51 | return nframes/sample_rate 52 | 53 | 54 | def readwav(wavfile, duration, start_time=0, channel=0): 55 | """Read a segment of a wav file. 56 | 57 | If stereo then can select channel 0 or channel 1. 
58 | """ 59 | wave_fp = wave.open(wavfile, 'rb') 60 | sample_rate = wave_fp.getframerate() 61 | nchannels = wave_fp.getnchannels() 62 | nsamples = int(duration * sample_rate) 63 | wave_fp.setpos(int(start_time * sample_rate)) 64 | wavbytes = wave_fp.readframes(nsamples) 65 | signal = struct.unpack(f'{nsamples * nchannels}h', wavbytes) 66 | if nchannels == 2: 67 | signal = signal[channel::2] 68 | return signal, sample_rate 69 | 70 | 71 | def find_align(target_wavfile, template_wavfile, 72 | align_time, search_duration, template_duration, channel=0, missing=None): 73 | """Find lag between a pair of signals by correlating a short segment of the template 74 | signal (from one of the Kinects) against the target signal (the reference binaural recorder) 75 | 76 | This is the low level signal matching code called by the align_channel() function. It uses 77 | cv2.matchTemplate to do the work 78 | 79 | Arguments: 80 | target_wavfile -- name of the target signal wav file 81 | template_wavfile -- name of the template signal wav file 82 | align_time -- the time within the signals at which to measure the lag 83 | search_duration -- the max delay to consider 84 | template_duration -- length of window to use when computing correlations 85 | channel -- either use left (0) or right (1) stereo channel of the binaural recorder 86 | missing -- for dealing with Kinect recordings where a large chunk of data has gone missing. 87 | missing = (missing_time, missing_duration) 88 | missing_time - the time at which the missing segment starts 89 | missing_duration - the number of seconds of audio missing 90 | """ 91 | 92 | offset = 0 93 | if missing is not None: 94 | missing_time, missing_duration = missing 95 | if align_time > missing_time: 96 | offset = missing_duration 97 | 98 | target, _ = readwav(target_wavfile, 99 | 2.0 * search_duration + template_duration, 100 | align_time - search_duration + offset, 101 | channel=channel) 102 | template, sample_rate = readwav(template_wavfile, 103 | template_duration, 104 | align_time) 105 | result = cv2.matchTemplate(np.array(target, dtype=np.float32), 106 | np.array(template, dtype=np.float32), 107 | cv2.TM_CCOEFF_NORMED) 108 | lag = np.argmax(result)/sample_rate - search_duration + offset 109 | return lag, np.max(result) 110 | 111 | 112 | def align_channels(ref_fn, target_fn, analysis_times, 113 | search_duration, template_duration, missing): 114 | """Compute the alignment between a pair of channel. 115 | 116 | Arguments: 117 | ref_fn -- name of the reference binaural wav file 118 | target_fn -- name of the kinect channel wav file 119 | analysis_times -- a list of time points at which to estimate the delay 120 | search_duration -- the max delay to consider 121 | template_duration -- the length of the window over which to compute correlation 122 | missing -- for dealing with Kinect recordings where a large chunk of data has gone missing. 
def down_mix_lags(results):
    """Produce a single sequence of lag estimates from the L and R channel lag estimates."""
    # Choose the lag from the left or right channel depending on which has
    # the better correlation score.
    lagL = np.array(results['lagL'])
    lagR = np.array(results['lagR'])
    best = np.array(results['scoreL']) > np.array(results['scoreR'])
    return best * lagL + ~best * lagR


def clock_drift_linear_fit(results):
    """Compute the best linear time-lag fit for clock drift."""
    times = np.array(results['times'])
    lag = np.array(results['lag'])
    a, _, _, _ = np.linalg.lstsq(times[:, np.newaxis], lag)
    return a


def merge_results(results, new_results):
    """Merge two sets of time-lag results."""
    times = list(results['times']) + new_results['times']
    lags = list(results['lag']) + list(new_results['lag'])
    # Sort lags into time order
    results['times'], results['lag'] = zip(*sorted(zip(times, lags)))


def refine_kinect_lags(results, audiopath, session, target_chan, ref_chan):
    """Refine the alignment around big jumps in lag.

    The initial alignment is computed at 10 second intervals. If the alignment
    changes by a large amount (>50 ms) during a single 10 second step then the
    alignment is recomputed at a resolution of 1 second intervals.

    Arguments:
    results -- the alignment returned by align_channels()
    audiopath -- the directory containing the audio data
    session -- the name of the session to process (e.g. 'S10')
    target_chan -- the name of the kinect channel to process (e.g. 'U01')
    ref_chan -- the name of the reference binaural recorder (e.g. 'P34')

    Note: the function updates the contents of results in place rather than
    returning them explicitly.
    """
    threshold = 0.05
    search_duration = KINECT_SEARCH_DURATION
    template_duration = KINECT_TEMPLATE_DURATION
    chime_data = tu.chime_data()

    times = np.array(results['times'])
    lag = np.array(results['lag'])
    if len(times) != len(lag):
        # This happens for the one case where a kinect was turned off early
        # and 15 minutes of audio got lost
        print('WARNING: missing lags')
        times = times[:len(lag)]
    dlag = np.diff(lag)
    jump_times = times[1:][dlag > threshold]
    analysis_times = set()

    for time in jump_times:
        analysis_times |= set(range(time - 10, time + 10))
    analysis_times = list(analysis_times)
    print(len(analysis_times))

    if len(analysis_times) > 0:
        missing = None
        if ('missing' in chime_data[session] and
                target_chan in chime_data[session]['missing']):
            missing = chime_data[session]['missing'][target_chan]

        ref_fn = f'{audiopath}/{session}_{ref_chan}.wav'
        target_fn = f'{audiopath}/{session}_{target_chan}.CH1.wav'

        new_results = align_channels(ref_fn, target_fn, analysis_times,
                                     search_duration, template_duration, missing=missing)
        new_results['lag'] = down_mix_lags(new_results)
        merge_results(results, new_results)
def align_session(session, audiopath, outpath, chans=None):
    """Align all channels within a given session."""
    chime_data = tu.chime_data()

    # The first binaural recorder is taken as the reference
    ref_chan = chime_data[session]['pids'][0]

    # If chans is not specified then use all available channels
    if chans is None:
        pids = chime_data[session]['pids']
        kinects = chime_data[session]['kinects']
        chans = pids[1:] + kinects

    all_results = dict()  # Empty dictionary for storing results

    for target_chan in chans:
        print(target_chan)

        # For dealing with channels with big missing audio segments
        missing = None
        if ('missing' in chime_data[session] and
                target_chan in chime_data[session]['missing']):
            missing = chime_data[session]['missing'][target_chan]

        # Parameters for alignment depend on whether the target is
        # a binaural mic ('P') or a kinect mic
        if target_chan[0] == 'P':
            search_duration = BINAURAL_SEARCH_DURATION
            template_duration = BINAURAL_TEMPLATE_DURATION
            alignment_resolution = BINAURAL_RESOLUTION
            target_chan_name = target_chan
        else:
            search_duration = KINECT_SEARCH_DURATION
            template_duration = KINECT_TEMPLATE_DURATION
            alignment_resolution = KINECT_RESOLUTION
            target_chan_name = target_chan + '.CH1'

        # Placed in a try-except block so that processing can continue
        # if a channel fails. This shouldn't happen unless
        # there is some problem reading the audio data.
        try:
            offset = 0
            if missing is not None:
                _, offset = missing

            ref_fn = f'{audiopath}/{session}_{ref_chan}.wav'
            target_fn = f'{audiopath}/{session}_{target_chan_name}.wav'

            # The alignment offset will be analysed at regular intervals
            session_duration = int(min(wavfile_duration(ref_fn) - offset,
                                       wavfile_duration(target_fn))
                                   - template_duration - search_duration)
            analysis_times = range(alignment_resolution, session_duration, alignment_resolution)

            # Run the alignment code and store results in the dictionary
            all_results[target_chan] = \
                align_channels(ref_fn,
                               target_fn,
                               analysis_times,
                               search_duration,
                               template_duration,
                               missing=missing)
        except Exception:
            traceback.print_exc()

    pickle.dump(all_results, open(f'{outpath}/align.{session}.p', "wb"))
def refine_session(session, audiopath, inpath, outpath):
    """Refine the alignment of all channels within a given session."""
    chime_data = tu.chime_data()

    ref = chime_data[session]['pids'][0]
    pids = chime_data[session]['pids'][1:]
    kinects = chime_data[session]['kinects']
    all_results = pickle.load(open(f'{inpath}/align.{session}.p', "rb"))
    kinects = sorted(list(set(kinects).intersection(all_results.keys())))
    print(session)

    # Merge the results of the left and right channel alignments
    for channel in pids + kinects:
        results = all_results[channel]
        lag = down_mix_lags(results)
        results['lag'] = scipy.signal.medfilt(lag, 9)

    # Compute the linear fit for modelling clock drift
    for channel in pids:
        results = all_results[channel]
        results['linear_fit'] = clock_drift_linear_fit(results)

    # Refine kinect alignments - i.e. reanalyse on a finer time
    # scale in regions where big jumps in offset occur and
    # apply a bit of smoothing to remove spurious estimates
    for channel in kinects:
        results = all_results[channel]
        refine_kinect_lags(results, audiopath, session=session, target_chan=channel, ref_chan=ref)
        results['lag'] = scipy.signal.medfilt(results['lag'], 7)

    pickle.dump(all_results, open(f'{outpath}/align.{session}.p', "wb"))


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions",
                        help="list of sessions to process (defaults to all)")
    parser.add_argument("--chans", help="list of channels to process (defaults to all)")
    parser.add_argument("--refine", help="path to output of 1st pass (runs a refinement pass if provided)")

    parser.add_argument("audiopath", help="path to audio data")
    parser.add_argument("outpath", help="path to output alignment data")
    args = parser.parse_args()

    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    if args.chans is None:
        chans = None
    else:
        chans = args.chans.split()

    if args.refine:
        # The alignment refinement pass.
        for session in sessions:
            refine_session(session, args.audiopath, args.refine, args.outpath)
    else:
        # The initial alignment pass.
        for session in sessions:
            print(session, chans)
            align_session(session, args.audiopath, args.outpath, chans=chans)


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/plots/align_S03_P.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chimechallenge/chime5-synchronisation/4f52fbfacbd7918eea3a95b9eb7bda12bd955e6f/plots/align_S03_P.pdf

--------------------------------------------------------------------------------
/plots/align_S03_U.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chimechallenge/chime5-synchronisation/4f52fbfacbd7918eea3a95b9eb7bda12bd955e6f/plots/align_S03_U.pdf

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
numpy==1.13.3
opencv-python==3.3.0.10
scipy==1.0.1
matplotlib==2.2.2

--------------------------------------------------------------------------------
/transcript_utils.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# MIT License (https://opensource.org/licenses/MIT)

import json

CHIME5_JSON = 'chime5.json'  # Name of the CHiME5 json metadata file


def chime_data(sets_to_load=None):
    """Load CHiME corpus data for the specified sets, e.g. sets_to_load=['train', 'eval'].

    Defaults to ['train', 'dev'].
    """
    if sets_to_load is None:
        sets_to_load = ['train', 'dev']

    with open(CHIME5_JSON) as fh:
        data = json.load(fh)

    data = {k: v for (k, v) in data.items() if v['dataset'] in sets_to_load}

    return data


def time_text_to_float(time_string):
    """Convert a transcript time from text to float format."""
    hours, minutes, seconds = time_string.split(':')
    seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return seconds


def time_float_to_text(time_float):
    """Convert a transcript time from float to text format, e.g. 3725.5 -> '1:02:05.50'."""
    hours = int(time_float / 3600)
    time_float %= 3600
    minutes = int(time_float / 60)
    seconds = time_float % 60
    return f'{hours}:{minutes:02d}:{seconds:05.2f}'


def load_transcript(session, root, convert=False):
    """Load final merged transcripts.

    session: recording session name, e.g. 'S12'
    """
    with open(f'{root}/{session}.json') as f:
        transcript = json.load(f)
    if convert:
        for item in transcript:
            for key in item['start_time']:
                item['start_time'][key] = time_text_to_float(item['start_time'][key])
            for key in item['end_time']:
                item['end_time'][key] = time_text_to_float(item['end_time'][key])
    return transcript


def save_transcript(transcript, session, root, convert=False):
    """Save transcripts to a json file."""

    # Need to make a deep copy so the time-to-string conversions only happen locally
    transcript_copy = [element.copy() for element in transcript]

    if convert:
        for item in transcript_copy:
            for key in item['start_time']:
                item['start_time'][key] = time_float_to_text(
                    item['start_time'][key])
            for key in item['end_time']:
                item['end_time'][key] = time_float_to_text(
                    item['end_time'][key])

    with open(f'{root}/{session}.json', 'w') as outfile:
        json.dump(transcript_copy, outfile, indent=4)
--------------------------------------------------------------------------------
/view_alignments.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

# Copyright 2018 University of Sheffield (Jon Barker)
# MIT License (https://opensource.org/licenses/MIT)

import argparse
import pickle
import traceback

import matplotlib.pyplot as plt

import transcript_utils as tu


def make_plots(data, device_type, layout, ylim):
    """Plot the lag-vs-time curves for all channels of a given device type ('P' or 'U')."""
    plt.figure()
    plot_keys = [key for key in data.keys() if device_type in key]
    for index, key in enumerate(plot_keys):
        chan_data = data[key]
        times = chan_data['times']
        plt.subplot(*layout, index + 1)
        if 'lag' in chan_data:
            plt.plot(times, chan_data['lag'], '-')
        else:
            plt.plot(times, chan_data['lagL'])
            plt.plot(times, chan_data['lagR'])
        plt.ylim(ylim)
        plt.title(key)
    plt.gcf().tight_layout()


def plot_session(session, path, show_plot=True, save_dir=None):
    """Plot figures for a single session."""
    name = f'{path}/align.{session}.p'
    data = pickle.load(open(name, 'rb'))

    device_type = 'U'
    make_plots(data, device_type, (2, 3), ylim=(0, 1))
    if save_dir is not None:
        plt.savefig(f'{save_dir}/align_{session}_{device_type}.pdf')

    device_type = 'P'
    make_plots(data, device_type, (1, 3), ylim=(-0.1, 0.1))
    if save_dir is not None:
        plt.savefig(f'{save_dir}/align_{session}_{device_type}.pdf')

    if show_plot:
        plt.show()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--sessions", help="list of sessions to process (defaults to all)")
    parser.add_argument("--save", help="path of directory in which to save plots")
    parser.add_argument("--no_plot", action='store_true', help="suppress display of plot (defaults to false)")
    parser.add_argument("path", help="path to alignment data")
    args = parser.parse_args()
    if args.sessions is None:
        sessions = tu.chime_data()
    else:
        sessions = args.sessions.split()

    for session in sessions:
        print(session)
        try:
            plot_session(session, args.path, not args.no_plot, args.save)
        except Exception:
            traceback.print_exc()


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------