├── .gitignore
├── Dockerfile
├── README.md
├── __init__.py
├── decode_model.py
├── install_model.sh
├── main_denoising.py
├── main_get_vad.py
├── model
│   ├── global_mvn_stats.mat
│   ├── speech_enhancement.model0
│   └── speech_enhancement.model1
├── run_eval.sh
└── utils.py

/.gitignore:
--------------------------------------------------------------------------------
__pycache__/*
\#*
*.pyc
*~

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
# Start from the NVIDIA CUDA 10.1 runtime image (based on Ubuntu 18.04).
FROM nvidia/cuda:10.1-cudnn8-runtime-ubuntu18.04

# Install system packages.
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        g++ gfortran \
        openmpi-bin \
        libsndfile-dev \
        software-properties-common \
        emacs && \
    rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/lib/x86_64-linux-gnu/libmpi_cxx.so.20 /usr/lib/x86_64-linux-gnu/libmpi_cxx.so.1 && \
    ln -s /usr/lib/x86_64-linux-gnu/libmpi.so.20.10.1 /usr/lib/x86_64-linux-gnu/libmpi.so.12 && \
    ldconfig

# Install Python 3.6.
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && \
    apt-get install -y python3.6 python3-pip
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 0

# Install Python packages.
RUN pip3 install --upgrade pip && \
    pip3 install numpy scipy librosa joblib webrtcvad wurlitzer cntk-gpu

# Copy the repository into the image at /dihard18.
WORKDIR /dihard18
COPY . .

# Install the pretrained model.
RUN ./install_model.sh

# Make the eval script executable.
RUN chmod +x ./run_eval.sh

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# A quick-use package for speech enhancement based on our DIHARD18 system

Original author: @staplesinLA

Major contributors: @nryant, @mmmaat (many thanks!)

This repository provides tools to reproduce the enhancement results of the
speech preprocessing part of our DIHARD18 system [1]. The deep-learning-based
denoising model is trained on 400 hours of English and Mandarin audio; for full
details see [1, 2, 3]. Currently the tools accept only 16 kHz, 16-bit,
single-channel (mono) WAV files, so please convert your audio to this format in
advance; an example conversion command is given below.
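For reference, one way to perform this conversion is with SoX (this assumes SoX
is installed; the filenames are placeholders, and ffmpeg can be used to the
same effect):

    # Resample to 16 kHz, downmix to a single channel, and write 16-bit PCM.
    sox input.wav -r 16000 -c 1 -b 16 input_16k_mono.wav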
Additionally, this package integrates a voice activity detection (VAD) module
based on [py-webrtcvad](https://github.com/wiseman/py-webrtcvad), which provides
a Python interface to the [WebRTC](https://webrtc.org/) VAD. The default
parameters were tuned on the development set of DIHARD18.

[1] Sun, Lei, et al. "Speaker Diarization with Enhancing Speech for the First
DIHARD Challenge." Proc. Interspeech 2018 (2018):
2793-2797. [PDF](http://home.ustc.edu.cn/~sunlei17/pdf/lei_IS2018.pdf)

[2] Gao, Tian, et al. "Densely Connected Progressive Learning for LSTM-Based
Speech Enhancement." 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE,
2018. [PDF](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8461861)

[3] Sun, Lei, et al. "Multiple-Target Deep Learning for LSTM-RNN Based Speech
Enhancement." 2017 Hands-free Speech Communications and Microphone Arrays
(HSCMA). IEEE,
2017. [PDF](http://home.ustc.edu.cn/~sunlei17/pdf/MULTIPLE-TARGET.pdf)

## Main Prerequisites

* [CNTK](https://docs.microsoft.com/en-us/cognitive-toolkit/setup-linux-python?tabs=cntkpy26)
* [webrtcvad](https://github.com/wiseman/py-webrtcvad)
* [NumPy](https://github.com/numpy/numpy)
* [SciPy](https://github.com/scipy/scipy)
* [librosa](https://github.com/librosa/librosa)
* [Wurlitzer](https://github.com/minrk/wurlitzer)
* [joblib](https://github.com/joblib/joblib)

## How to use it?

1. Install all dependencies (Python and pip must already be installed on your
   system):

        sudo apt-get install openmpi-bin
        pip install numpy scipy librosa
        pip install cntk-gpu
        pip install webrtcvad
        pip install wurlitzer
        pip install joblib

   Make sure CNTK installed successfully by querying its version:

        python -c "import cntk; print(cntk.__version__)"

2. Download the speech enhancement repository:

        git clone https://github.com/staplesinLA/denoising_DIHARD18.git

3. Install the pretrained model:

        cd denoising_DIHARD18
        ./install_model.sh

4. Specify parameters in `run_eval.sh`:

   * For the speech enhancement tool (an illustrative filled-in example is
     given below):

         WAV_DIR=
         SE_WAV_DIR=
         USE_GPU=
         GPU_DEVICE_ID=
         TRUNCATE_MINUTES=
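     For illustration only, these variables might be filled in as follows; the
     paths and values are placeholders, and the meaning of each variable is
     inferred from its name rather than prescribed by the authors:

         WAV_DIR=/path/to/input/wav        # directory of original 16 kHz mono WAV files
         SE_WAV_DIR=/path/to/enhanced/wav  # directory to write the enhanced WAV files
         USE_GPU=true                      # whether to run the model on a GPU
         GPU_DEVICE_ID=0                   # id of the GPU device to use
         TRUNCATE_MINUTES=10               # assumed: minutes of audio processed per chunk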