├── .gitignore
├── Dockerfile
├── LICENSE
├── README.md
├── convert.sh
├── model
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
├── requirements.txt
├── simple_ehm-runnable.py
├── simple_ehm.ipynb
├── training_data
│   └── .gitignore
└── trainingset
    └── .gitignore

/.gitignore:
--------------------------------------------------------------------------------
trainingset/*
training_data/*
tmp/*
.ipynb_checkpoints/*

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.9-slim
RUN apt update
RUN apt install libgl1-mesa-glx -y
RUN apt install libglib2.0-0 -y
RUN apt install ffmpeg -y
RUN mkdir /simple_ehm
WORKDIR /simple_ehm
ADD . .
RUN pip install -r requirements.txt

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 morrolinux

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# simple-ehm
A simple tool for a simple task: remove filler sounds ("ehm") from pre-recorded speeches. AI-powered.

# Usage
Basic invocation should be enough:
`./simple_ehm-runnable.py /path/to/video/file`
This will generate the trimmed video in the same folder as the original file; pass `--srt` to also get a subtitle track (`.srt`) that is useful for checking the classification.

For more info read the help:
`./simple_ehm-runnable.py --help`

You can also run simple-ehm in a dockerized environment. Build the image with `docker build -t simple .` and then, instead of `./simple_ehm-runnable.py`, use `./convert.sh`. Note that `convert.sh` mounts the current directory into the container, so the input file must be moved into this folder first.
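For example, a complete dockerized run could look like this (the video file name is illustrative; since `convert.sh` mounts the current directory into the container, the input file has to live in the repository folder):

```sh
# build the image once
docker build -t simple .

# copy the video you want to process into this folder, then run the containerized script on it
cp ~/Videos/lecture.mp4 .
./convert.sh lecture.mp4
# the trimmed video is written next to lecture.mp4, with a timestamp suffix in its name
```
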
# Contributing to the model
There are two ways you can contribute to the model:

## Contribute to the dataset
By sending me at least 30 one-second-long WAV (pcm_s16le, mono, 16 kHz) clips for each class (silence, speech, ehm) [easy]
- You can convert your clips to the right format with ffmpeg: `ffmpeg -i input-file -c:a pcm_s16le -ac 1 -ar 16000 -filter:a "dynaudnorm" output.wav`
- You can extract ehm(s) and silences **along with erroneously classified sounds** (false positives) by passing `--generate-training-data` as an invocation parameter. You can then use the misclassified clips to improve your training set!

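If you have a whole folder of clips to convert, a small loop around the same ffmpeg command does it in one go (a sketch; the `raw/` and `converted/` directory names are just examples):

```sh
mkdir -p converted
for f in raw/*.wav; do
  ffmpeg -i "$f" -c:a pcm_s16le -ac 1 -ar 16000 -filter:a "dynaudnorm" "converted/$(basename "$f")"
done
```
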
## Contribute to the training
- By implementing transfer training logic on this model's python notebook
- By retraining the current model with your dataset and making a PR with the updated one

--------------------------------------------------------------------------------
/convert.sh:
--------------------------------------------------------------------------------
# run simple_ehm-runnable.py inside the docker image, mounting the current directory
docker run --rm -v "$PWD":/simple_ehm simple python ./simple_ehm-runnable.py "$1"

--------------------------------------------------------------------------------
/model/saved_model.pb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/saved_model.pb

--------------------------------------------------------------------------------
/model/variables/variables.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/variables/variables.data-00000-of-00001

--------------------------------------------------------------------------------
/model/variables/variables.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/variables/variables.index

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
tensorflow==2.13.0
matplotlib
seaborn
tqdm
psutil
opencv-python
protobuf==3.20.0

--------------------------------------------------------------------------------
/simple_ehm-runnable.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
# You can prepare the training dataset like so:
#   ffmpeg -i full.wav -f segment -segment_time 1 -c copy out%03d.wav
# and then manually divide the speech clips and the ehm(s) into separate folders.
import os
import io
import pathlib
import argparse
import signal

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
import tensorflow as tf

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.layers import Reshape
from tensorflow.keras import layers
from tensorflow.keras import models

from tqdm import tqdm

from functools import wraps
import datetime
import time
import psutil
import subprocess
import cv2
from contextlib import suppress

# Note: this Sequential is only a placeholder; the model actually used for inference
# is loaded from the ./model directory in __main__.
model = models.Sequential([  # use the `models` module instead of `model`
    Reshape((124, 129, 1), input_shape=(124, 129, 1)),
])

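# High-level pipeline of this script:
#   1. convert_input(): extract a mono 16 kHz WAV track from the input video with ffmpeg
#   2. analyze_track(): slide a ~1 s window over the waveform, turn each window into a
#      spectrogram and classify it as "ehm", "silence" or "speech" with the saved model
#   3. generate_cut() / generate_tdata(): queue one ffmpeg command per segment to keep
#      (or per discarded segment, when --generate-training-data is passed)
#   4. cut_and_merge(): run the queued ffmpeg commands in parallel and concatenate the
#      kept segments into the final video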
parser = argparse.ArgumentParser()
parser.add_argument("filename", help="video file name (or full file path) to classify")
parser.add_argument("--fastcut", default=False, action='store_true',
                    help="cut and merge an mp4 video without re-encoding, using an edit list. "
                         "Might not work on some players. See https://stackoverflow.com/a/18449609")
parser.add_argument("--window-size-divide", type=float, default=1, help="divide the window size (default: 1s) by this factor")
parser.add_argument("--window-slide-divide", type=float, default=2, help="divide the window slide by this factor (default: half the window size)")
parser.add_argument("--fps", type=int, default=-1, help="frames per second of the encoded video. Lower FPS means faster encoding (default: original)")
parser.add_argument("--crf", type=int, default=-1, help="CRF factor for h264 encoding")
parser.add_argument("--spectrogram", default=False, action='store_true', help="show the spectrogram of each window_size window, sliding by window_slide, during analysis (debugging only)")
parser.add_argument("--generate-training-data", default=False, action='store_true', help="also export the extracted ehm(s) and silences to a separate folder. Useful for training on false positives")
parser.add_argument("--srt", default=False, action='store_true', help="generate a subtitle track for easier accuracy evaluation")
parser.add_argument("--keep", nargs="+", default=["speech"], help="space-separated tags to be kept in the final video. Eg: ehm silence. Default: speech")
parser.add_argument("--keep-junk", default=False, action="store_true", help="keep temporary files")

args = parser.parse_args()
video_path = args.filename
audio_len = None
pbar = None
tmp_folder = "tmp"
td_folder = "training_data"
_perf = dict()
stats = None
labels = ["ehm", "silence", "speech"]
keep = set()
trash = set()

cuts = []       # ffmpeg commands for splitting the video into segments
mergelist = []  # files for ffmpeg to merge back together

# ctrl+c handler: clean up temporary files before exiting
def signal_handler(sig, frame):
    try:
        filename, file_extension = os.path.splitext(video_path)
        if not args.keep_junk:
            os.remove(f'{filename}.wav')
            for file in os.listdir("tmp"):
                os.remove("tmp/" + file)

    except Exception:
        pass

    exit()

signal.signal(signal.SIGINT, signal_handler)

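# Helpers: plot_spectrogram() is a matplotlib debugging aid (not called in the normal
# flow), timeit() records each decorated function's wall-clock duration in _perf, and
# convert_input()/decode_audio() turn the input video into a 16 kHz mono waveform tensor.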
def plot_spectrogram(spectrogram, ax):
    # Convert the frequencies to log scale and transpose so that time is
    # represented on the x-axis (columns).
    log_spec = np.log(spectrogram.T)
    height = log_spec.shape[0]
    X = np.arange(16000, step=height + 1)
    Y = range(height)
    ax.pcolormesh(X, Y, log_spec)

def timeit(f):
    @wraps(f)
    def timed(*args, **kw):
        a = datetime.datetime.now()
        output = f(*args, **kw)
        b = datetime.datetime.now()
        _perf[f.__name__] = b - a
        return output
    return timed

@timeit
def convert_input(path):
    wav_path = path[:-4] + ".wav"
    print("extracting audio track...")
    cmd = ["ffmpeg", "-hide_banner", "-loglevel", "error", "-i", path, "-c:a", "pcm_s16le", "-ar", "16000", "-ac", "1",
           "-filter:a", "dynaudnorm", wav_path, "-y"]
    subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE).wait()
    return wav_path

@timeit
def decode_audio(audio_binary):
    audio, rate = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1), rate

def get_spectrogram(waveform, seek, window_size):
    # Padding for files shorter than window_size
    if tf.shape(waveform) < window_size:
        zero_padding = tf.zeros([window_size] - tf.shape(waveform), dtype=tf.float32)
    else:
        zero_padding = tf.zeros(0, dtype=tf.float32)

    # Concatenate audio with padding so that all audio clips are of the same length
    waveform = tf.cast(waveform, tf.float32)
    if tf.shape(waveform) > window_size:
        equal_length = waveform[seek:seek + window_size]
    else:
        equal_length = tf.concat([waveform, zero_padding], 0)

    # Calculate the spectrogram: with a 1 s window at 16 kHz, frame_step=128 and
    # frame_length=255 (256-point FFT) this yields about 125 frames x 129 frequency bins
    spectrogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128, pad_end=True)
    spectrogram = tf.abs(spectrogram)

    # Trim or pad the spectrogram to the (124, 129) shape the model expects
    if tf.shape(spectrogram)[0] > 124:
        spectrogram = spectrogram[:124, :]
    else:
        pad_size = 124 - tf.shape(spectrogram)[0]
        spectrogram = tf.pad(spectrogram, [(0, pad_size), (0, 0)])

    # Add the channel dimension
    spectrogram = tf.expand_dims(spectrogram, axis=-1)

    return spectrogram

def td_folder_init():
    with suppress(FileExistsError): os.mkdir(td_folder)
    removelist = [f for f in os.listdir(td_folder) if f.endswith(".wav")]
    for f in removelist:
        os.remove(os.path.join(td_folder, f))

def generate_tdata(ss, to, count, label):
    filename = td_folder + "/" + label + "-" + str(count) + ".wav"
    cuts.append(["ffmpeg", "-hide_banner", "-loglevel", "error", "-y", "-ss", ss, "-i", video_path, "-t", "1"])
    cuts[-1].extend(["-c:a", "pcm_s16le", "-ac", "1", "-ar", "16000", "-filter:a", "dynaudnorm"])
    cuts[-1].extend([filename])

def generate_cut(ss, to, count):
    out_name = str(count) + video_path[-4:]
    cuts.append(["ffmpeg", "-hide_banner", "-loglevel", "error", "-ss", ss, "-i", video_path, "-ss", ss, "-to", to, "-copyts"])
    if args.crf > 0:
        cuts[-1].extend(["-crf", str(args.crf)])
    if args.fps > 0:
        cuts[-1].extend(["-filter:v", "fps=fps=" + str(args.fps)])

    if args.fastcut:
        cuts[-1].extend(["-c:a", "copy", "-c:v", "copy", "-avoid_negative_ts", "1"])
    else:
        cuts[-1].extend(["-c:v", "libx264"])
        if args.crf <= 0:
            # default quality; when --crf is given, the value appended above is used instead
            cuts[-1].extend(["-crf", "23"])
    cuts[-1].extend([tmp_folder + "/" + out_name, "-y"])
    mergelist.append("file '" + out_name + "'")

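# Core analysis loop: slide a window of `window_size` samples over the waveform in steps
# of `window_slide`, classify each window, and close a segment whenever the predicted
# class changes (or the track ends). Each closed segment is written to the subtitle
# buffer, counted in `stats`, and either queued as a cut to keep (label in --keep) or
# exported as a training clip (with --generate-training-data). Overlapping windows whose
# prediction changed are ignored until the previous window has been passed, unless an
# undesired label shows up, so kept segments can span more than one window.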
@timeit
def analyze_track(model, waveform, sample_rate):
    global cuts, mergelist, pbar, audio_len, labels, stats, keep, trash

    # state vars for the analysis loop
    lastc = -1                # last seen class
    lasts = 0                 # last visited second
    lastts = "00:00:00.000"   # timestamp of the last cut
    count = 0                 # number of subtitle records
    lastwf = 0                # last sample of the last analyzed window: 0s --> 1s at 16000 Hz would be 16000
    stats = [[0, 0] for _ in range(len(labels))]

    if args.srt:
        sub = open(video_path[:-4] + ".srt", 'w', encoding='utf-8')  # subtitle track
    else:
        sub = io.StringIO()  # in-memory file if no subtitle track needs to be generated

    window_size = int(sample_rate / args.window_size_divide)  # 1s by default
    window_slide = int(window_size / args.window_slide_divide)

    # slide the window of size window_size by window_slide per iteration.
    # overlap may occur.
    print("analyzing track...")
    last_i = window_slide * int(audio_len / window_slide)
    pbar = tqdm(total=last_i)
    for i in range(0, audio_len, window_slide):
        pbar.update(n=window_slide)
        spectrogram = get_spectrogram(waveform, i, window_size)
        spectrogram = tf.expand_dims(spectrogram, axis=0)

        prediction = model(spectrogram)
        cls = int(tf.math.argmax(prediction[0]))
        conf = float(tf.nn.softmax(prediction[0])[cls])

        if lastc == -1:
            lastc = cls
            continue

        # generate the cut when we know where it ends (or the track is at its end)
        if cls != lastc or i == last_i:
            s = i / sample_rate
            if i == last_i:
                s += (audio_len - i) / sample_rate

            ts = "0" + str(datetime.timedelta(seconds=s))[:11]
            if len(ts) <= 8:
                ts += ".000"
            # if the sliding window is overlapping the previously analyzed window
            # and the prediction has changed, don't generate a new cut until we are past it
            # ...unless an undesired item is detected!
            if labels[cls] not in trash and i < lastwf and i < last_i:
                continue
            # generate subtitles
            record = str(count) + "\n" + lastts.replace('.', ',') + " --> " + \
                     ts.replace('.', ',') + "\n" + labels[lastc] + \
                     "\n[" + str(conf * 100)[:4] + "]" + "\n\n"
            count += 1
            sub.write(record)
            stats[lastc][0] += 1
            stats[lastc][1] += s - lasts
            lasts = s
            # generate cut
            if labels[lastc] in keep:
                generate_cut(lastts, ts, count)
            elif args.generate_training_data:
                generate_tdata(lastts, ts, count, labels[lastc])
            lastts = ts
            lastc = cls
            # slide the right-hand side of the detection window.
            # This allows cutting segments longer than the window size
            lastwf = i + window_size

        if not args.spectrogram:
            continue

        # debug view: show the current spectrogram and its predicted label
        img = spectrogram.numpy().T
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        cv2.putText(img, labels[cls],
                    (5, 120), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
        cv2.imshow("spectrogram", img)
        cv2.waitKey(1) & 0xFF
        time.sleep(0.2)

    sub.close()

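# cut_and_merge() runs the queued ffmpeg commands in a fixed-size pool of parallel
# processes (half the logical CPU count), then writes the list of produced segments to
# tmp/inputs.txt and concatenates them with ffmpeg's concat demuxer into
# "<original name>_<hour-minute-second><original extension>" next to the input file.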
@timeit
def cut_and_merge(out_filename):
    global pbar
    cores = int(psutil.cpu_count() / 2)
    print("CUT and MERGE: running", cores, "simultaneous ffmpeg instances.")
    procs = []
    i = 0

    # process pool of size `cores`
    for c in range(cores):
        procs.append(None)

    pbar = tqdm(total=len(cuts))
    # loop until all cuts have been issued.
    while i < len(cuts):
        # find an empty spot in the pool and give it to the next cut
        for p in range(len(procs)):
            # if the seat in the pool is empty or the occupying job has finished
            if procs[p] is None or procs[p].poll() is not None:
                # if the occupying job terminated with an error, abort everything.
                if procs[p] is not None and procs[p].poll() != 0:
                    print("there was an error with an ffmpeg process. aborting!!!")
                    print(procs[p].communicate())
                    exit(1)
                procs[p] = subprocess.Popen(cuts[i], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
                # print(procs[p], "issued. PID:", procs[p].pid)
                i += 1
                pbar.update(n=1)
                break
        time.sleep(0.01)

    print("\nwaiting for all processes to finish...")
    for p in procs:
        with suppress(AttributeError): p.wait()

    mergelist_path = tmp_folder + "/inputs.txt"
    with open(mergelist_path, 'w', encoding='utf-8') as f:
        for m in mergelist:
            f.write(m + "\n")

    hour = str(datetime.datetime.now().hour)
    minute = str(datetime.datetime.now().minute)
    secs = str(datetime.datetime.now().second)

    out_filename += "_" + hour + "-" + minute + "-" + secs + video_path[-4:]

    cmd = ["ffmpeg", "-hide_banner", "-loglevel", "error", "-f", "concat", "-i", mergelist_path, "-c", "copy", out_filename, "-y"]
    subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE).wait()


if __name__ == '__main__':
    if not os.path.isfile(args.filename):
        raise Exception(f"Error! {args.filename} doesn't exist.")

    with suppress(FileExistsError): os.mkdir(tmp_folder)

    if args.generate_training_data:
        td_folder_init()

    keep = set(args.keep)
    trash = set(labels) - keep

    model = tf.keras.models.load_model('model')
    wav_path = convert_input(video_path)
    audio_binary = tf.io.read_file(wav_path)
    waveform, sample_rate = decode_audio(audio_binary)
    sample_rate = int(sample_rate)

    audio_len = len(waveform)

    analyze_track(model, waveform, sample_rate)
    cut_and_merge(video_path[:-4])

    if not args.keep_junk:
        os.remove(wav_path)
        for file in os.listdir("tmp"):
            os.remove("tmp/" + file)

    print("\nDONE!")
    for k, v in _perf.items():
        print(k, "took", str(v))

    saved_time = 0

    print("")
    for i in range(len(labels)):
        if labels[i] == "speech":
            continue
        saved_time += stats[i][1]
        print("Removed ", stats[i][0], " ", labels[i], "(s)",
              " for a total of ", str(datetime.timedelta(seconds=stats[i][1]))[:8],
              sep="")

    print("Total time saved:", str(datetime.timedelta(seconds=saved_time))[:8])

--------------------------------------------------------------------------------
/training_data/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/training_data/.gitignore

--------------------------------------------------------------------------------
/trainingset/.gitignore:
--------------------------------------------------------------------------------
!.gitignore

--------------------------------------------------------------------------------