├── .gitignore
├── Dockerfile
├── LICENSE
├── README.md
├── convert.sh
├── model
│   ├── saved_model.pb
│   └── variables
│       ├── variables.data-00000-of-00001
│       └── variables.index
├── requirements.txt
├── simple_ehm-runnable.py
├── simple_ehm.ipynb
├── training_data
│   └── .gitignore
└── trainingset
    └── .gitignore

/.gitignore:
--------------------------------------------------------------------------------
trainingset/*
training_data/*
tmp/*
.ipynb_checkpoints/*

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
FROM python:3.9-slim
RUN apt update
RUN apt install libgl1-mesa-glx -y
RUN apt install libglib2.0-0 -y
RUN apt install ffmpeg -y
RUN mkdir /simple_ehm
WORKDIR /simple_ehm
ADD . .
RUN pip install -r requirements.txt

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 morrolinux

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# simple-ehm
A simple tool for a simple task: remove filler sounds ("ehm") from pre-recorded speeches. AI-powered.

# Usage
Basic invocation should be enough:
`./simple_ehm-runnable.py /path/to/video/file`
This will generate the trimmed video in the same folder as the original file; pass `--srt` to also get a subtitle track (`.srt`) that is useful for checking the classification.

For more info read the help:
`./simple_ehm-runnable.py --help`

You can also run simple-ehm in a dockerized environment. Build the image with `docker build -t simple .` and then, instead of `./simple_ehm-runnable.py`, use `./convert.sh`. Note that `convert.sh` mounts the current directory into the container, so the input file must be moved into this folder first.
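For example, a complete dockerized run could look like this (the video file name is illustrative; since `convert.sh` mounts the current directory into the container, the input file has to live in the repository folder):

```sh
# build the image once
docker build -t simple .

# copy the video you want to process into this folder, then run the containerized script on it
cp ~/Videos/lecture.mp4 .
./convert.sh lecture.mp4
# the trimmed video is written next to lecture.mp4, with a timestamp suffix in its name
```
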
# Contributing to the model
There are two ways you can contribute to the model:

## Contribute to the dataset
By sending me at least 30 one-second-long WAV (pcm_s16le, mono, 16 kHz) clips for each class (silence, speech, ehm) [easy]
- You can convert your clips to the right format with ffmpeg: `ffmpeg -i input-file -c:a pcm_s16le -ac 1 -ar 16000 -filter:a "dynaudnorm" output.wav`
- You can extract ehm(s) and silences **along with erroneously classified sounds** (false positives) by passing `--generate-training-data` as an invocation parameter. You can then use the misclassified clips to improve your training set!

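If you have a whole folder of clips to convert, a small loop around the same ffmpeg command does it in one go (a sketch; the `raw/` and `converted/` directory names are just examples):

```sh
mkdir -p converted
for f in raw/*.wav; do
  ffmpeg -i "$f" -c:a pcm_s16le -ac 1 -ar 16000 -filter:a "dynaudnorm" "converted/$(basename "$f")"
done
```
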
## Contribute to the training
- By implementing transfer training logic on this model's python notebook
- By retraining the current model with your dataset and making a PR with the updated one

--------------------------------------------------------------------------------
/convert.sh:
--------------------------------------------------------------------------------
# run simple_ehm-runnable.py inside the docker image, mounting the current directory
docker run --rm -v "$PWD":/simple_ehm simple python ./simple_ehm-runnable.py "$1"

--------------------------------------------------------------------------------
/model/saved_model.pb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/saved_model.pb

--------------------------------------------------------------------------------
/model/variables/variables.data-00000-of-00001:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/variables/variables.data-00000-of-00001

--------------------------------------------------------------------------------
/model/variables/variables.index:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/model/variables/variables.index

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
tensorflow==2.13.0
matplotlib
seaborn
tqdm
psutil
opencv-python
protobuf==3.20.0

--------------------------------------------------------------------------------
/simple_ehm-runnable.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3
# You can prepare the training dataset like so:
#   ffmpeg -i full.wav -f segment -segment_time 1 -c copy out%03d.wav
# and then manually divide the speech clips and the ehm(s) into separate folders.
import os
import io
import pathlib
import argparse
import signal

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
import tensorflow as tf

from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras.layers import Reshape
from tensorflow.keras import layers
from tensorflow.keras import models

from tqdm import tqdm

from functools import wraps
import datetime
import time
import psutil
import subprocess
import cv2
from contextlib import suppress

# Note: this Sequential is only a placeholder; the model actually used for inference
# is loaded from the ./model directory in __main__.
model = models.Sequential([  # use the `models` module instead of `model`
    Reshape((124, 129, 1), input_shape=(124, 129, 1)),
])

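# High-level pipeline of this script:
#   1. convert_input(): extract a mono 16 kHz WAV track from the input video with ffmpeg
#   2. analyze_track(): slide a ~1 s window over the waveform, turn each window into a
#      spectrogram and classify it as "ehm", "silence" or "speech" with the saved model
#   3. generate_cut() / generate_tdata(): queue one ffmpeg command per segment to keep
#      (or per discarded segment, when --generate-training-data is passed)
#   4. cut_and_merge(): run the queued ffmpeg commands in parallel and concatenate the
#      kept segments into the final video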
parser = argparse.ArgumentParser()
parser.add_argument("filename", help="video file name (or full file path) to classify")
parser.add_argument("--fastcut", default=False, action='store_true',
                    help="cut and merge an mp4 video without re-encoding, using an edit list. "
                         "Might not work on some players. See https://stackoverflow.com/a/18449609")
parser.add_argument("--window-size-divide", type=float, default=1, help="divide the window size (default: 1s) by this factor")
parser.add_argument("--window-slide-divide", type=float, default=2, help="divide the window slide by this factor (default: half the window size)")
parser.add_argument("--fps", type=int, default=-1, help="frames per second of the encoded video. Lower FPS means faster encoding (default: original)")
parser.add_argument("--crf", type=int, default=-1, help="CRF factor for h264 encoding")
parser.add_argument("--spectrogram", default=False, action='store_true', help="show the spectrogram of each window_size window, sliding by window_slide, during analysis (debugging only)")
parser.add_argument("--generate-training-data", default=False, action='store_true', help="also export the extracted ehm(s) and silences to a separate folder. Useful for training on false positives")
parser.add_argument("--srt", default=False, action='store_true', help="generate a subtitle track for easier accuracy evaluation")
parser.add_argument("--keep", nargs="+", default=["speech"], help="space-separated tags to be kept in the final video. Eg: ehm silence. Default: speech")
parser.add_argument("--keep-junk", default=False, action="store_true", help="keep temporary files")

args = parser.parse_args()
video_path = args.filename
audio_len = None
pbar = None
tmp_folder = "tmp"
td_folder = "training_data"
_perf = dict()
stats = None
labels = ["ehm", "silence", "speech"]
keep = set()
trash = set()

cuts = []       # ffmpeg commands for splitting the video into segments
mergelist = []  # files for ffmpeg to merge back together

# ctrl+c handler: clean up temporary files before exiting
def signal_handler(sig, frame):
    try:
        filename, file_extension = os.path.splitext(video_path)
        if not args.keep_junk:
            os.remove(f'{filename}.wav')
            for file in os.listdir("tmp"):
                os.remove("tmp/" + file)

    except Exception:
        pass

    exit()

signal.signal(signal.SIGINT, signal_handler)

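# Helpers: plot_spectrogram() is a matplotlib debugging aid (not called in the normal
# flow), timeit() records each decorated function's wall-clock duration in _perf, and
# convert_input()/decode_audio() turn the input video into a 16 kHz mono waveform tensor.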
def plot_spectrogram(spectrogram, ax):
    # Convert the frequencies to log scale and transpose so that time is
    # represented on the x-axis (columns).
    log_spec = np.log(spectrogram.T)
    height = log_spec.shape[0]
    X = np.arange(16000, step=height + 1)
    Y = range(height)
    ax.pcolormesh(X, Y, log_spec)

def timeit(f):
    @wraps(f)
    def timed(*args, **kw):
        a = datetime.datetime.now()
        output = f(*args, **kw)
        b = datetime.datetime.now()
        _perf[f.__name__] = b - a
        return output
    return timed

@timeit
def convert_input(path):
    wav_path = path[:-4] + ".wav"
    print("extracting audio track...")
    cmd = ["ffmpeg", "-hide_banner", "-loglevel", "error", "-i", path, "-c:a", "pcm_s16le", "-ar", "16000", "-ac", "1",
           "-filter:a", "dynaudnorm", wav_path, "-y"]
    subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE).wait()
    return wav_path

@timeit
def decode_audio(audio_binary):
    audio, rate = tf.audio.decode_wav(audio_binary)
    return tf.squeeze(audio, axis=-1), rate

def get_spectrogram(waveform, seek, window_size):
    # Padding for files shorter than window_size
    if tf.shape(waveform) < window_size:
        zero_padding = tf.zeros([window_size] - tf.shape(waveform), dtype=tf.float32)
    else:
        zero_padding = tf.zeros(0, dtype=tf.float32)

    # Concatenate audio with padding so that all audio clips are of the same length
    waveform = tf.cast(waveform, tf.float32)
    if tf.shape(waveform) > window_size:
        equal_length = waveform[seek:seek + window_size]
    else:
        equal_length = tf.concat([waveform, zero_padding], 0)

    # Calculate the spectrogram: with a 1 s window at 16 kHz, frame_step=128 and
    # frame_length=255 (256-point FFT) this yields about 125 frames x 129 frequency bins
    spectrogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128, pad_end=True)
    spectrogram = tf.abs(spectrogram)

    # Trim or pad the spectrogram to the (124, 129) shape the model expects
    if tf.shape(spectrogram)[0] > 124:
        spectrogram = spectrogram[:124, :]
    else:
        pad_size = 124 - tf.shape(spectrogram)[0]
        spectrogram = tf.pad(spectrogram, [(0, pad_size), (0, 0)])

    # Add the channel dimension
    spectrogram = tf.expand_dims(spectrogram, axis=-1)

    return spectrogram

def td_folder_init():
    with suppress(FileExistsError): os.mkdir(td_folder)
    removelist = [f for f in os.listdir(td_folder) if f.endswith(".wav")]
    for f in removelist:
        os.remove(os.path.join(td_folder, f))

def generate_tdata(ss, to, count, label):
    filename = td_folder + "/" + label + "-" + str(count) + ".wav"
    cuts.append(["ffmpeg", "-hide_banner", "-loglevel", "error", "-y", "-ss", ss, "-i", video_path, "-t", "1"])
    cuts[-1].extend(["-c:a", "pcm_s16le", "-ac", "1", "-ar", "16000", "-filter:a", "dynaudnorm"])
    cuts[-1].extend([filename])

def generate_cut(ss, to, count):
    out_name = str(count) + video_path[-4:]
    cuts.append(["ffmpeg", "-hide_banner", "-loglevel", "error", "-ss", ss, "-i", video_path, "-ss", ss, "-to", to, "-copyts"])
    if args.crf > 0:
        cuts[-1].extend(["-crf", str(args.crf)])
    if args.fps > 0:
        cuts[-1].extend(["-filter:v", "fps=fps=" + str(args.fps)])

    if args.fastcut:
        cuts[-1].extend(["-c:a", "copy", "-c:v", "copy", "-avoid_negative_ts", "1"])
    else:
        cuts[-1].extend(["-c:v", "libx264"])
        if args.crf <= 0:
            # default quality; when --crf is given, the value appended above is used instead
            cuts[-1].extend(["-crf", "23"])
    cuts[-1].extend([tmp_folder + "/" + out_name, "-y"])
    mergelist.append("file '" + out_name + "'")

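# Core analysis loop: slide a window of `window_size` samples over the waveform in steps
# of `window_slide`, classify each window, and close a segment whenever the predicted
# class changes (or the track ends). Each closed segment is written to the subtitle
# buffer, counted in `stats`, and either queued as a cut to keep (label in --keep) or
# exported as a training clip (with --generate-training-data). Overlapping windows whose
# prediction changed are ignored until the previous window has been passed, unless an
# undesired label shows up, so kept segments can span more than one window.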
@timeit
def analyze_track(model, waveform, sample_rate):
    global cuts, mergelist, pbar, audio_len, labels, stats, keep, trash

    # state vars for the analysis loop
    lastc = -1                # last seen class
    lasts = 0                 # last visited second
    lastts = "00:00:00.000"   # timestamp of the last cut
    count = 0                 # number of subtitle records
    lastwf = 0                # last sample of the last analyzed window: 0s --> 1s at 16000 Hz would be 16000
    stats = [[0, 0] for _ in range(len(labels))]

    if args.srt:
        sub = open(video_path[:-4] + ".srt", 'w', encoding='utf-8')  # subtitle track
    else:
        sub = io.StringIO()  # in-memory file if no subtitle track needs to be generated

    window_size = int(sample_rate / args.window_size_divide)  # 1s by default
    window_slide = int(window_size / args.window_slide_divide)

    # slide the window of size window_size by window_slide per iteration.
    # overlap may occur.
    print("analyzing track...")
    last_i = window_slide * int(audio_len / window_slide)
    pbar = tqdm(total=last_i)
    for i in range(0, audio_len, window_slide):
        pbar.update(n=window_slide)
        spectrogram = get_spectrogram(waveform, i, window_size)
        spectrogram = tf.expand_dims(spectrogram, axis=0)

        prediction = model(spectrogram)
        cls = int(tf.math.argmax(prediction[0]))
        conf = float(tf.nn.softmax(prediction[0])[cls])

        if lastc == -1:
            lastc = cls
            continue

        # generate the cut when we know where it ends (or the track is at its end)
        if cls != lastc or i == last_i:
            s = i / sample_rate
            if i == last_i:
                s += (audio_len - i) / sample_rate

            ts = "0" + str(datetime.timedelta(seconds=s))[:11]
            if len(ts) <= 8:
                ts += ".000"
            # if the sliding window is overlapping the previously analyzed window
            # and the prediction has changed, don't generate a new cut until we are past it
            # ...unless an undesired item is detected!
            if labels[cls] not in trash and i < lastwf and i < last_i:
                continue
            # generate subtitles
            record = str(count) + "\n" + lastts.replace('.', ',') + " --> " + \
                     ts.replace('.', ',') + "\n" + labels[lastc] + \
                     "\n[" + str(conf * 100)[:4] + "]" + "\n\n"
            count += 1
            sub.write(record)
            stats[lastc][0] += 1
            stats[lastc][1] += s - lasts
            lasts = s
            # generate cut
            if labels[lastc] in keep:
                generate_cut(lastts, ts, count)
            elif args.generate_training_data:
                generate_tdata(lastts, ts, count, labels[lastc])
            lastts = ts
            lastc = cls
            # slide the right-hand side of the detection window.
            # This allows cutting segments longer than the window size
            lastwf = i + window_size

        if not args.spectrogram:
            continue

        # debug view: show the current spectrogram and its predicted label
        img = spectrogram.numpy().T
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        cv2.putText(img, labels[cls],
                    (5, 120), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
        cv2.imshow("spectrogram", img)
        cv2.waitKey(1) & 0xFF
        time.sleep(0.2)

    sub.close()

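# cut_and_merge() runs the queued ffmpeg commands in a fixed-size pool of parallel
# processes (half the logical CPU count), then writes the list of produced segments to
# tmp/inputs.txt and concatenates them with ffmpeg's concat demuxer into
# "<original name>_<hour-minute-second><original extension>" next to the input file.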
@timeit
def cut_and_merge(out_filename):
    global pbar
    cores = int(psutil.cpu_count() / 2)
    print("CUT and MERGE: running", cores, "simultaneous ffmpeg instances.")
    procs = []
    i = 0

    # process pool of size `cores`
    for c in range(cores):
        procs.append(None)

    pbar = tqdm(total=len(cuts))
    # loop until all cuts have been issued.
    while i < len(cuts):
        # find an empty spot in the pool and give it to the next cut
        for p in range(len(procs)):
            # if the seat in the pool is empty or the occupying job has finished
            if procs[p] is None or procs[p].poll() is not None:
                # if the occupying job terminated with an error, abort everything.
                if procs[p] is not None and procs[p].poll() != 0:
                    print("there was an error with an ffmpeg process. aborting!!!")
                    print(procs[p].communicate())
                    exit(1)
                procs[p] = subprocess.Popen(cuts[i], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
                # print(procs[p], "issued. PID:", procs[p].pid)
                i += 1
                pbar.update(n=1)
                break
        time.sleep(0.01)

    print("\nwaiting for all processes to finish...")
    for p in procs:
        with suppress(AttributeError): p.wait()

    mergelist_path = tmp_folder + "/inputs.txt"
    with open(mergelist_path, 'w', encoding='utf-8') as f:
        for m in mergelist:
            f.write(m + "\n")

    hour = str(datetime.datetime.now().hour)
    minute = str(datetime.datetime.now().minute)
    secs = str(datetime.datetime.now().second)

    out_filename += "_" + hour + "-" + minute + "-" + secs + video_path[-4:]

    cmd = ["ffmpeg", "-hide_banner", "-loglevel", "error", "-f", "concat", "-i", mergelist_path, "-c", "copy", out_filename, "-y"]
    subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE).wait()


if __name__ == '__main__':
    if not os.path.isfile(args.filename):
        raise Exception(f"Error! {args.filename} doesn't exist.")

    with suppress(FileExistsError): os.mkdir(tmp_folder)

    if args.generate_training_data:
        td_folder_init()

    keep = set(args.keep)
    trash = set(labels) - keep

    model = tf.keras.models.load_model('model')
    wav_path = convert_input(video_path)
    audio_binary = tf.io.read_file(wav_path)
    waveform, sample_rate = decode_audio(audio_binary)
    sample_rate = int(sample_rate)

    audio_len = len(waveform)

    analyze_track(model, waveform, sample_rate)
    cut_and_merge(video_path[:-4])

    if not args.keep_junk:
        os.remove(wav_path)
        for file in os.listdir("tmp"):
            os.remove("tmp/" + file)

    print("\nDONE!")
    for k, v in _perf.items():
        print(k, "took", str(v))

    saved_time = 0

    print("")
    for i in range(len(labels)):
        if labels[i] == "speech":
            continue
        saved_time += stats[i][1]
        print("Removed ", stats[i][0], " ", labels[i], "(s)",
              " for a total of ", str(datetime.timedelta(seconds=stats[i][1]))[:8],
              sep="")

    print("Total time saved:", str(datetime.timedelta(seconds=saved_time))[:8])

--------------------------------------------------------------------------------
/training_data/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/morrolinux/simple-ehm/605bb9eae94fb58e3ef0ff8bcb7944ae2ce884be/training_data/.gitignore

--------------------------------------------------------------------------------
/trainingset/.gitignore:
--------------------------------------------------------------------------------
!.gitignore

--------------------------------------------------------------------------------