├── README.md
├── data
├── KatyPerry-Firework.mp3
└── silence.webm
├── demucsWrapper.py
├── lang2to3.py
├── markers
├── OKW-MRK-da.wav
├── OKW-MRK-de.wav
├── OKW-MRK-el.wav
├── OKW-MRK-en.wav
├── OKW-MRK-es.wav
├── OKW-MRK-fr.wav
├── OKW-MRK-hi.wav
├── OKW-MRK-hu.wav
├── OKW-MRK-it.wav
├── OKW-MRK-ja.wav
├── OKW-MRK-ko.wav
├── OKW-MRK-nl.wav
├── OKW-MRK-pl.wav
├── OKW-MRK-pt.wav
├── OKW-MRK-sv.wav
├── OKW-MRK-uk.wav
├── OKW-MRK.wav
├── WOK-MRK-da.wav
├── WOK-MRK-de.wav
├── WOK-MRK-el.wav
├── WOK-MRK-en.wav
├── WOK-MRK-es.wav
├── WOK-MRK-fr.wav
├── WOK-MRK-hi.wav
├── WOK-MRK-hu.wav
├── WOK-MRK-it.wav
├── WOK-MRK-ja.wav
├── WOK-MRK-ko.wav
├── WOK-MRK-nl.wav
├── WOK-MRK-pl.wav
├── WOK-MRK-pt.wav
├── WOK-MRK-sv.wav
├── WOK-MRK-uk.wav
└── WOK-MRK.wav
└── transcribeHallu.py
/README.md:
--------------------------------------------------------------------------------
1 | # WhisperHallu
2 | Experimental code: sound file preprocessing to optimize Whisper transcriptions and avoid hallucinated text
3 |
4 | See this discussion: https://github.com/openai/whisper/discussions/679
5 |
6 | # Main algo
7 | - remove noise by voice extraction using [Facebook Demucs](https://github.com/facebookresearch/demucs) or [Deezer Spleeter](https://github.com/deezer/spleeter).
8 | - remove silences, and normalize loudness with ffmpeg.
9 | - remove noise parts using [Silero VAD](https://github.com/snakers4/silero-vad).
10 | - add voice markers.
11 | - apply a speech compressor (requires `ffmpeg` >= 4.4; Google Colab ships 4.2 and has to be upgraded, see below).
12 | - try to transcribe. If markers are present in output, transcription is OK.
13 | - if not, try to invert markers. If markers are present in output, transcription is OK.
14 | - if not, try without markers (sketched below).
15 |
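The marker logic boils down to a three-step fallback. A minimal sketch with hypothetical helper names (`transcribe()`, `concat_markers()`, `markers_present()`, `strip_markers()` are stand-ins; the real implementation is `transcribeMARK()` in `transcribeHallu.py`):

```python
def transcribe_with_marker_fallback(audio_path, opts):
    # 1. Markers in normal order: WOK-MRK + audio + OKW-MRK
    text = transcribe(concat_markers(audio_path, inverted=False), opts)
    if markers_present(text):
        return strip_markers(text)
    # 2. Markers inverted: OKW-MRK + audio + WOK-MRK
    text = transcribe(concat_markers(audio_path, inverted=True), opts)
    if markers_present(text):
        return strip_markers(text)
    # 3. Last resort: transcribe without markers
    return transcribe(audio_path, opts)
```
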
16 | # Processing options and parameters
17 | - use Whisper V1, V2 or V3 (V2 by default, because V3 seems bad with music).
18 | - beam_size (2 by default), patience, temperature.
19 | - process only a subpart of the input file (needs post-processing of timestamp values).
20 | - various time-stretching methods tested (see in-code comments; needs post-processing of timestamp values; an interesting suggestion, but no real gain on my side).
21 | - vocals remix (with or without speech normalization).
22 | - multiple final transcriptions (get several results without re-running the pre-processing, since Whisper is not stable from one run to another; see the usage sketch below).
23 |
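Most of these options map to module-level settings and to `transcribeOpts()` parameters. A hedged usage sketch based on the current signatures (paths and values are illustrative):

```python
import transcribeHallu
from transcribeHallu import loadModel, transcribeOpts

# beam_size, patience and temperature are module-level settings
transcribeHallu.beam_size = 5

loadModel("0", modelSize="medium")

opts = dict(language="en", initial_prompt="")
result = transcribeOpts("/path/to/song.mp3", opts,
                        lngInput="en",
                        isMusic=True,              # minimal processing + vocals remix
                        subBeg="0", subEnd="60",   # process only the first 60 seconds
                        remixFactor="0.3",         # weight of drums/bass/other stems in the remix
                        speechnorm=True)           # normalize vocals before remixing
```
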
24 | # Complement
25 |
26 | May be used to produce "accurate transcriptions" for WhisperTimeSync:
27 | https://github.com/EtienneAb3d/WhisperTimeSync
28 |
29 | May be tested using NeuroSpell Dictaphone:
30 | https://neurospell.com/
31 |
32 | WhisperHallu and WhisperTimeSync are used to extract vocals and lyrics in karaok-AI:
33 | https://github.com/EtienneAb3d/karaok-AI
34 |
35 | ChatMate is a complete and versatile ChatGPT automation tool, including instructions to build an SRT file translator to Chinese (as an example):
36 | https://github.com/EtienneAb3d/ChatMate
37 |
38 | # Google Colab
39 |
40 | Standard Whisper:
41 | https://colab.research.google.com/drive/1-GpXaNaGFXKX9VXl60JGVVrGO41t09KA?usp=sharing
42 |
43 | Faster Whisper:
44 | https://colab.research.google.com/drive/1RkvOtUTbUD5NVsRI4aKEqJO8BRo8BFIY?usp=sharing
45 |
46 | # Install
47 |
48 | **Check ffmpeg version >=4.4**
49 | ```sh
50 | ffmpeg -version
51 | 
52 | # Output should be:
53 | # =================
54 | # ffmpeg version 4.4.3-0ubuntu1~20.04.sav2 Copyright (c) 2000-2022 the FFmpeg developers
55 | # [...]
56 | 
57 | # Install latest:
58 | # ===============
59 | sudo add-apt-repository -y ppa:savoury1/ffmpeg4
60 | sudo apt-get -qq install -y ffmpeg
61 | 
62 | ```
63 |
64 | **Demucs (if used)**
65 |
66 | ```sh
67 | pip install -U demucs
68 | ```
69 |
70 | **Spleeter (if used)**
71 |
72 | ```sh
73 | pip install spleeter
74 | ```
75 |
76 | **Standard Whisper (if used)**
77 |
78 | ```sh
79 | sudo apt update && sudo apt install ffmpeg
80 |
81 | sudo apt install python3
82 | sudo apt install python3-pip
83 | sudo apt install virtualenv
84 |
85 | virtualenv -p python3 ../venvWhisper
86 | . ../venvWhisper/bin/activate
87 |
88 | pip install -U openai-whisper
89 |
90 | pip3 install torchaudio
91 | ```
92 |
93 | **Faster Whisper (if used in place of Whisper)**
94 |
95 | ```sh
96 | sudo apt update && sudo apt install ffmpeg
97 |
98 | sudo apt install python3
99 | sudo apt install python3-pip
100 | sudo apt install virtualenv
101 |
102 | virtualenv -p python3 ../venvFasterWhisper
103 | . ../venvFasterWhisper/bin/activate
104 |
105 | git clone https://github.com/guillaumekln/faster-whisper.git
106 | cd faster-whisper/
107 |
108 | pip install -e .[conversion]
109 | pip install -e .
110 |
111 | cd ..
112 |
113 | ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --quantization float16
114 | ct2-transformers-converter --model openai/whisper-large --output_dir whisper-large-ct2 --quantization float16
115 |
116 | pip3 install torchaudio
117 | ```
118 |
119 | **SM4T (if used in place of Whisper)**
120 |
121 | ```sh
122 | sudo apt update && sudo apt install ffmpeg
123 |
124 | sudo apt install python3
125 | sudo apt install python3-pip
126 | sudo apt install virtualenv
127 |
128 | virtualenv -p python3 ../venvSM4T
129 | . ../venvSM4T/bin/activate
130 |
131 | git clone https://github.com/facebookresearch/seamless_communication.git
132 | cd seamless_communication/
133 |
134 | pip install --upgrade pip
135 | pip install .
136 |
137 | m4t_predict "On ne fait pas d'omelette sans casser des oeufs." t2tt eng --src_lang fra
138 |
139 | pip3 install torchaudio
140 | ```
141 |
142 | # Code
143 |
144 | ```python
145 | from transcribeHallu import loadModel
146 | from transcribeHallu import transcribePrompt
147 |
148 | ##### The audio language may be different from the one for the output transcription.
149 | path="/path/to/your/en/sound/file"
150 | lngInput="en"
151 |
152 | ##### Activate this for music files to get minimal processing
153 | isMusic=False
154 |
155 | ##### Needs to be adapted for each language.
156 | ##### For prompt examples, see transcribeHallu.py getPrompt(lng:str)
157 | lng="en"
158 | prompt= "Whisper, Ok. "\
159 | +"A pertinent sentence for your purpose in your language. "\
160 | +"Ok, Whisper. Whisper, Ok. "\
161 | +"Ok, Whisper. Whisper, Ok. "\
162 | +"Please find here, an unlikely ordinary sentence. "\
163 | +"This is to avoid a repetition to be deleted. "\
164 | +"Ok, Whisper. "
165 |
166 | ##### Model size to use
167 | modelSize="medium"
168 | loadModel("0",modelSize=modelSize)
169 |
170 | result = transcribePrompt(path=path, lng=lng, prompt=prompt, lngInput=lngInput,isMusic=isMusic)
171 | ```
172 |
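For timestamped output (e.g. to feed WhisperTimeSync), `transcribePrompt()` also accepts `addSRT=True`; the SRT text is then appended after the plain transcription. A minimal example based on the snippet above:

```python
resultWithSRT = transcribePrompt(path=path, lng=lng, prompt=prompt,
                                 lngInput=lngInput, isMusic=isMusic,
                                 addSRT=True)
```
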
173 |
174 | This tool is a demonstration of our know-how.
175 | If you are interested in a commercial/industrial AI linguistic project, contact us:
176 | https://cubaix.com
177 |
--------------------------------------------------------------------------------
/data/KatyPerry-Firework.mp3:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/data/KatyPerry-Firework.mp3
--------------------------------------------------------------------------------
/data/silence.webm:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/data/silence.webm
--------------------------------------------------------------------------------
/demucsWrapper.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torchaudio
3 | import demucs
4 | from demucs.pretrained import get_model_from_args
5 | from demucs.apply import apply_model
6 | from demucs.separate import load_track
7 | from torch._C import device
8 | import re
9 |
10 | def load_demucs_model():
11 | return get_model_from_args(type('args', (object,), dict(name='htdemucs', repo=None))).cpu().eval()
12 |
13 |
14 | def demucs_audio(pathIn: str,
15 | model=None,
16 | device=None,
17 | pathVocals: str = None,
18 | pathOther: str = None):
19 | if model is None:
20 | model = load_demucs_model()
21 |
22 | audio = load_track(pathIn, model.audio_channels, model.samplerate)
23 |
24 | audio_dims = audio.dim()
25 | if audio_dims == 1:
26 | audio = audio[None, None].repeat_interleave(2, -2)
27 | else:
28 | if audio.shape[-2] == 1:
29 | audio = audio.repeat_interleave(2, -2)
30 | if audio_dims < 3:
31 | audio = audio[None]
32 |
33 | if device is None:
34 | device = "cuda" if torch.cuda.is_available() else "cpu"
35 | print("Demucs using device: "+device)
36 | result = apply_model(model, audio, device=device, split=True, overlap=.25)
37 | if device != 'cpu':
38 | torch.cuda.empty_cache()
39 |
40 | for name in model.sources:
41 | print("Source: "+name)
42 | if name == "vocals":
43 | source_idx=model.sources.index(name)
44 | source=result[0, source_idx].mean(0)
45 | torchaudio.save(re.sub(r'[.](mp3|wav)$',"."+name+".wav",pathIn), source[None], model.samplerate)
46 |
47 |
48 |
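# Usage sketch (not part of the original module; assumes this file is importable
# and the input path exists). demucs_audio() derives the output path itself by
# replacing the input extension with ".vocals.wav"; the pathVocals/pathOther
# arguments are accepted but currently unused.
if __name__ == "__main__":
    demo_model = load_demucs_model()
    demucs_audio(pathIn="data/KatyPerry-Firework.mp3",
                 model=demo_model,
                 device="cuda:0" if torch.cuda.is_available() else "cpu")
    # -> writes data/KatyPerry-Firework.vocals.wav (mono, at the model sample rate)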
--------------------------------------------------------------------------------
/lang2to3.py:
--------------------------------------------------------------------------------
1 | def lang2to3():
2 | return {
3 | "aa": "aar"# Afar
4 | ,"ab": "abk"# Abkhazian
5 | ,"ae": "ave"# Avestan
6 | ,"af": "afr"# Afrikaans
7 | ,"ak": "aka"# Akan
8 | ,"am": "amh"# Amharic
9 | ,"an": "arg"# Aragonese
10 | ,"ar": "arb"# Modern Standard Arabic
11 | ,"as": "asm"# Assamese
12 | ,"av": "ava"# Avaric
13 | ,"ay": "aym"# Aymara
14 | ,"az": "aze"# Azerbaijani
15 | ,"ba": "bak"# Bashkir
16 | ,"be": "bel"# Belarusian
17 | ,"bg": "bul"# Bulgarian
18 | ,"bh": "bih"# Bihari
19 | ,"bi": "bis"# Bislama
20 | ,"bm": "bam"# Bambara
21 | ,"bn": "ben"# Bengali
22 | ,"bo": "bod"# Tibetan
23 | ,"br": "bre"# Breton
24 | ,"bs": "bos"# Bosnian
25 | ,"ca": "cat"# Catalan
26 | ,"ce": "che"# Chechen
27 | ,"ch": "cha"# Chamorro
28 | ,"co": "cos"# Corsican
29 | ,"cr": "cre"# Cree
30 | ,"cs": "ces"# Czech
31 | ,"cu": "chu"# Slavonic
32 | ,"cv": "chv"# Chuvash
33 | ,"cy": "cym"# Welsh
34 | ,"da": "dan"# Danish
35 | ,"de": "deu"# German
36 | ,"dv": "div"# Divehi
37 | ,"dz": "dzo"# Dzongkha
38 | ,"ee": "ewe"# Ewe
39 | ,"el": "ell"# Greek
40 | ,"en": "eng"# English
41 | ,"eo": "epo"# Esperanto
42 | ,"es": "spa"# Spanish
43 | ,"et": "est"# Estonian
44 | ,"eu": "eus"# Basque
45 | ,"fa": "fas"# Persian
46 | ,"ff": "ful"# Fulah
47 | ,"fi": "fin"# Finnish
48 | ,"fj": "fij"# Fijian
49 | ,"fo": "fao"# Faroese
50 | ,"fr": "fra"# French
51 | ,"fy": "fry"# Frisian
52 | ,"ga": "gle"# Irish
53 | ,"gd": "gla"# Gaelic
54 | ,"gl": "glg"# Galician
55 | ,"gn": "grn"# Guarani
56 | ,"gu": "guj"# Gujarati
57 | ,"gv": "glv"# Manx
58 | ,"ha": "hau"# Hausa
59 | ,"he": "heb"# Hebrew
60 | ,"hi": "hin"# Hindi
61 | ,"ho": "hmo"# Hiri Motu
62 | ,"hr": "hrv"# Croatian
63 | ,"ht": "hat"# Haitian
64 | ,"hu": "hun"# Hungarian
65 | ,"hy": "hye"# Armenian
66 | ,"hz": "her"# Herero
67 | ,"ia": "ina"# Interlingua
68 | ,"id": "ind"# Indonesian
69 | ,"ie": "ile"# Interlingue
70 | ,"ig": "ibo"# Igbo
71 | ,"ii": "iii"# Yi
72 | ,"ik": "ipk"# Inupiaq
73 | ,"io": "ido"# Ido
74 | ,"is": "isl"# Icelandic
75 | ,"it": "ita"# Italian
76 | ,"iu": "iku"# Inuktitut
77 | ,"ja": "jpn"# Japanese
78 | ,"jv": "jav"# Javanese
79 | ,"ka": "kat"# Georgian
80 | ,"kg": "kon"# Kongo
81 | ,"ki": "kik"# Kikuyu
82 | ,"kj": "kua"# Kwanyama
83 | ,"kk": "kaz"# Kazakh
84 | ,"kl": "kal"# Greenlandic
85 | ,"km": "khm"# Khmer
86 | ,"kn": "kan"# Kannada
87 | ,"ko": "kor"# Korean
88 | ,"kr": "kau"# Kanuri
89 | ,"ks": "kas"# Kashmiri
90 | ,"ku": "kur"# Kurdish
91 | ,"kv": "kom"# Komi
92 | ,"kw": "cor"# Cornish
93 | ,"ky": "kir"# Kirghiz
94 | ,"la": "lat"# Latin
95 | ,"lb": "ltz"# Luxembourgish
96 | ,"lg": "lug"# Ganda
97 | ,"li": "lim"# Limburgish
98 | ,"ln": "lin"# Lingala
99 | ,"lo": "lao"# Lao
100 | ,"lt": "lit"# Lithuanian
101 | ,"lu": "lub"# Luba
102 | ,"lv": "lav"# Latvian
103 | ,"mg": "mlg"# Malagasy
104 | ,"mh": "mah"# Marshallese
105 | ,"mi": "mri"# Maori
106 | ,"mk": "mkd"# Macedonian
107 | ,"ml": "mal"# Malayalam
108 | ,"mn": "mon"# Mongolian
109 | ,"mo": "mol"# Moldavian (Romanian)
110 | ,"mr": "mar"# Marathi
111 | ,"ms": "msa"# Malay
112 | ,"mt": "mlt"# Maltese
113 | ,"my": "mya"# Burmese
114 | ,"na": "nau"# Nauru
115 | ,"nb": "nob"# Bokmål
116 | ,"nd": "nde"# Ndebele
117 | ,"ne": "nep"# Nepali
118 | ,"ng": "ndo"# Ndonga
119 | ,"nl": "nld"# Dutch
120 | ,"nn": "nno"# Nynorsk
121 | ,"no": "nob"# Norwegian Bokmål
122 | ,"nr": "nbl"# Ndebele
123 | ,"nv": "nav"# Navajo
124 | ,"ny": "nya"# Chichewa
125 | ,"oc": "oci"# Occitan
126 | ,"oj": "oji"# Ojibwa
127 | ,"om": "orm"# Oromo
128 | ,"or": "ori"# Oriya
129 | ,"os": "oss"# Ossetian
130 | ,"pa": "pan"# Panjabi
131 | ,"pi": "pli"# Pāli
132 | ,"pl": "pol"# Polish
133 | ,"ps": "pus"# Pashto
134 | ,"pt": "por"# Portuguese
135 | ,"qu": "que"# Quechua
136 | ,"rm": "roh"# Romansh
137 | ,"rn": "run"# Kirundi
138 | ,"ro": "ron"# Romanian
139 | ,"ru": "rus"# Russian
140 | ,"rw": "kin"# Kinyarwanda
141 | ,"sa": "san"# Sanskrit
142 | ,"sc": "srd"# Sardinian
143 | ,"sd": "snd"# Sindhi
144 | ,"se": "sme"# Sami
145 | ,"sg": "sag"# Sango
146 | ,"sh": "hbs"# Serbo-Croatian
147 | ,"si": "sin"# Sinhalese
148 | ,"sk": "slk"# Slovak
149 | ,"sl": "slv"# Slovenian
150 | ,"sm": "smo"# Samoan
151 | ,"sn": "sna"# Shona
152 | ,"so": "som"# Somali
153 | ,"sq": "sqi"# Albanian
154 | ,"sr": "srp"# Serbian
155 | ,"ss": "ssw"# Swati
156 | ,"st": "sot"# Sotho
157 | ,"su": "sun"# Sundanese
158 | ,"sv": "swe"# Swedish
159 | ,"sw": "swa"# Swahili
160 | ,"ta": "tam"# Tamil
161 | ,"te": "tel"# Telugu
162 | ,"tg": "tgk"# Tajik
163 | ,"th": "tha"# Thai
164 | ,"ti": "tir"# Tigrinya
165 | ,"tk": "tuk"# Turkmen
166 | ,"tl": "tgl"# Tagalog
167 | ,"tn": "tsn"# Tswana
168 | ,"to": "ton"# Tonga
169 | ,"tr": "tur"# Turkish
170 | ,"ts": "tso"# Tsonga
171 | ,"tt": "tat"# Tatar
172 | ,"tw": "twi"# Twi
173 | ,"ty": "tah"# Tahitian
174 | ,"ug": "uig"# Uighur
175 | ,"uk": "ukr"# Ukrainian
176 | ,"ur": "urd"# Urdu
177 | ,"uz": "uzb"# Uzbek
178 | ,"ve": "ven"# Venda
179 | ,"vi": "vie"# Vietnamese
180 | ,"vo": "vol"# Volapük
181 | ,"wa": "wln"# Walloon
182 | ,"wo": "wol"# Wolof
183 | ,"xh": "xho"# Xhosa
184 | ,"yi": "yid"# Yiddish
185 | ,"yo": "yor"# Yoruba
186 | ,"za": "zha"# Zhuang
187 | ,"zh": "cmn"# Mandarin Chinese
188 | ,"zu": "zul"# Zulu
189 | }
190 |
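# Used by transcribeHallu.py to convert Whisper-style 2-letter codes into the
# 3-letter codes expected by SeamlessM4T, e.g. lang2to3()["fr"] == "fra".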
--------------------------------------------------------------------------------
/markers/OKW-MRK-da.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-da.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-de.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-de.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-el.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-el.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-en.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-en.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-es.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-es.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-fr.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-fr.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-hi.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-hi.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-hu.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-hu.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-it.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-it.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-ja.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-ja.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-ko.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-ko.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-nl.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-nl.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-pl.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-pl.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-pt.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-pt.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-sv.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-sv.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK-uk.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-uk.wav
--------------------------------------------------------------------------------
/markers/OKW-MRK.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-da.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-da.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-de.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-de.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-el.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-el.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-en.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-en.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-es.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-es.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-fr.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-fr.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-hi.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-hi.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-hu.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-hu.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-it.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-it.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-ja.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-ja.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-ko.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-ko.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-nl.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-nl.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-pl.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-pl.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-pt.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-pt.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-sv.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-sv.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK-uk.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-uk.wav
--------------------------------------------------------------------------------
/markers/WOK-MRK.wav:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK.wav
--------------------------------------------------------------------------------
/transcribeHallu.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import time
4 | import re
5 | from io import StringIO
6 |
7 | if sys.version_info.major == 3 and sys.version_info.minor >= 10:
8 | print("Python >= 3.10")
9 | import collections.abc
10 | import collections
11 | collections.MutableMapping = collections.abc.MutableMapping
12 | else:
13 | print("Python < 3.10")
14 | import collections
15 |
16 | import traceback
17 |
18 | import torch
19 |
20 | torch.set_num_threads(1)
21 | useSileroVAD=True
22 | if(useSileroVAD):
23 | modelVAD, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
24 | model='silero_vad',
25 | force_reload=False,
26 | onnx=False)
27 | (get_speech_timestamps,
28 | save_audio,
29 | read_audio,
30 | VADIterator,
31 | collect_chunks) = utils
32 |
33 | useSpleeter=False
34 | if(useSpleeter):
35 | from spleeter.audio import STFTBackend
36 | backend = STFTBackend.LIBROSA
37 | from spleeter.separator import Separator
38 | print("Using spleeter:2stems-16kHz")
39 | separator = Separator('spleeter:2stems-16kHz',stft_backend=backend)
40 |
41 | useDemucs=True
42 | if(useDemucs):
43 | from demucsWrapper import load_demucs_model
44 | from demucsWrapper import demucs_audio
45 | print("Using Demucs")
46 | modelDemucs = load_demucs_model()
47 |
48 | useCompressor=True
49 |
50 | try:
51 | #Standard Whisper: https://github.com/openai/whisper
52 | import whisper
53 | print("Using standard Whisper")
54 | whisperFound = "STD"
55 | from pathlib import Path
56 | from whisper.utils import WriteSRT
57 | except ImportError as e:
58 | pass
59 |
60 | try:
61 | #FasterWhisper: https://github.com/guillaumekln/faster-whisper
62 | from faster_whisper import WhisperModel
63 | print("Using Faster Whisper")
64 | whisperFound = "FSTR"
65 | modelPath = "whisper-medium-ct2/"#"whisper-medium-ct2/" "whisper-large-ct2/"
66 | if not os.path.exists(modelPath):
67 | print("Faster installation found, but "+modelPath+" model not found")
68 | sys.exit(-1)
69 | except ImportError as e:
70 | pass
71 |
72 | try:
73 | from seamless_communication.models.inference import Translator
74 | from lang2to3 import lang2to3
75 | lang2to3 = lang2to3()
76 | whisperFound = "SM4T"
77 | except ImportError as e:
78 | pass
79 |
80 | #large-v3 model seems to be bad with music, thus keep v2 as the default
81 | whisperVersion = "-v2" #May be "", "-v1", "-v2", "-v3"
82 | whisperLoaded = "??"
83 | beam_size=2
84 | patience=0
85 | temperature=0
86 | model = None
87 | device = "cuda" #cuda / cpu
88 | cudaIdx = 0
89 |
90 | SAMPLING_RATE = 16000
91 | MAX_DURATION = 600
92 | TRUNC_DURATION = MAX_DURATION
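#Durations are in seconds: inputs longer than maxDuration are rejected,
#and the initial WAV conversion is cut at subEnd (TRUNC_DURATION by default).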
93 |
94 | from threading import Lock, Thread
95 | lock = Lock()
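#This lock serializes access to the shared model: transcribeMARK() holds it around each transcription call.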
96 |
97 | def loadModel(gpu: str,modelSize=None):
98 | global model
99 | global device
100 | global cudaIdx
101 | global whisperLoaded
102 | cudaIdx = gpu
103 | try:
104 | if whisperFound == "FSTR":
105 | if(modelSize == "large"):
106 | modelPath = "whisper-large-ct2/"
107 | else:
108 | modelPath = "whisper-medium-ct2/"
109 | print("LOADING: "+modelPath+" GPU: "+gpu+" BS: "+str(beam_size)+" PTC="+str(patience)+" TEMP="+str(temperature))
110 | compute_type="float16"# float16 int8_float16 int8
111 | model = WhisperModel(modelPath, device=device,device_index=int(gpu), compute_type=compute_type)
112 | elif whisperFound == "STD":
113 | if(modelSize == None):
114 | modelSize="medium"#"tiny"#"medium" #"large"
115 | if(modelSize == "large"):
116 | modelSize = "large"+whisperVersion #"large-v1" "large-v2" "large-v3"
117 | print("LOADING: "+modelSize+" GPU:"+gpu+" BS: "+str(beam_size)+" PTC="+str(patience)+" TEMP="+str(temperature))
118 | model = whisper.load_model(modelSize,device=torch.device("cuda:"+gpu)) #May be "cpu"
119 | elif whisperFound == "SM4T":
120 | print("LOADING: "+"seamlessM4T_large"+" GPU:"+gpu)
121 | model = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:"+gpu), torch.float16)
122 | print("LOADED")
123 | whisperLoaded = modelSize
124 | except Exception as e:
125 | print("Can't load Whisper model: "+whisperFound+"/"+modelSize)
126 | print(e)
127 | sys.exit(-1)
128 |
129 | def loadedModel():
130 | return whisperFound+" "+whisperLoaded
131 |
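#Parse an ffmpeg log to get the input duration in seconds: prefer the "Duration:" header,
#otherwise fall back to a non-zero "time=" progress value; returns None if neither is found.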
132 | def getDuration(aLog:str):
133 | duration = None
134 | time = None
135 | with open(aLog) as f:
136 | lines = f.readlines()
137 | for line in lines:
138 | if(re.match(r"^ *Duration: [0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9], .*$", line, re.IGNORECASE)):
139 | duration = re.sub(r"(.*Duration: *|[,. ].*)", "", line, re.IGNORECASE)
140 | return sum(x * int(t) for x, t in zip([3600, 60, 1], duration.split(":")))
141 | for aSub in line.split("[\r\n]"):
142 | if(re.match(r"^.*time=[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9] .*$", aSub, re.IGNORECASE)):
143 | #print("SUB="+aSub)
144 | time = re.sub(r"(.*time=|[,. ].*)", "", aSub, re.IGNORECASE)
145 | #Return last found time value
146 | if(time != "00:00:00"):
147 | print("TIME="+str(time))
148 | return sum(x * int(t) for x, t in zip([3600, 60, 1], time.split(":")))
149 | return None
150 |
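#Format seconds as an HH:MM:SS.mmm timestamp, e.g. formatTimeStamp(75.5) -> "00:01:15.500"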
151 | def formatTimeStamp(aT=0):
152 | aH = int(aT/3600)
153 | aM = int((aT%3600)/60)
154 | aS = (aT%60)
155 | return "%02d:%02d:%06.3f" % (aH,aM,aS)
156 |
157 | def getPrompt(lng:str):
158 | if(lng == "en"):
159 | aOk=""
160 | return "Whisper, Ok. "\
161 | +"A pertinent sentence for your purpose in your language. "\
162 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\
163 | +"Please find here, an unlikely ordinary sentence. "\
164 | +"This is to avoid a repetition to be deleted. "\
165 | +"Ok, Whisper. "
166 |
167 | if(lng == "fr"):
168 | return "Whisper, Ok. "\
169 | +"Une phrase pertinente pour votre propos dans votre langue. "\
170 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\
171 | +"Merci de trouver ci-joint, une phrase ordinaire improbable. "\
172 | +"Pour éviter une répétition à être supprimée. "\
173 | +"Ok, Whisper. "
174 |
175 | if(lng == "uk"):
176 | return "Whisper, Ok. "\
177 | +"Доречне речення вашою мовою для вашої мети. "\
178 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\
179 | +"Будь ласка, знайдіть тут навряд чи звичайне речення. "\
180 | +"Це зроблено для того, щоб уникнути повторення, яке потрібно видалити. "\
181 | +"Ok, Whisper. "
182 |
183 | if(lng == "hi"):
184 | return "विस्पर, ओके. "\
185 | +"आपकी भाषा में आपके उद्देश्य के लिए एक प्रासंगिक वाक्य। "\
186 | +"ओके, विस्पर. विस्पर, ओके. ओके, विस्पर. विस्पर, ओके. "\
187 | +"कृपया यहां खोजें, एक असंभावित सामान्य वाक्य। "\
188 | +"यह हटाए जाने की पुनरावृत्ति से बचने के लिए है। "\
189 | +"ओके, विस्पर. "
190 |
191 | #No prompt defined yet for this language
192 | return ""
193 |
194 |
195 | def transcribePrompt(path: str,lng: str,prompt=None,lngInput=None,isMusic=False,addSRT=False,truncDuration=TRUNC_DURATION,maxDuration=MAX_DURATION):
196 | """Whisper transcribe."""
197 |
198 | if(lngInput == None):
199 | lngInput=lng
200 | print("Using output language as input language: "+lngInput)
201 |
202 | if(prompt == None):
203 | if(not isMusic):
204 | prompt=getPrompt(lng)
205 | else:
206 | prompt="";
207 |
208 | print("=====transcribePrompt",flush=True)
209 | print("PATH="+path,flush=True)
210 | print("LNGINPUT="+lngInput,flush=True)
211 | print("LNG="+lng,flush=True)
212 | print("PROMPT="+prompt,flush=True)
213 | opts = dict(language=lng,initial_prompt=prompt)
214 | return transcribeOpts(path, opts,lngInput,isMusic=isMusic,addSRT=addSRT,subEnd=truncDuration,maxDuration=maxDuration)
215 |
216 | def transcribeOpts(path: str,opts: dict
217 | ,lngInput=None,isMusic=False,onlySRT=False,addSRT=False
218 | ,subBeg="0",subEnd=str(TRUNC_DURATION)
219 | ,maxDuration=MAX_DURATION
220 | ,stretch=None
221 | ,nbRun=1#Whisper is unstable, especially with music. Multiple runs can provide better results to evaluate afterward
222 | ,remixFactor="0.3",speechnorm=True
223 | ):
224 | pathIn = path
225 | pathClean = path
226 | pathNoCut = path
227 |
228 | initTime = time.time()
229 |
230 | startTime = time.time()
231 | duration = -1
232 | try:
233 | #Convert to WAV to avoid later possible decoding problem
234 | pathWAV = pathIn+".WAV"+".wav"
235 | aCmd = "ffmpeg -y"+" -i \""+pathIn+"\""+" -ss "+subBeg+" -to "+subEnd + " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathWAV+"\" > \""+pathWAV+".log\" 2>&1"
236 | print("CMD: "+aCmd)
237 | os.system(aCmd)
238 | duration = getDuration(pathWAV+".log")
239 | print("T=",(time.time()-startTime))
240 | print("DURATION="+str(duration)+" subBeg="+str(subBeg)+" subEnd="+str(subEnd))
241 | print("PATH="+pathWAV,flush=True)
242 | pathIn = pathClean = pathWAV
243 | except Exception as e:
244 | print("Warning: can't convert to WAV")
245 | print(e)
246 |
247 | try:
248 | if(stretch != None):
249 | pathSTRETCH = pathIn+".STRETCH"+".wav"
250 | #ffmpeg STRETCH
251 | aCmd = "ffmpeg -y -i \""+pathIn+"\""+" -t "+str(subEnd) + " -filter:a \"atempo="+stretch+"\"" + " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathSTRETCH+"\" > \""+pathSTRETCH+".log\" 2>&1"
252 | #sox STRETCH
253 | #aCmd = "sox \""+pathIn+"\""+" \""+pathSTRETCH+"\" tempo "+stretch+" > \""+pathSTRETCH+".log\" 2>&1"
254 | #soundstretch STRETCH
255 | #aCmd = "soundstretch \""+pathIn+"\""+" \""+pathSTRETCH+"\" -tempo="+str(int(100*float(stretch)) - 100)+" > \""+pathSTRETCH+".log\" 2>&1"
256 | #rubberband STRETCH
257 | #aCmd = "rubberband \""+pathIn+"\""+" \""+pathSTRETCH+"\" --tempo "+stretch+" > \""+pathSTRETCH+".log\" 2>&1"
258 | print("CMD: "+aCmd)
259 | os.system(aCmd)
260 | print("T=",(time.time()-startTime))
261 | print("PATH="+pathWAV,flush=True)
262 | pathIn = pathClean = pathWAV = pathSTRETCH
263 | except Exception as e:
264 | print("Warning: can't STRETCH")
265 | print(e)
266 |
267 | startTime = time.time()
268 | try:
269 | #Check for duration
270 | aCmd = "ffmpeg -y -i \""+pathIn+"\" "+ " -f null - > \""+pathIn+".dur\" 2>&1"
271 | print("CMD: "+aCmd)
272 | os.system(aCmd)
273 | print("T=",(time.time()-startTime))
274 | duration = getDuration(pathIn+".dur")
275 | print("DURATION="+str(duration)+" max "+str(maxDuration))
276 | if(duration > maxDuration):
277 | return "[Too long ("+str(duration)+"s)]"
278 | except Exception as e:
279 | print("Warning: can't analyze duration")
280 | print(e)
281 |
282 | try:
283 | if(useSpleeter):
284 | startTime = time.time()
285 | spleeterDir=pathIn+".spleeter"
286 | if(not os.path.exists(spleeterDir)):
287 | os.mkdir(spleeterDir)
288 | pathSpleeter=spleeterDir+"/"+os.path.splitext(os.path.basename(pathIn))[0]+"/vocals.wav"
289 | separator.separate_to_file(pathIn, spleeterDir)
290 | print("T=",(time.time()-startTime))
291 | print("PATH="+pathSpleeter,flush=True)
292 | pathNoCut = pathIn = pathSpleeter
293 | except Exception as e:
294 | print("Warning: can't split vocals")
295 | print(e)
296 |
297 | if(useDemucs):
298 | startTime = time.time()
299 | try:
300 | #demucsDir=pathIn+".demucs"
301 | #if(not os.path.exists(demucsDir)):
302 | # os.mkdir(demucsDir)
303 | pathDemucsVocals=pathIn+".vocals.wav" #demucsDir+"/htdemucs/"+os.path.splitext(os.path.basename(pathIn))[0]+"/vocals.wav"
304 | pathDemucsDrums=pathIn+".drums.wav"
305 | pathDemucsBass=pathIn+".bass.wav"
306 | pathDemucsOther=pathIn+".other.wav"
307 | #Demucs CLI command kept below for reference; the demucsWrapper call is used instead
308 | #aCmd = "python -m demucs --two-stems=vocals -d "+device+":"+cudaIdx+" --out "+demucsDir+" "+pathIn
309 | #print("CMD: "+aCmd)
310 | #os.system(aCmd)
311 | demucs_audio(pathIn=pathIn,model=modelDemucs,device="cuda:"+cudaIdx,pathVocals=pathDemucsVocals,pathOther=pathIn+".other.wav")
312 | print("T=",(time.time()-startTime))
313 | print("PATH="+pathDemucsVocals,flush=True)
314 | pathNoCut = pathIn = pathDemucsVocals
315 | except Exception as e:
316 | print("Warning: can't split vocals")
317 | print(e)
318 |
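#Silence removal and loudness normalization (ffmpeg silenceremove + loudnorm)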
319 | startTime = time.time()
320 | try:
321 | pathSILCUT = pathIn+".SILCUT"+".wav"
322 | aCmd = "ffmpeg -y -i \""+pathIn+"\" -af \"silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=0.2:stop_silence=0.2, loudnorm\" "+ " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathSILCUT+"\" > \""+pathSILCUT+".log\" 2>&1"
323 | print("CMD: "+aCmd)
324 | os.system(aCmd)
325 | print("T=",(time.time()-startTime))
326 | print("PATH="+pathSILCUT,flush=True)
327 | pathIn = pathSILCUT
328 | except Exception as e:
329 | print("Warning: can't filter blanks")
330 | print(e)
331 |
332 | try:
333 | if(not isMusic and useSileroVAD):
334 | startTime = time.time()
335 |
336 | pathVAD = pathIn+".VAD.wav"
337 | wav = read_audio(pathIn, sampling_rate=SAMPLING_RATE)
338 | #https://github.com/snakers4/silero-vad/blob/master/utils_vad.py#L161
339 | speech_timestamps = get_speech_timestamps(wav, modelVAD,threshold=0.5,min_silence_duration_ms=500, sampling_rate=SAMPLING_RATE)
340 | save_audio(pathVAD,collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE)
341 | print("T=",(time.time()-startTime))
342 | print("PATH="+pathVAD,flush=True)
343 | pathIn = pathVAD
344 | except Exception as e:
345 | print("Warning: can't filter noises")
346 | print(e)
347 |
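#Vocals remix: remixFactor >= 1 keeps the original mix, <= 0 keeps the extracted vocals only,
#otherwise the (optionally speech-normalized) vocals are remixed with the other stems weighted by remixFactor.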
348 | try:
349 | if(float(remixFactor) >= 1):
350 | pathREMIXN = pathClean
351 | elif (float(remixFactor) <= 0 and useDemucs):
352 | pathREMIXN = pathDemucsVocals;
353 | elif (isMusic and useDemucs):
354 | startTime = time.time()
355 |
356 | if(speechnorm):
357 | pathNORM = pathDemucsVocals+".NORM.wav"
358 | aCmd = ("ffmpeg -y -i \""+pathDemucsVocals+"\""
359 | #+ " -filter:a loudnorm"
360 | +" -af \"speechnorm=e=50:r=0.0005:l=1\""
361 | +" \""+pathNORM+"\" > \""+pathNORM+".log\" 2>&1")
362 | print("CMD: "+aCmd)
363 | os.system(aCmd)
364 | print("T=",(time.time()-startTime))
365 | print("PATH="+pathNORM,flush=True)
366 | else:
367 | pathNORM = pathDemucsVocals
368 |
369 | pathREMIXN = pathNORM+".REMIX.wav"
370 | aCmd = ("ffmpeg -y -i \""+pathNORM+"\" -i \""+pathDemucsDrums+"\" -i \""+pathDemucsBass+"\" -i \""+pathDemucsOther+"\""
371 | +" -filter_complex amix=inputs=4:duration=longest:dropout_transition=0:weights=\"1 "+remixFactor+" "+remixFactor+" "+remixFactor+"\""
372 | +" \""+pathREMIXN+"\" > \""+pathREMIXN+".log\" 2>&1")
373 | print("CMD: "+aCmd)
374 | os.system(aCmd)
375 | print("T=",(time.time()-startTime))
376 | print("PATH="+pathREMIXN,flush=True)
377 | except Exception as e:
378 | print("Warning: can't remix")
379 | print(e)
380 |
381 | mode=1
382 | if(duration == None or duration > 30):
383 | print("NOT USING MARKS FOR DURATION > 30s")
384 | mode=0
385 |
386 | startTime = time.time()
387 | if(onlySRT):
388 | result = {}
389 | result["text"] = ""
390 | else:
391 | result = transcribeMARK(pathIn, opts, mode=mode,lngInput=lngInput,isMusic=isMusic)
392 | if len(result["text"]) <= 0:
393 | result["text"] = "--"
394 |
395 | if(onlySRT or addSRT):
396 | #Better timestamps using original music clip
397 | if(isMusic
398 | #V3 is very bad with music!?
399 | and not whisperVersion == "-v3"
400 | ):
401 | if(pathREMIXN is not None):
402 | resultSRT = transcribeMARK(pathREMIXN, opts, mode=3,lngInput=lngInput,isMusic=isMusic
403 | ,nbRun=nbRun)
404 | else:
405 | resultSRT = transcribeMARK(pathClean, opts, mode=3,lngInput=lngInput,isMusic=isMusic
406 | ,nbRun=nbRun)
407 | else:
408 | resultSRT = transcribeMARK(pathNoCut, opts, mode=3,lngInput=lngInput,isMusic=isMusic
409 | ,nbRun=nbRun)
410 |
411 | result["text"] += resultSRT["text"]
412 |
413 | print("T=",(time.time()-initTime))
414 | if(len(result["text"]) > 0):
415 | print("s/c=",(time.time()-initTime)/len(result["text"]))
416 | print("c/s=",len(result["text"])/(time.time()-initTime))
417 |
418 | return result["text"]
419 |
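#transcribeMARK() modes:
# 0: transcribe as-is, without voice markers
# 1: prepend WOK-MRK and append OKW-MRK, then check that both markers survive in the output
# 2: same as 1 with the two markers swapped (fallback when mode 1 fails)
# 3: timestamped/SRT output, no marker logic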
420 | def transcribeMARK(path: str,opts: dict,mode = 1,lngInput=None,aLast=None,isMusic=False
421 | #Whisper is unstable, especially with music. Multiple runs can provide better results to evaluate afterward
422 | ,nbRun=1):
423 | print("transcribeMARK(): "+path)
424 | pathIn = path
425 |
426 | lng = opts["language"]
427 |
428 | if(lngInput == None):
429 | lngInput = lng
430 |
431 | noMarkRE = "^(ar|he|ru|zh)$"
432 | if(lng != None and re.match(noMarkRE,lng) and mode != 3):
433 | #These languages would need special voice markers: disable markers
434 | mode = 0
435 |
436 | if(isMusic and mode != 3):
437 | #Markers are not really interesting with music
438 | mode = 0
439 |
440 | if(whisperFound == "SM4T"):
441 | #No markers with SM4T
442 | mode = 0
443 |
444 | if os.path.exists("markers/WOK-MRK-"+lngInput+".wav"):
445 | mark1="markers/WOK-MRK-"+lngInput+".wav"
446 | else:
447 | mark1="markers/WOK-MRK.wav"
448 | if os.path.exists("markers/OKW-MRK-"+lngInput+".wav"):
449 | mark2="markers/OKW-MRK-"+lngInput+".wav"
450 | else:
451 | mark2="markers/OKW-MRK.wav"
452 |
453 | if(mode == 2):
454 | mark = mark1
455 | mark1 = mark2
456 | mark2 = mark
457 |
458 | if(mode == 0):
459 | print("["+str(mode)+"] PATH="+pathIn,flush=True)
460 | else:
461 | try:
462 | if(mode != 3):
463 | startTime = time.time()
464 | pathMRK = pathIn+".MRK"+".wav"
465 | aCmd = "ffmpeg -y -i "+mark1+" -i \""+pathIn+"\" -i "+mark2+" -filter_complex \"[0:a][1:a][2:a]concat=n=3:v=0:a=1[a]\" -map \"[a]\" -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathMRK+"\" > \""+pathMRK+".log\" 2>&1"
466 | print("CMD: "+aCmd)
467 | os.system(aCmd)
468 | print("T=",(time.time()-startTime))
469 | print("["+str(mode)+"] PATH="+pathMRK,flush=True)
470 | pathIn = pathMRK
471 |
472 | if(useCompressor
473 | and not isMusic
474 | ):
475 | startTime = time.time()
476 | pathCPS = pathIn+".CPS"+".wav"
477 | aCmd = "ffmpeg -y -i \""+pathIn+"\" -af \"speechnorm=e=50:r=0.0005:l=1\" "+ " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathCPS+"\" > \""+pathCPS+".log\" 2>&1"
478 | print("CMD: "+aCmd)
479 | os.system(aCmd)
480 | print("T=",(time.time()-startTime))
481 | print("["+str(mode)+"] PATH="+pathCPS,flush=True)
482 | pathIn = pathCPS
483 | except Exception as e:
484 | print("Warning: can't add markers")
485 | print(e)
486 |
487 | startTime = time.time()
488 | lock.acquire()
489 | try:
490 | transcribe_options = dict(**opts)#avoid adding the beam_size opt several times
491 | if beam_size > 1:
492 | transcribe_options["beam_size"] = beam_size
493 | if patience > 0:
494 | transcribe_options["patience"] = patience
495 | if temperature > 0:
496 | transcribe_options["temperature"] = temperature
497 |
498 | if whisperFound == "FSTR":
499 | result = {}
500 | result["text"] = ""
501 | multiRes = ""
502 | for r in range(nbRun):
503 | print("RUN: "+str(r))
504 | segments, info = model.transcribe(pathIn,**transcribe_options)
505 | resSegs = []
506 | if(mode == 3):
507 | aSegCount = 0
508 | for segment in segments:
509 | if("word_timestamps" in transcribe_options):
510 | for word in segment.words:
511 | aSegCount += 1
512 | resSegs.append("\n"+str(aSegCount)+"\n"+formatTimeStamp(word.start)+" --> "+formatTimeStamp(word.end)+"\n"+word.word.strip()+"\n")
513 | else:
514 | aSegCount += 1
515 | resSegs.append("\n"+str(aSegCount)+"\n"+formatTimeStamp(segment.start)+" --> "+formatTimeStamp(segment.end)+"\n"+segment.text.strip()+"\n")
516 | else:
517 | for segment in segments:
518 | resSegs.append(segment.text)
519 |
520 | result["text"] = "".join(resSegs)
521 | if(r > 0):
522 | multiRes += "=====\n"
523 | multiRes += result["text"]
524 |
525 | if(nbRun > 1):
526 | result["text"] = multiRes
527 | elif whisperFound == "SM4T":
528 | src_lang = lang2to3[lngInput];
529 | tgt_lang = lang2to3[lng];
530 | # S2TT
531 | #translated_text, _, _ = translator.predict(, "s2tt", )
532 | translated_text, _, _ = model.predict(pathIn, "s2tt", tgt_lang)
533 | result = {}
534 | result["text"] = str(translated_text)
535 | else:
536 | transcribe_options = dict(task="transcribe", **transcribe_options)
537 | multiRes = ""
538 | for r in range(nbRun):
539 | print("RUN: "+str(r))
540 | result = model.transcribe(pathIn,**transcribe_options)
541 | if(mode == 3):
542 | p = Path(pathIn)
543 | writer = WriteSRT(p.parent)
544 | srtOpts = { "max_line_width" : 80, "max_line_count" : 2, "highlight_words" : False}
545 | if("word_timestamps" in transcribe_options and transcribe_options["word_timestamps"]):
546 | srtOpts = { "max_line_width" : 30, "max_line_count" : 1, "highlight_words" : transcribe_options["word_timestamps"]}
547 | writer(result, pathIn,srtOpts)
548 | audio_basename = os.path.basename(pathIn)
549 | audio_basename = os.path.splitext(audio_basename)[0]
550 | output_path = os.path.join(
551 | p.parent, audio_basename + ".srt"
552 | )
553 | with open(output_path) as f:
554 | result["text"] = f.read()
555 |
556 | if("word_timestamps" in transcribe_options and transcribe_options["word_timestamps"]):
557 | result["text"] = re.sub("(\n[^<\n]*|[^<\n]*\n)"#Remove lines without highlighted words
558 | ,"\n",re.sub(r"\n[^<\n]*\n\n","\n\n"#Keep only highlighted words
559 | ,result["text"]))
560 | if(r > 0):
561 | multiRes += "=====\n"
562 | multiRes += result["text"]
563 | if(nbRun > 1):
564 | result["text"] = multiRes
565 |
566 | print("T=",(time.time()-startTime))
567 | print("TRANS="+result["text"],flush=True)
568 | except Exception as e:
569 | print(e)
570 | traceback.print_exc()
571 | lock.release()
572 | result = {}
573 | result["text"] = ""
574 | return result
575 |
576 | lock.release()
577 |
578 | if(mode == 0 or mode == 3):
579 | return result
580 | #Too restrictive
581 | #if(result["text"] == aLast):
582 | # #Only if confirmed
583 | # return result
584 | #result["text"] = ""
585 | #return result
586 |
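#Transliterations/variants of "Whisper" and "Ok" that the model may emit in the supported marker languages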
587 | aWhisper="(Whisper|Wisper|Wyspę|Wysper|Wispa|Уіспер|Ου ίσπερ|위스퍼드|ウィスパー|विस्पर|विसपर)"
588 | aOk="(o[.]?k[.]?|okay|oké|okej|Окей|οκέι|오케이|オーケー|ओके)"
589 | aSep="[.,!? ]*"
590 | if(mode == 1):
591 | aCleaned = re.sub(r"(^ *"+aWhisper+aSep+aOk+aSep+"|"+aOk+aSep+aWhisper+aSep+" *$)", "", result["text"], 2, re.IGNORECASE)
592 | if(re.match(r"^ *("+aOk+"|"+aSep+"|"+aWhisper+")*"+aWhisper+"("+aOk+"|"+aSep+"|"+aWhisper+")* *$", result["text"], re.IGNORECASE)):
593 | #Empty sound ?
594 | return transcribeMARK(path, opts, mode=2,lngInput=lngInput,aLast="")
595 |
596 | if(re.match(r"^ *"+aWhisper+aSep+aOk+aSep+".*"+aOk+aSep+aWhisper+aSep+" *$", result["text"], re.IGNORECASE)):
597 | #GOOD!
598 | result["text"] = aCleaned
599 | return result
600 |
601 | return transcribeMARK(path, opts, mode=2,lngInput=lngInput,aLast=aCleaned)
602 |
603 | if(mode == 2):
604 | aCleaned = re.sub(r"(^ *"+aOk+aSep+aWhisper+aSep+"|"+aWhisper+aSep+aOk+aSep+" *$)", "", result["text"], 2, re.IGNORECASE)
605 | if(aCleaned == aLast):
606 | #CONFIRMED!
607 | result["text"] = aCleaned
608 | return result
609 |
610 | if(re.match(r"^ *("+aOk+"|"+aSep+"|"+aWhisper+")*"+aWhisper+"("+aOk+"|"+aSep+"|"+aWhisper+")* *$", result["text"], re.IGNORECASE)):
611 | #Empty sound ?
612 | result["text"] = ""
613 | return result
614 |
615 | if(re.match(r"^ *"+aOk+aSep+aWhisper+aSep+".*"+aWhisper+aSep+aOk+aSep+" *$", result["text"], re.IGNORECASE)):
616 | #GOOD!
617 | result["text"] = aCleaned
618 | return result
619 |
620 | return transcribeMARK(path, opts, mode=0,lngInput=lngInput,aLast=aCleaned)
621 |
622 |
--------------------------------------------------------------------------------