├── README.md
├── data
│   ├── KatyPerry-Firework.mp3
│   └── silence.webm
├── demucsWrapper.py
├── lang2to3.py
├── markers
│   ├── OKW-MRK-da.wav
│   ├── OKW-MRK-de.wav
│   ├── OKW-MRK-el.wav
│   ├── OKW-MRK-en.wav
│   ├── OKW-MRK-es.wav
│   ├── OKW-MRK-fr.wav
│   ├── OKW-MRK-hi.wav
│   ├── OKW-MRK-hu.wav
│   ├── OKW-MRK-it.wav
│   ├── OKW-MRK-ja.wav
│   ├── OKW-MRK-ko.wav
│   ├── OKW-MRK-nl.wav
│   ├── OKW-MRK-pl.wav
│   ├── OKW-MRK-pt.wav
│   ├── OKW-MRK-sv.wav
│   ├── OKW-MRK-uk.wav
│   ├── OKW-MRK.wav
│   ├── WOK-MRK-da.wav
│   ├── WOK-MRK-de.wav
│   ├── WOK-MRK-el.wav
│   ├── WOK-MRK-en.wav
│   ├── WOK-MRK-es.wav
│   ├── WOK-MRK-fr.wav
│   ├── WOK-MRK-hi.wav
│   ├── WOK-MRK-hu.wav
│   ├── WOK-MRK-it.wav
│   ├── WOK-MRK-ja.wav
│   ├── WOK-MRK-ko.wav
│   ├── WOK-MRK-nl.wav
│   ├── WOK-MRK-pl.wav
│   ├── WOK-MRK-pt.wav
│   ├── WOK-MRK-sv.wav
│   ├── WOK-MRK-uk.wav
│   └── WOK-MRK.wav
└── transcribeHallu.py

/README.md:
--------------------------------------------------------------------------------
1 | # WhisperHallu
2 | Experimental code: sound file preprocessing to optimize Whisper transcriptions and avoid hallucinated text
3 | 
4 | See this discussion: https://github.com/openai/whisper/discussions/679
5 | 
6 | # Main algo
7 | - remove noise by voice extraction using [Facebook Demucs](https://github.com/facebookresearch/demucs) or [Deezer Spleeter](https://github.com/deezer/spleeter).
8 | - remove silences and normalize loudness with ffmpeg.
9 | - remove noise parts using [Silero VAD](https://github.com/snakers4/silero-vad).
10 | - add voice markers.
11 | - apply a speech compressor (requires `ffmpeg` >= 4.4; Google Colab ships 4.2, so it has to be upgraded, see below).
12 | - try to transcribe. If the markers are present in the output, the transcription is OK.
13 | - if not, try with inverted markers. If the markers are present in the output, the transcription is OK.
14 | - if not, try without markers (a simplified sketch of this marker check is given below, after the Code section).
15 | 
16 | # Processing options and parameters
17 | - use Whisper V1, V2 or V3 (V2 by default, because V3 seems bad with music).
18 | - beam_size (2 by default), patience, temperature.
19 | - process only a subpart of the input file (needs post-processing of the timestamp values).
20 | - various time-stretching methods tested (see in-code comments; needs post-processing of the timestamp values; an interesting suggestion, but no real gain obtained on my side).
21 | - vocals remix (with or without speech normalization).
22 | - multiple final transcriptions (get multiple results without redoing the pre-processing, since Whisper is not stable from one run to another).
23 | 
24 | # Complement
25 | 
26 | May be used to produce "accurate transcriptions" for WhisperTimeSync:
27 | https://github.com/EtienneAb3d/WhisperTimeSync 28 | 29 | May be tested using NeuroSpell Dictaphone:
30 | https://neurospell.com/ 31 | 32 | WhisperHallu and WhisperTimeSync are used to extract vocals and lyrics in karaok-AI:
33 | https://github.com/EtienneAb3d/karaok-AI
34 | 
35 | ChatMate is a complete and versatile ChatGPT automation tool, including explanations on how to build an SRT file translator to Chinese (as an example):
36 | https://github.com/EtienneAb3d/ChatMate 37 | 38 | # Google Colab 39 | 40 | Standard Whisper:
41 | https://colab.research.google.com/drive/1-GpXaNaGFXKX9VXl60JGVVrGO41t09KA?usp=sharing 42 | 43 | Faster Whisper:
44 | https://colab.research.google.com/drive/1RkvOtUTbUD5NVsRI4aKEqJO8BRo8BFIY?usp=sharing
45 | 
46 | # Install
47 | 
48 | **Check ffmpeg version >= 4.4**
49 | ```sh
50 | ffmpeg -version
51 | 
52 | Output should be:
53 | =================
54 | ffmpeg version 4.4.3-0ubuntu1~20.04.sav2 Copyright (c) 2000-2022 the FFmpeg developers
55 | [...]
56 | 
57 | Install latest:
58 | ===============
59 | sudo add-apt-repository -y ppa:savoury1/ffmpeg4
60 | sudo apt-get -qq install -y ffmpeg
61 | 
62 | ```
63 | 
64 | **Demucs (if used)**
65 | 
66 | ```sh
67 | pip install -U demucs
68 | ```
69 | 
70 | **Spleeter (if used)**
71 | 
72 | ```sh
73 | pip install spleeter
74 | ```
75 | 
76 | **Standard Whisper (if used)**
77 | 
78 | ```sh
79 | sudo apt update && sudo apt install ffmpeg
80 | 
81 | sudo apt install python3
82 | sudo apt install python3-pip
83 | sudo apt install virtualenv
84 | 
85 | virtualenv -p python3 ../venvWhisper
86 | . ../venvWhisper/bin/activate
87 | 
88 | pip install -U openai-whisper
89 | 
90 | pip3 install torchaudio
91 | ```
92 | 
93 | **Faster Whisper (if used in place of Whisper)**
94 | 
95 | ```sh
96 | sudo apt update && sudo apt install ffmpeg
97 | 
98 | sudo apt install python3
99 | sudo apt install python3-pip
100 | sudo apt install virtualenv
101 | 
102 | virtualenv -p python3 ../venvFasterWhisper
103 | . ../venvFasterWhisper/bin/activate
104 | 
105 | git clone https://github.com/guillaumekln/faster-whisper.git
106 | cd faster-whisper/
107 | 
108 | pip install -e .[conversion]
109 | pip install -e .
110 | 
111 | cd ..
112 | 
113 | ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --quantization float16
114 | ct2-transformers-converter --model openai/whisper-large --output_dir whisper-large-ct2 --quantization float16
115 | 
116 | pip3 install torchaudio
117 | ```
118 | 
119 | **SM4T (if used in place of Whisper)**
120 | 
121 | ```sh
122 | sudo apt update && sudo apt install ffmpeg
123 | 
124 | sudo apt install python3
125 | sudo apt install python3-pip
126 | sudo apt install virtualenv
127 | 
128 | virtualenv -p python3 ../venvSM4T
129 | . ../venvSM4T/bin/activate
130 | 
131 | git clone https://github.com/facebookresearch/seamless_communication.git
132 | cd seamless_communication/
133 | 
134 | pip install --upgrade pip
135 | pip install .
136 | 
137 | m4t_predict "On ne fait pas d'omelette sans casser des oeufs." t2tt eng --src_lang fra
138 | 
139 | pip3 install torchaudio
140 | ```
141 | 
142 | # Code
143 | 
144 | ```python
145 | from transcribeHallu import loadModel
146 | from transcribeHallu import transcribePrompt
147 | 
148 | ##### The audio language may be different from the one for the output transcription.
149 | path="/path/to/your/en/sound/file"
150 | lngInput="en"
151 | 
152 | ##### Activate this for music files, to get minimal processing
153 | isMusic=False
154 | 
155 | ##### Needs to be adapted for each language.
156 | ##### For prompt examples, see transcribeHallu.py getPrompt(lng:str)
157 | lng="en"
158 | prompt= "Whisper, Ok. "\
159 | +"A pertinent sentence for your purpose in your language. "\
160 | +"Ok, Whisper. Whisper, Ok. "\
161 | +"Ok, Whisper. Whisper, Ok. "\
162 | +"Please find here, an unlikely ordinary sentence. "\
163 | +"This is to avoid a repetition to be deleted. "\
164 | +"Ok, Whisper. "
165 | 
166 | ##### Model size to use
167 | modelSize="medium"
168 | loadModel("0",modelSize=modelSize)
169 | 
170 | result = transcribePrompt(path=path, lng=lng, prompt=prompt, lngInput=lngInput,isMusic=isMusic)
171 | ```
172 | 
173 | 
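A complementary usage sketch (not part of the original example): based on the `transcribePrompt` signature in `transcribeHallu.py`, a music file can be processed with `isMusic=True` (minimal processing, no voice markers, empty prompt by default) and `addSRT=True` to append SRT-formatted, timestamped segments after the raw text. Paths and language codes below are placeholders.

```python
from transcribeHallu import loadModel
from transcribeHallu import transcribePrompt

##### Hypothetical music example: vocals are extracted (Demucs by default),
##### markers are skipped for music, and an SRT block is appended to the result.
path = "/path/to/your/song.mp3"   # placeholder
lngInput = "en"                   # language sung in the audio
lng = "en"                        # language wanted for the transcription

loadModel("0", modelSize="medium")

result = transcribePrompt(path=path, lng=lng, lngInput=lngInput,
                          isMusic=True, addSRT=True)
print(result)
```

See `transcribeOpts()` in `transcribeHallu.py` for further options (`nbRun`, `remixFactor`, `subBeg`/`subEnd`, ...).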
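About the marker check mentioned in "Main algo": the snippet below is a simplified, self-contained illustration written for this README, not the project's actual implementation (the real logic is in `transcribeMARK()` inside `transcribeHallu.py`, which also handles inverted markers, several languages, and music mode). It assumes the English markers "Whisper, Ok." / "Ok, Whisper.".

```python
import re

# Simplified marker patterns (assumption: English markers only).
MARK_HEAD = r"Whisper[,! ]*Ok[.,!? ]*"
MARK_TAIL = r"Ok[,! ]*Whisper[.,!? ]*"

def check_markers(text: str):
    """Return the cleaned transcription if both markers survived, else None.

    The marker audio is concatenated before and after the speech. When Whisper
    hallucinates (or gets empty audio), the markers usually vanish from the
    output, so their absence is used as a failure signal: the caller can then
    retry with inverted markers, and finally without markers.
    """
    if re.match(r"^ *" + MARK_HEAD + ".*" + MARK_TAIL + " *$",
                text, re.IGNORECASE | re.DOTALL):
        # Both markers present: strip them and keep the middle part.
        cleaned = re.sub(r"^ *" + MARK_HEAD, "", text, flags=re.IGNORECASE)
        cleaned = re.sub(MARK_TAIL + " *$", "", cleaned, flags=re.IGNORECASE)
        return cleaned.strip()
    return None  # markers lost: output is suspicious, retry another way
```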
174 | This tool is a demonstration of our know-how.
175 | If you are interested in a commercial/industrial AI linguistic project, contact us:
176 | https://cubaix.com 177 | -------------------------------------------------------------------------------- /data/KatyPerry-Firework.mp3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/data/KatyPerry-Firework.mp3 -------------------------------------------------------------------------------- /data/silence.webm: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/data/silence.webm -------------------------------------------------------------------------------- /demucsWrapper.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torchaudio 3 | import demucs 4 | from demucs.pretrained import get_model_from_args 5 | from demucs.apply import apply_model 6 | from demucs.separate import load_track 7 | from torch._C import device 8 | import re 9 | 10 | def load_demucs_model(): 11 | return get_model_from_args(type('args', (object,), dict(name='htdemucs', repo=None))).cpu().eval() 12 | 13 | 14 | def demucs_audio(pathIn: str, 15 | model=None, 16 | device=None, 17 | pathVocals: str = None, 18 | pathOther: str = None): 19 | if model is None: 20 | model = load_demucs_model() 21 | 22 | audio = load_track(pathIn, model.audio_channels, model.samplerate) 23 | 24 | audio_dims = audio.dim() 25 | if audio_dims == 1: 26 | audio = audio[None, None].repeat_interleave(2, -2) 27 | else: 28 | if audio.shape[-2] == 1: 29 | audio = audio.repeat_interleave(2, -2) 30 | if audio_dims < 3: 31 | audio = audio[None] 32 | 33 | if device is None: 34 | device = "cuda" if torch.cuda.is_available() else "cpu" 35 | print("Demucs using device: "+device) 36 | result = apply_model(model, audio, device=device, split=True, overlap=.25) 37 | if device != 'cpu': 38 | torch.cuda.empty_cache() 39 | 40 | for name in model.sources: 41 | print("Source: "+name) 42 | if name == "vocals": 43 | source_idx=model.sources.index(name) 44 | source=result[0, source_idx].mean(0) 45 | torchaudio.save(re.sub(r'[.](mp3|wav)$',"."+name+".wav",pathIn), source[None], model.samplerate) 46 | 47 | 48 | -------------------------------------------------------------------------------- /lang2to3.py: -------------------------------------------------------------------------------- 1 | def lang2to3(): 2 | return { 3 | "aa": "aar"# Afar 4 | ,"ab": "abk"# Abkhazian 5 | ,"ae": "ave"# Avestan 6 | ,"af": "afr"# Afrikaans 7 | ,"ak": "aka"# Akan 8 | ,"am": "amh"# Amharic 9 | ,"an": "arg"# Aragonese 10 | ,"ar": "arb"# Modern Standard Arabic 11 | ,"as": "asm"# Assamese 12 | ,"av": "ava"# Avaric 13 | ,"ay": "aym"# Aymara 14 | ,"az": "aze"# Azerbaijani 15 | ,"ba": "bak"# Bashkir 16 | ,"be": "bel"# Belarusian 17 | ,"bg": "bul"# Bulgarian 18 | ,"bh": "bih"# maïthil 19 | ,"bi": "bis"# Bislama 20 | ,"bm": "bam"# Bambara 21 | ,"bn": "ben"# Bengali 22 | ,"bo": "bod"# Tibetan 23 | ,"br": "bre"# Breton 24 | ,"bs": "bos"# Bosnian 25 | ,"ca": "cat"# Catalan 26 | ,"ce": "che"# Chechen 27 | ,"ch": "cha"# Chamorro 28 | ,"co": "cos"# Corsican 29 | ,"cr": "cre"# Cree 30 | ,"cs": "ces"# Czech 31 | ,"cu": "chu"# Slavonic 32 | ,"cv": "chv"# Chuvash 33 | ,"cy": "cym"# Welsh 34 | ,"da": "dan"# Danish 35 | ,"de": "deu"# German 36 | ,"dv": "div"# Divehi 37 | ,"dz": "dzo"# Dzongkha 38 | ,"ee": "ewe"# Ewe 39 | ,"el": "ell"# Greek 40 | ,"en": "eng"# English 41 | ,"eo": "epo"# 
Esperanto 42 | ,"es": "spa"# Spanish 43 | ,"et": "est"# Estonian 44 | ,"eu": "eus"# Basque 45 | ,"fa": "fas"# Persian 46 | ,"ff": "ful"# Fulah 47 | ,"fi": "fin"# Finnish 48 | ,"fj": "fij"# Fijian 49 | ,"fo": "fao"# Faroese 50 | ,"fr": "fra"# French 51 | ,"fy": "fry"# Frisian 52 | ,"ga": "gle"# Irish 53 | ,"gd": "gla"# Gaelic 54 | ,"gl": "glg"# Galician 55 | ,"gn": "grn"# Guarani 56 | ,"gu": "guj"# Gujarati 57 | ,"gv": "glv"# Manx 58 | ,"ha": "hau"# Hausa 59 | ,"he": "heb"# Hebrew 60 | ,"hi": "hin"# Hindi 61 | ,"ho": "hmo"# Motu 62 | ,"hr": "hrv"# Croatian 63 | ,"ht": "hat"# Haitian 64 | ,"hu": "hun"# Hungarian 65 | ,"hy": "hye"# Armenian 66 | ,"hz": "her"# Herero 67 | ,"ia": "ina"# Interlingua 68 | ,"id": "ind"# Indonesian 69 | ,"ie": "ile"# Interlingue 70 | ,"ig": "ibo"# Igbo 71 | ,"ii": "iii"# Yi 72 | ,"ik": "ipk"# Inupiaq 73 | ,"io": "ido"# Ido 74 | ,"is": "isl"# Icelandic 75 | ,"it": "ita"# Italian 76 | ,"iu": "iku"# Inuktitut 77 | ,"ja": "jpn"# Japanese 78 | ,"jv": "jav"# Javanese 79 | ,"ka": "kat"# Georgian 80 | ,"kg": "kon"# Kongo 81 | ,"ki": "kik"# Kikuyu 82 | ,"kj": "kua"# Kwanyama 83 | ,"kk": "kaz"# Kazakh 84 | ,"kl": "kal"# Greenlandic 85 | ,"km": "khm"# Khmer 86 | ,"kn": "kan"# Kannada 87 | ,"ko": "kor"# Korean 88 | ,"kr": "kau"# Kanuri 89 | ,"ks": "kas"# Kashmiri 90 | ,"ku": "kur"# Kurdish 91 | ,"kv": "kom"# Komi 92 | ,"kw": "cor"# Cornish 93 | ,"ky": "kir"# Kirghiz 94 | ,"la": "lat"# Latin 95 | ,"lb": "ltz"# Luxembourgish 96 | ,"lg": "lug"# Ganda 97 | ,"li": "lim"# Limburgish 98 | ,"ln": "lin"# Lingala 99 | ,"lo": "lao"# Lao 100 | ,"lt": "lit"# Lithuanian 101 | ,"lu": "lub"# Luba 102 | ,"lv": "lav"# Latvian 103 | ,"mg": "mlg"# Malagasy 104 | ,"mh": "mah"# Marshallese 105 | ,"mi": "mri"# Maori 106 | ,"mk": "mkd"# Macedonian 107 | ,"ml": "mal"# Malayalam 108 | ,"mn": "mon"# Mongolian 109 | ,"mo": "mol"# roumain1 110 | ,"mr": "mar"# Marathi 111 | ,"ms": "msa"# Malay 112 | ,"mt": "mlt"# Maltese 113 | ,"my": "mya"# Burmese 114 | ,"na": "nau"# Nauru 115 | ,"nb": "nob"# Bokmål 116 | ,"nd": "nde"# Ndebele 117 | ,"ne": "nep"# Nepali 118 | ,"ng": "ndo"# Ndonga 119 | ,"nl": "nld"# Dutch 120 | ,"nn": "nno"# Nynorsk 121 | ,"no": "nob"# Norwegian Bokmål 122 | ,"nr": "nbl"# Ndebele 123 | ,"nv": "nav"# Navajo 124 | ,"ny": "nya"# Chichewa 125 | ,"oc": "oci"# Occitan 126 | ,"oj": "oji"# Ojibwa 127 | ,"om": "orm"# Oromo 128 | ,"or": "ori"# Oriya 129 | ,"os": "oss"# Ossetian 130 | ,"pa": "pan"# Panjabi 131 | ,"pi": "pli"# Pāli 132 | ,"pl": "pol"# Polish 133 | ,"ps": "pus"# Pashto 134 | ,"pt": "por"# Portuguese 135 | ,"qu": "que"# Quechua 136 | ,"rm": "roh"# Romansh 137 | ,"rn": "run"# Kirundi 138 | ,"ro": "ron"# Romanian 139 | ,"ru": "rus"# Russian 140 | ,"rw": "kin"# Kinyarwanda 141 | ,"sa": "san"# Sanskrit 142 | ,"sc": "srd"# Sardinian 143 | ,"sd": "snd"# Sindhi 144 | ,"se": "sme"# Sami 145 | ,"sg": "sag"# Sango 146 | ,"sh": "hbs"# Croatian 147 | ,"si": "sin"# Sinhalese 148 | ,"sk": "slk"# Slovak 149 | ,"sl": "slv"# Slovenian 150 | ,"sm": "smo"# Samoan 151 | ,"sn": "sna"# Shona 152 | ,"so": "som"# Somali 153 | ,"sq": "sqi"# Albanian 154 | ,"sr": "srp"# Serbian 155 | ,"ss": "ssw"# Swati 156 | ,"st": "sot"# Sotho 157 | ,"su": "sun"# Sundanese 158 | ,"sv": "swe"# Swedish 159 | ,"sw": "swa"# Swahili 160 | ,"ta": "tam"# Tamil 161 | ,"te": "tel"# Telugu 162 | ,"tg": "tgk"# Tajik 163 | ,"th": "tha"# Thai 164 | ,"ti": "tir"# Tigrinya 165 | ,"tk": "tuk"# Turkmen 166 | ,"tl": "tgl"# Tagalog 167 | ,"tn": "tsn"# Tswana 168 | ,"to": "ton"# Tonga 169 | ,"tr": "tur"# Turkish 170 | ,"ts": "tso"# Tsonga 171 | 
,"tt": "tat"# Tatar 172 | ,"tw": "twi"# Twi 173 | ,"ty": "tah"# Tahitian 174 | ,"ug": "uig"# Uighur 175 | ,"uk": "ukr"# Ukrainian 176 | ,"ur": "urd"# Urdu 177 | ,"uz": "uzb"# Uzbek 178 | ,"ve": "ven"# Venda 179 | ,"vi": "vie"# Namese 180 | ,"vo": "vol"# Volapük 181 | ,"wa": "wln"# Walloon 182 | ,"wo": "wol"# Wolof 183 | ,"xh": "xho"# Xhosa 184 | ,"yi": "yid"# Yiddish 185 | ,"yo": "yor"# Yoruba 186 | ,"za": "zha"# Zhuang 187 | ,"zh": "cmn"# MandarinChinese 188 | ,"zu": "zul"# Zulu 189 | } 190 | -------------------------------------------------------------------------------- /markers/OKW-MRK-da.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-da.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-de.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-de.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-el.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-el.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-en.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-en.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-es.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-es.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-fr.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-fr.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-hi.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-hi.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-hu.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-hu.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-it.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-it.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-ja.wav: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-ja.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-ko.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-ko.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-nl.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-nl.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-pl.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-pl.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-pt.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-pt.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-sv.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-sv.wav -------------------------------------------------------------------------------- /markers/OKW-MRK-uk.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK-uk.wav -------------------------------------------------------------------------------- /markers/OKW-MRK.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/OKW-MRK.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-da.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-da.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-de.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-de.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-el.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-el.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-en.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-en.wav 
-------------------------------------------------------------------------------- /markers/WOK-MRK-es.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-es.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-fr.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-fr.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-hi.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-hi.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-hu.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-hu.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-it.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-it.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-ja.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-ja.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-ko.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-ko.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-nl.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-nl.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-pl.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-pl.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-pt.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-pt.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-sv.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-sv.wav -------------------------------------------------------------------------------- /markers/WOK-MRK-uk.wav: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK-uk.wav -------------------------------------------------------------------------------- /markers/WOK-MRK.wav: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/EtienneAb3d/WhisperHallu/4fa7ad0b3be3987c6465f6f255259a02cba24d9b/markers/WOK-MRK.wav -------------------------------------------------------------------------------- /transcribeHallu.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import time 4 | import re 5 | from _io import StringIO 6 | 7 | if sys.version_info.major == 3 and sys.version_info.minor >= 10: 8 | print("Python >= 3.10") 9 | import collections.abc 10 | import collections 11 | collections.MutableMapping = collections.abc.MutableMapping 12 | else: 13 | print("Python < 3.10") 14 | import collections 15 | 16 | import traceback 17 | 18 | import torch 19 | 20 | torch.set_num_threads(1) 21 | useSileroVAD=True 22 | if(useSileroVAD): 23 | modelVAD, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', 24 | model='silero_vad', 25 | force_reload=False, 26 | onnx=False) 27 | (get_speech_timestamps, 28 | save_audio, 29 | read_audio, 30 | VADIterator, 31 | collect_chunks) = utils 32 | 33 | useSpleeter=False 34 | if(useSpleeter): 35 | from spleeter.audio import STFTBackend 36 | backend = STFTBackend.LIBROSA 37 | from spleeter.separator import Separator 38 | print("Using spleeter:2stems-16kHz") 39 | separator = Separator('spleeter:2stems-16kHz',stft_backend=backend) 40 | 41 | useDemucs=True 42 | if(useDemucs): 43 | from demucsWrapper import load_demucs_model 44 | from demucsWrapper import demucs_audio 45 | print("Using Demucs") 46 | modelDemucs = load_demucs_model() 47 | 48 | useCompressor=True 49 | 50 | try: 51 | #Standard Whisper: https://github.com/openai/whisper 52 | import whisper 53 | print("Using standard Whisper") 54 | whisperFound = "STD" 55 | from pathlib import Path 56 | from whisper.utils import WriteSRT 57 | except ImportError as e: 58 | pass 59 | 60 | try: 61 | #FasterWhisper: https://github.com/guillaumekln/faster-whisper 62 | from faster_whisper import WhisperModel 63 | print("Using Faster Whisper") 64 | whisperFound = "FSTR" 65 | modelPath = "whisper-medium-ct2/"#"whisper-medium-ct2/" "whisper-large-ct2/" 66 | if not os.path.exists(modelPath): 67 | print("Faster installation found, but "+modelPath+" model not found") 68 | sys.exit(-1) 69 | except ImportError as e: 70 | pass 71 | 72 | try: 73 | from seamless_communication.models.inference import Translator 74 | from lang2to3 import lang2to3 75 | lang2to3 = lang2to3() 76 | whisperFound = "SM4T" 77 | except ImportError as e: 78 | pass 79 | 80 | #large-v3 model seems to be bad with music, thus keep v2 as the default 81 | whisperVersion = "-v2" #May be "", "-V1", "-v2, "-v3" 82 | whisperLoaded = "??" 
83 | beam_size=2 84 | patience=0 85 | temperature=0 86 | model = None 87 | device = "cuda" #cuda / cpu 88 | cudaIdx = 0 89 | 90 | SAMPLING_RATE = 16000 91 | MAX_DURATION = 600 92 | TRUNC_DURATION = MAX_DURATION 93 | 94 | from threading import Lock, Thread 95 | lock = Lock() 96 | 97 | def loadModel(gpu: str,modelSize=None): 98 | global model 99 | global device 100 | global cudaIdx 101 | global whisperLoaded 102 | cudaIdx = gpu 103 | try: 104 | if whisperFound == "FSTR": 105 | if(modelSize == "large"): 106 | modelPath = "whisper-large-ct2/" 107 | else: 108 | modelPath = "whisper-medium-ct2/" 109 | print("LOADING: "+modelPath+" GPU: "+gpu+" BS: "+str(beam_size)+" PTC="+str(patience)+" TEMP="+str(temperature)) 110 | compute_type="float16"# float16 int8_float16 int8 111 | model = WhisperModel(modelPath, device=device,device_index=int(gpu), compute_type=compute_type) 112 | elif whisperFound == "STD": 113 | if(modelSize == None): 114 | modelSize="medium"#"tiny"#"medium" #"large" 115 | if(modelSize == "large"): 116 | modelSize = "large"+whisperVersion #"large-v1" "large-v2" "large-v3" 117 | print("LOADING: "+modelSize+" GPU:"+gpu+" BS: "+str(beam_size)+" PTC="+str(patience)+" TEMP="+str(temperature)) 118 | model = whisper.load_model(modelSize,device=torch.device("cuda:"+gpu)) #May be "cpu" 119 | elif whisperFound == "SM4T": 120 | print("LOADING: "+"seamlessM4T_large"+" GPU:"+gpu) 121 | model = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:"+gpu), torch.float16) 122 | print("LOADED") 123 | whisperLoaded = modelSize 124 | except Exception as e: 125 | print("Can't load Whisper model: "+whisperFound+"/"+modelSize) 126 | print(e) 127 | sys.exit(-1) 128 | 129 | def loadedModel(): 130 | return whisperFound+" "+whisperLoaded 131 | 132 | def getDuration(aLog:str): 133 | duration = None 134 | time = None 135 | with open(aLog) as f: 136 | lines = f.readlines() 137 | for line in lines: 138 | if(re.match(r"^ *Duration: [0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9], .*$", line, re.IGNORECASE)): 139 | duration = re.sub(r"(.*Duration: *|[,. ].*)", "", line, re.IGNORECASE) 140 | return sum(x * int(t) for x, t in zip([3600, 60, 1], duration.split(":"))) 141 | for aSub in line.split("[\r\n]"): 142 | if(re.match(r"^.*time=[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9][0-9] .*$", aSub, re.IGNORECASE)): 143 | #print("SUB="+aSub) 144 | time = re.sub(r"(.*time=|[,. ].*)", "", aSub, re.IGNORECASE) 145 | #Return last found time value 146 | if(time != "00:00:00"): 147 | print("TIME="+str(time)) 148 | return sum(x * int(t) for x, t in zip([3600, 60, 1], time.split(":"))) 149 | return None 150 | 151 | def formatTimeStamp(aT=0): 152 | aH = int(aT/3600) 153 | aM = int((aT%3600)/60) 154 | aS = (aT%60) 155 | return "%02d:%02d:%06.3f" % (aH,aM,aS) 156 | 157 | def getPrompt(lng:str): 158 | if(lng == "en"): 159 | aOk="" 160 | return "Whisper, Ok. "\ 161 | +"A pertinent sentence for your purpose in your language. "\ 162 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\ 163 | +"Please find here, an unlikely ordinary sentence. "\ 164 | +"This is to avoid a repetition to be deleted. "\ 165 | +"Ok, Whisper. " 166 | 167 | if(lng == "fr"): 168 | return "Whisper, Ok. "\ 169 | +"Une phrase pertinente pour votre propos dans votre langue. "\ 170 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\ 171 | +"Merci de trouver ci-joint, une phrase ordinaire improbable. "\ 172 | +"Pour éviter une répétition à être supprimée. "\ 173 | +"Ok, Whisper. " 174 | 175 | if(lng == "uk"): 176 | return "Whisper, Ok. 
"\ 177 | +"Доречне речення вашою мовою для вашої мети. "\ 178 | +"Ok, Whisper. Whisper, Ok. Ok, Whisper. Whisper, Ok. "\ 179 | +"Будь ласка, знайдіть тут навряд чи звичайне речення. "\ 180 | +"Це зроблено для того, щоб уникнути повторення, яке потрібно видалити. "\ 181 | +"Ok, Whisper. " 182 | 183 | if(lng == "hi"): 184 | return "विस्पर, ओके. "\ 185 | +"आपकी भाषा में आपके उद्देश्य के लिए एक प्रासंगिक वाक्य। "\ 186 | +"ओके, विस्पर. विस्पर, ओके. ओके, विस्पर. विस्पर, ओके. "\ 187 | +"कृपया यहां खोजें, एक असंभावित सामान्य वाक्य। "\ 188 | +"यह हटाए जाने की पुनरावृत्ति से बचने के लिए है। "\ 189 | +"ओके, विस्पर. " 190 | 191 | #Not Already defined? 192 | return "" 193 | 194 | 195 | def transcribePrompt(path: str,lng: str,prompt=None,lngInput=None,isMusic=False,addSRT=False,truncDuration=TRUNC_DURATION,maxDuration=MAX_DURATION): 196 | """Whisper transcribe.""" 197 | 198 | if(lngInput == None): 199 | lngInput=lng 200 | print("Using output language as input language: "+lngInput) 201 | 202 | if(prompt == None): 203 | if(not isMusic): 204 | prompt=getPrompt(lng) 205 | else: 206 | prompt=""; 207 | 208 | print("=====transcribePrompt",flush=True) 209 | print("PATH="+path,flush=True) 210 | print("LNGINPUT="+lngInput,flush=True) 211 | print("LNG="+lng,flush=True) 212 | print("PROMPT="+prompt,flush=True) 213 | opts = dict(language=lng,initial_prompt=prompt) 214 | return transcribeOpts(path, opts,lngInput,isMusic=isMusic,addSRT=addSRT,subEnd=truncDuration,maxDuration=maxDuration) 215 | 216 | def transcribeOpts(path: str,opts: dict 217 | ,lngInput=None,isMusic=False,onlySRT=False,addSRT=False 218 | ,subBeg="0",subEnd=str(TRUNC_DURATION) 219 | ,maxDuration=MAX_DURATION 220 | ,stretch=None 221 | ,nbRun=1#Whisper is unstable, especially with music. Multiple run can provide with better results to eval afterward 222 | ,remixFactor="0.3",speechnorm=True 223 | ): 224 | pathIn = path 225 | pathClean = path 226 | pathNoCut = path 227 | 228 | initTime = time.time() 229 | 230 | startTime = time.time() 231 | duration = -1 232 | try: 233 | #Convert to WAV to avoid later possible decoding problem 234 | pathWAV = pathIn+".WAV"+".wav" 235 | aCmd = "ffmpeg -y"+" -i \""+pathIn+"\""+" -ss "+subBeg+" -to "+subEnd + " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathWAV+"\" > \""+pathWAV+".log\" 2>&1" 236 | print("CMD: "+aCmd) 237 | os.system(aCmd) 238 | duration = getDuration(pathWAV+".log") 239 | print("T=",(time.time()-startTime)) 240 | print("DURATION="+str(duration)+" subBeg="+str(subBeg)+" subEnd="+str(subEnd)) 241 | print("PATH="+pathWAV,flush=True) 242 | pathIn = pathClean = pathWAV 243 | except Exception as e: 244 | print("Warning: can't convert to WAV") 245 | print(e) 246 | 247 | try: 248 | if(stretch != None): 249 | pathSTRETCH = pathIn+".STRETCH"+".wav" 250 | #ffmpeg STRECH 251 | aCmd = "ffmpeg -y -i \""+pathIn+"\""+" -t "+str(truncDuration) + " -filter:a \"atempo="+stretch+"\"" + " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathSTRETCH+"\" > \""+pathSTRETCH+".log\" 2>&1" 252 | #sox STRECH 253 | #aCmd = "sox \""+pathIn+"\""+" \""+pathSTRETCH+"\" tempo "+stretch+" > \""+pathSTRETCH+".log\" 2>&1" 254 | #soundstretch STRECH 255 | #aCmd = "soundstretch \""+pathIn+"\""+" \""+pathSTRETCH+"\" -tempo="+str(int(100*float(stretch)) - 100)+" > \""+pathSTRETCH+".log\" 2>&1" 256 | #rubberband STRECH 257 | #aCmd = "rubberband \""+pathIn+"\""+" \""+pathSTRETCH+"\" --tempo "+stretch+" > \""+pathSTRETCH+".log\" 2>&1" 258 | print("CMD: "+aCmd) 259 | os.system(aCmd) 260 | print("T=",(time.time()-startTime)) 261 | 
print("PATH="+pathWAV,flush=True) 262 | pathIn = pathClean = pathWAV = pathSTRETCH 263 | except Exception as e: 264 | print("Warning: can't STRETCH") 265 | print(e) 266 | 267 | startTime = time.time() 268 | try: 269 | #Check for duration 270 | aCmd = "ffmpeg -y -i \""+pathIn+"\" "+ " -f null - > \""+pathIn+".dur\" 2>&1" 271 | print("CMD: "+aCmd) 272 | os.system(aCmd) 273 | print("T=",(time.time()-startTime)) 274 | duration = getDuration(pathIn+".dur") 275 | print("DURATION="+str(duration)+" max "+str(maxDuration)) 276 | if(duration > maxDuration): 277 | return "[Too long ("+str(duration)+"s)]" 278 | except Exception as e: 279 | print("Warning: can't analyze duration") 280 | print(e) 281 | 282 | try: 283 | if(useSpleeter): 284 | startTime = time.time() 285 | spleeterDir=pathIn+".spleeter" 286 | if(not os.path.exists(spleeterDir)): 287 | os.mkdir(spleeterDir) 288 | pathSpleeter=spleeterDir+"/"+os.path.splitext(os.path.basename(pathIn))[0]+"/vocals.wav" 289 | separator.separate_to_file(pathIn, spleeterDir) 290 | print("T=",(time.time()-startTime)) 291 | print("PATH="+pathSpleeter,flush=True) 292 | pathNoCut = pathIn = pathSpleeter 293 | except Exception as e: 294 | print("Warning: can't split vocals") 295 | print(e) 296 | 297 | if(useDemucs): 298 | startTime = time.time() 299 | try: 300 | #demucsDir=pathIn+".demucs" 301 | #if(not os.path.exists(demucsDir)): 302 | # os.mkdir(demucsDir) 303 | pathDemucsVocals=pathIn+".vocals.wav" #demucsDir+"/htdemucs/"+os.path.splitext(os.path.basename(pathIn))[0]+"/vocals.wav" 304 | pathDemucsDrums=pathIn+".drums.wav" 305 | pathDemucsBass=pathIn+".bass.wav" 306 | pathDemucsOther=pathIn+".other.wav" 307 | #Demucs seems complex, using CLI cmd for now 308 | #aCmd = "python -m demucs --two-stems=vocals -d "+device+":"+cudaIdx+" --out "+demucsDir+" "+pathIn 309 | #print("CMD: "+aCmd) 310 | #os.system(aCmd) 311 | demucs_audio(pathIn=pathIn,model=modelDemucs,device="cuda:"+cudaIdx,pathVocals=pathDemucsVocals,pathOther=pathIn+".other.wav") 312 | print("T=",(time.time()-startTime)) 313 | print("PATH="+pathDemucsVocals,flush=True) 314 | pathNoCut = pathIn = pathDemucsVocals 315 | except Exception as e: 316 | print("Warning: can't split vocals") 317 | print(e) 318 | 319 | startTime = time.time() 320 | try: 321 | pathSILCUT = pathIn+".SILCUT"+".wav" 322 | aCmd = "ffmpeg -y -i \""+pathIn+"\" -af \"silenceremove=start_periods=1:stop_periods=-1:start_threshold=-50dB:stop_threshold=-50dB:start_silence=0.2:stop_silence=0.2, loudnorm\" "+ " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathSILCUT+"\" > \""+pathSILCUT+".log\" 2>&1" 323 | print("CMD: "+aCmd) 324 | os.system(aCmd) 325 | print("T=",(time.time()-startTime)) 326 | print("PATH="+pathSILCUT,flush=True) 327 | pathIn = pathSILCUT 328 | except Exception as e: 329 | print("Warning: can't filter blanks") 330 | print(e) 331 | 332 | try: 333 | if(not isMusic and useSileroVAD): 334 | startTime = time.time() 335 | 336 | pathVAD = pathIn+".VAD.wav" 337 | wav = read_audio(pathIn, sampling_rate=SAMPLING_RATE) 338 | #https://github.com/snakers4/silero-vad/blob/master/utils_vad.py#L161 339 | speech_timestamps = get_speech_timestamps(wav, modelVAD,threshold=0.5,min_silence_duration_ms=500, sampling_rate=SAMPLING_RATE) 340 | save_audio(pathVAD,collect_chunks(speech_timestamps, wav), sampling_rate=SAMPLING_RATE) 341 | print("T=",(time.time()-startTime)) 342 | print("PATH="+pathVAD,flush=True) 343 | pathIn = pathVAD 344 | except Exception as e: 345 | print("Warning: can't filter noises") 346 | print(e) 347 | 348 | try: 349 | 
if(float(remixFactor) >= 1): 350 | pathREMIXN = pathClean 351 | elif (float(remixFactor) <= 0 and useDemucs): 352 | pathREMIXN = pathDemucsVocals; 353 | elif (isMusic and useDemucs): 354 | startTime = time.time() 355 | 356 | if(speechnorm): 357 | pathNORM = pathDemucsVocals+".NORM.wav" 358 | aCmd = ("ffmpeg -y -i \""+pathDemucsVocals+"\"" 359 | #+ " -filter:a loudnorm" 360 | +" -af \"speechnorm=e=50:r=0.0005:l=1\"" 361 | +" \""+pathNORM+"\" > \""+pathNORM+".log\" 2>&1") 362 | print("CMD: "+aCmd) 363 | os.system(aCmd) 364 | print("T=",(time.time()-startTime)) 365 | print("PATH="+pathNORM,flush=True) 366 | else: 367 | pathNORM = pathDemucsVocals 368 | 369 | pathREMIXN = pathNORM+".REMIX.wav" 370 | aCmd = ("ffmpeg -y -i \""+pathNORM+"\" -i \""+pathDemucsDrums+"\" -i \""+pathDemucsBass+"\" -i \""+pathDemucsOther+"\"" 371 | +" -filter_complex amix=inputs=4:duration=longest:dropout_transition=0:weights=\"1 "+remixFactor+" "+remixFactor+" "+remixFactor+"\"" 372 | +" \""+pathREMIXN+"\" > \""+pathREMIXN+".log\" 2>&1") 373 | print("CMD: "+aCmd) 374 | os.system(aCmd) 375 | print("T=",(time.time()-startTime)) 376 | print("PATH="+pathREMIXN,flush=True) 377 | except Exception as e: 378 | print("Warning: can't remix") 379 | print(e) 380 | 381 | mode=1 382 | if(duration > 30): 383 | print("NOT USING MARKS FOR DURATION > 30s") 384 | mode=0 385 | 386 | startTime = time.time() 387 | if(onlySRT): 388 | result = {} 389 | result["text"] = "" 390 | else: 391 | result = transcribeMARK(pathIn, opts, mode=mode,lngInput=lngInput,isMusic=isMusic) 392 | if len(result["text"]) <= 0: 393 | result["text"] = "--" 394 | 395 | if(onlySRT or addSRT): 396 | #Better timestamps using original music clip 397 | if(isMusic 398 | #V3 is very bad with music!? 399 | and not whisperVersion == "-v3" 400 | ): 401 | if(pathREMIXN is not None): 402 | resultSRT = transcribeMARK(pathREMIXN, opts, mode=3,lngInput=lngInput,isMusic=isMusic 403 | ,nbRun=nbRun) 404 | else: 405 | resultSRT = transcribeMARK(pathClean, opts, mode=3,lngInput=lngInput,isMusic=isMusic 406 | ,nbRun=nbRun) 407 | else: 408 | resultSRT = transcribeMARK(pathNoCut, opts, mode=3,lngInput=lngInput,isMusic=isMusic 409 | ,nbRun=nbRun) 410 | 411 | result["text"] += resultSRT["text"] 412 | 413 | print("T=",(time.time()-initTime)) 414 | if(len(result["text"]) > 0): 415 | print("s/c=",(time.time()-initTime)/len(result["text"])) 416 | print("c/s=",len(result["text"])/(time.time()-initTime)) 417 | 418 | return result["text"] 419 | 420 | def transcribeMARK(path: str,opts: dict,mode = 1,lngInput=None,aLast=None,isMusic=False 421 | #Whisper is unstable, especially with music. 
Multiple run can provide with better results to eval afterward 422 | ,nbRun=1): 423 | print("transcribeMARK(): "+path) 424 | pathIn = path 425 | 426 | lng = opts["language"] 427 | 428 | if(lngInput == None): 429 | lngInput = lng 430 | 431 | noMarkRE = "^(ar|he|ru|zh)$" 432 | if(lng != None and re.match(noMarkRE,lng) and mode != 3): 433 | #Need special voice marks 434 | mode = 0 435 | 436 | if(isMusic and mode != 3): 437 | #Markers are not really interesting with music 438 | mode = 0 439 | 440 | if(whisperFound == "SM4T"): 441 | #Not marker with SM4T 442 | mode = 0 443 | 444 | if os.path.exists("markers/WOK-MRK-"+lngInput+".wav"): 445 | mark1="markers/WOK-MRK-"+lngInput+".wav" 446 | else: 447 | mark1="markers/WOK-MRK.wav" 448 | if os.path.exists("markers/OKW-MRK-"+lngInput+".wav"): 449 | mark2="markers/OKW-MRK-"+lngInput+".wav" 450 | else: 451 | mark2="markers/OKW-MRK.wav" 452 | 453 | if(mode == 2): 454 | mark = mark1 455 | mark1 = mark2 456 | mark2 = mark 457 | 458 | if(mode == 0): 459 | print("["+str(mode)+"] PATH="+pathIn,flush=True) 460 | else: 461 | try: 462 | if(mode != 3): 463 | startTime = time.time() 464 | pathMRK = pathIn+".MRK"+".wav" 465 | aCmd = "ffmpeg -y -i "+mark1+" -i \""+pathIn+"\" -i "+mark2+" -filter_complex \"[0:a][1:a][2:a]concat=n=3:v=0:a=1[a]\" -map \"[a]\" -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathMRK+"\" > \""+pathMRK+".log\" 2>&1" 466 | print("CMD: "+aCmd) 467 | os.system(aCmd) 468 | print("T=",(time.time()-startTime)) 469 | print("["+str(mode)+"] PATH="+pathMRK,flush=True) 470 | pathIn = pathMRK 471 | 472 | if(useCompressor 473 | and not isMusic 474 | ): 475 | startTime = time.time() 476 | pathCPS = pathIn+".CPS"+".wav" 477 | aCmd = "ffmpeg -y -i \""+pathIn+"\" -af \"speechnorm=e=50:r=0.0005:l=1\" "+ " -c:a pcm_s16le -ar "+str(SAMPLING_RATE)+" \""+pathCPS+"\" > \""+pathCPS+".log\" 2>&1" 478 | print("CMD: "+aCmd) 479 | os.system(aCmd) 480 | print("T=",(time.time()-startTime)) 481 | print("["+str(mode)+"] PATH="+pathCPS,flush=True) 482 | pathIn = pathCPS 483 | except Exception as e: 484 | print("Warning: can't add markers") 485 | print(e) 486 | 487 | startTime = time.time() 488 | lock.acquire() 489 | try: 490 | transcribe_options = dict(**opts)#avoid to add beam_size opt several times 491 | if beam_size > 1: 492 | transcribe_options["beam_size"] = beam_size 493 | if patience > 0: 494 | transcribe_options["patience"] = patience 495 | if temperature > 0: 496 | transcribe_options["temperature"] = temperature 497 | 498 | if whisperFound == "FSTR": 499 | result = {} 500 | result["text"] = "" 501 | multiRes = "" 502 | for r in range(nbRun): 503 | print("RUN: "+str(r)) 504 | segments, info = model.transcribe(pathIn,**transcribe_options) 505 | resSegs = [] 506 | if(mode == 3): 507 | aSegCount = 0 508 | for segment in segments: 509 | if("word_timestamps" in transcribe_options): 510 | for word in segment.words: 511 | aSegCount += 1 512 | resSegs.append("\n"+str(aSegCount)+"\n"+formatTimeStamp(word.start)+" --> "+formatTimeStamp(word.end)+"\n"+word.word.strip()+"\n") 513 | else: 514 | aSegCount += 1 515 | resSegs.append("\n"+str(aSegCount)+"\n"+formatTimeStamp(segment.start)+" --> "+formatTimeStamp(segment.end)+"\n"+segment.text.strip()+"\n") 516 | else: 517 | for segment in segments: 518 | resSegs.append(segment.text) 519 | 520 | result["text"] = "".join(resSegs) 521 | if(r > 0): 522 | multiRes += "=====\n" 523 | multiRes += result["text"] 524 | 525 | if(nbRun > 1): 526 | result["text"] = multiRes 527 | elif whisperFound == "SM4T": 528 | src_lang = 
lang2to3[lngInput]; 529 | tgt_lang = lang2to3[lng]; 530 | # S2TT 531 | #translated_text, _, _ = translator.predict(, "s2tt", ) 532 | translated_text, _, _ = model.predict(pathIn, "s2tt", tgt_lang) 533 | result = {} 534 | result["text"] = str(translated_text) 535 | else: 536 | transcribe_options = dict(task="transcribe", **transcribe_options) 537 | multiRes = "" 538 | for r in range(nbRun): 539 | print("RUN: "+str(r)) 540 | result = model.transcribe(pathIn,**transcribe_options) 541 | if(mode == 3): 542 | p = Path(pathIn) 543 | writer = WriteSRT(p.parent) 544 | srtOpts = { "max_line_width" : 80, "max_line_count" : 2, "highlight_words" : False} 545 | if("word_timestamps" in transcribe_options and transcribe_options["word_timestamps"]): 546 | srtOpts = { "max_line_width" : 30, "max_line_count" : 1, "highlight_words" : transcribe_options["word_timestamps"]} 547 | writer(result, pathIn,srtOpts) 548 | audio_basename = os.path.basename(pathIn) 549 | audio_basename = os.path.splitext(audio_basename)[0] 550 | output_path = os.path.join( 551 | p.parent, audio_basename + ".srt" 552 | ) 553 | with open(output_path) as f: 554 | result["text"] = f.read() 555 | 556 | if("word_timestamps" in transcribe_options and transcribe_options["word_timestamps"]): 557 | result["text"] = re.sub("(\n[^<\n]*|[^<\n]*\n)"#Remove lines without highlighted words 558 | ,"\n",re.sub(r"\n[^<\n]*\n\n","\n\n"#Keep only highlighted words 559 | ,result["text"])) 560 | if(r > 0): 561 | multiRes += "=====\n" 562 | multiRes += result["text"] 563 | if(nbRun > 1): 564 | result["text"] = multiRes 565 | 566 | print("T=",(time.time()-startTime)) 567 | print("TRANS="+result["text"],flush=True) 568 | except Exception as e: 569 | print(e) 570 | traceback.print_exc() 571 | lock.release() 572 | result = {} 573 | result["text"] = "" 574 | return result 575 | 576 | lock.release() 577 | 578 | if(mode == 0 or mode == 3): 579 | return result 580 | #Too restrictive 581 | #if(result["text"] == aLast): 582 | # #Only if confirmed 583 | # return result 584 | #result["text"] = "" 585 | #return result 586 | 587 | aWhisper="(Whisper|Wisper|Wyspę|Wysper|Wispa|Уіспер|Ου ίσπερ|위스퍼드|ウィスパー|विस्पर|विसपर)" 588 | aOk="(o[.]?k[.]?|okay|oké|okej|Окей|οκέι|오케이|オーケー|ओके)" 589 | aSep="[.,!? ]*" 590 | if(mode == 1): 591 | aCleaned = re.sub(r"(^ *"+aWhisper+aSep+aOk+aSep+"|"+aOk+aSep+aWhisper+aSep+" *$)", "", result["text"], 2, re.IGNORECASE) 592 | if(re.match(r"^ *("+aOk+"|"+aSep+"|"+aWhisper+")*"+aWhisper+"("+aOk+"|"+aSep+"|"+aWhisper+")* *$", result["text"], re.IGNORECASE)): 593 | #Empty sound ? 594 | return transcribeMARK(path, opts, mode=2,lngInput=lngInput,aLast="") 595 | 596 | if(re.match(r"^ *"+aWhisper+aSep+aOk+aSep+".*"+aOk+aSep+aWhisper+aSep+" *$", result["text"], re.IGNORECASE)): 597 | #GOOD! 598 | result["text"] = aCleaned 599 | return result 600 | 601 | return transcribeMARK(path, opts, mode=2,lngInput=lngInput,aLast=aCleaned) 602 | 603 | if(mode == 2): 604 | aCleaned = re.sub(r"(^ *"+aOk+aSep+aWhisper+aSep+"|"+aWhisper+aSep+aOk+aSep+" *$)", "", result["text"], 2, re.IGNORECASE) 605 | if(aCleaned == aLast): 606 | #CONFIRMED! 607 | result["text"] = aCleaned 608 | return result 609 | 610 | if(re.match(r"^ *("+aOk+"|"+aSep+"|"+aWhisper+")*"+aWhisper+"("+aOk+"|"+aSep+"|"+aWhisper+")* *$", result["text"], re.IGNORECASE)): 611 | #Empty sound ? 612 | result["text"] = "" 613 | return result 614 | 615 | if(re.match(r"^ *"+aOk+aSep+aWhisper+aSep+".*"+aWhisper+aSep+aOk+aSep+" *$", result["text"], re.IGNORECASE)): 616 | #GOOD! 
617 | result["text"] = aCleaned 618 | return result 619 | 620 | return transcribeMARK(path, opts, mode=0,lngInput=lngInput,aLast=aCleaned) 621 | 622 | --------------------------------------------------------------------------------