├── README.md
├── approach.png
├── approach.png.1
├── cog.yaml
├── language-breakdown.svg
└── predict.py

/README.md:
--------------------------------------------------------------------------------

**NOTE**:
Some folks reported a significant slowdown in the latest version, which includes the `large-v2` checkpoint, so it has been temporarily removed from https://replicate.com/openai/whisper and is available at https://replicate.com/cjwbw/whisper instead if you want to access it.

I have personally tested both versions but did not observe the reported slowdown. The issue has been raised with the team to see how to proceed with merging `large-v2` back into the mainline model.

# Whisper

[[Blog]](https://openai.com/blog/whisper)
[[Paper]](https://cdn.openai.com/papers/whisper.pdf)
[[Model card]](model-card.md)
[[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
[![Replicate](https://replicate.com/openai/whisper/badge)](https://replicate.com/openai/whisper)

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.


## Approach

![Approach](approach.png)

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.


## Setup

We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. The following command will pull and install the latest commit from this repository, along with its Python dependencies:

    pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of this repository, please run:

    pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```

You may need [`rust`](http://rust-lang.org) installed as well, in case [tokenizers](https://pypi.org/project/tokenizers/) does not provide a pre-built wheel for your platform. If you see installation errors during the `pip install` command above, please follow the [Getting started page](https://www.rust-lang.org/learn/get-started) to install the Rust development environment.
Additionally, you may need to configure the `PATH` environment variable, e.g. `export PATH="$HOME/.cargo/bin:$PATH"`. If the installation fails with `No module named 'setuptools_rust'`, you need to install `setuptools_rust`, e.g. by running:

```bash
pip install setuptools-rust
```


## Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |

For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.

Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by language on the Fleurs dataset, using the `large` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D of [the paper](https://cdn.openai.com/papers/whisper.pdf).

![WER breakdown by language](language-breakdown.svg)


## More examples

Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.


## License

The code and the model weights of Whisper are released under the MIT License. See [LICENSE](LICENSE) for further details.
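
## Python usage

For a quick end-to-end check of the setup above, transcription can also be run from Python. The snippet below is a minimal sketch using the standard `whisper` API; `audio.mp3` is only a placeholder for a local audio file:

```python
import whisper

# load any of the checkpoints listed in the table above, e.g. "base"
model = whisper.load_model("base")

# transcribe a local audio file (placeholder path) and print the recognized text
result = model.transcribe("audio.mp3")
print(result["text"])
```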

--------------------------------------------------------------------------------
/approach.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chenxwh/cog-whisper/7416f6e1c2dbb05aff8cfdea2e59615d8de0cf42/approach.png
--------------------------------------------------------------------------------
/approach.png.1:
--------------------------------------------------------------------------------
[Not an image: this file is a saved GitHub HTML page ("whisper/approach.png at main · openai/whisper · GitHub"). The page markup contains no content beyond the approach.png figure referenced above and is omitted here.]
--------------------------------------------------------------------------------
/cog.yaml:
--------------------------------------------------------------------------------
build:
  gpu: true
  python_version: "3.8"
  system_packages:
    - "ffmpeg"
  python_packages:
    - "tqdm==4.64.1"
    - "torch==1.13.0"
    - "torchaudio==0.13.0"
    - "numpy==1.23.5"
    - "transformers==4.25.1"
    - "ffmpeg-python==0.2.0"
    - "more_itertools==9.0.0"

  run:
    - pip install git+https://github.com/openai/whisper.git

predict: "predict.py:Predictor"

--------------------------------------------------------------------------------
/predict.py:
--------------------------------------------------------------------------------
"""
download the models to ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt -P ./weights
"""

import io
import os
from typing import Optional, Any
import torch
import numpy as np
from cog import BasePredictor, Input, Path, BaseModel

import whisper
from whisper.model import Whisper, ModelDimensions
from whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE
from whisper.utils import format_timestamp


class ModelOutput(BaseModel):
    detected_language: str
    transcription: str
    segments: Any
    translation: Optional[str]
    txt_file: Optional[Path]
    srt_file: Optional[Path]


class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""

        self.models = {}
        for model in ["tiny", "base", "small", "medium", "large-v1", "large-v2"]:
            with open(f"weights/{model}.pt", "rb") as fp:
                checkpoint = torch.load(fp, map_location="cpu")
                dims = ModelDimensions(**checkpoint["dims"])
                self.models[model] = Whisper(dims)
                self.models[model].load_state_dict(checkpoint["model_state_dict"])

    def predict(
        self,
        audio: Path = Input(description="Audio file"),
        model: str = Input(
            default="base",
            choices=["tiny", "base", "small", "medium", "large-v1", "large-v2"],
            description="Choose a Whisper model.",
        ),
        transcription: str = Input(
            choices=["plain text", "srt", "vtt"],
            default="plain text",
            description="Choose the format for the transcription",
        ),
        translate: bool = Input(
            default=False,
            description="Translate the text to English when set to True",
        ),
        language: str = Input(
            choices=sorted(LANGUAGES.keys())
            + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]),
            default=None,
            description="language spoken in the audio, specify None to perform language detection",
        ),
        temperature: float = Input(
            default=0,
            description="temperature to use for sampling",
        ),
        patience: float = Input(
            default=None,
            description="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search",
        ),
        suppress_tokens: str = Input(
            default="-1",
            description="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations",
        ),
        initial_prompt: str = Input(
            default=None,
            description="optional text to provide as a prompt for the first window.",
        ),
        condition_on_previous_text: bool = Input(
            default=True,
            description="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop",
        ),
        temperature_increment_on_fallback: float = Input(
            default=0.2,
            description="temperature to increase when falling back when the decoding fails to meet either of the thresholds below",
        ),
        compression_ratio_threshold: float = Input(
            default=2.4,
            description="if the gzip compression ratio is higher than this value, treat the decoding as failed",
        ),
        logprob_threshold: float = Input(
            default=-1.0,
            description="if the average log probability is lower than this value, treat the decoding as failed",
        ),
        no_speech_threshold: float = Input(
            default=0.6,
            description="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence",
        ),
    ) -> ModelOutput:

        """Run a single prediction on the model"""
        print(f"Transcribe with {model} model")
        model = self.models[model].to("cuda")

        if temperature_increment_on_fallback is not None:
            temperature = tuple(
                np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback)
            )
        else:
            temperature = [temperature]

        args = {
            "language": language,
            "patience": patience,
            "suppress_tokens": suppress_tokens,
            "initial_prompt": initial_prompt,
            "condition_on_previous_text": condition_on_previous_text,
            "compression_ratio_threshold": compression_ratio_threshold,
            "logprob_threshold": logprob_threshold,
            "no_speech_threshold": no_speech_threshold,
        }

        result = model.transcribe(str(audio), temperature=temperature, **args)

        if transcription == "plain text":
            transcription = result["text"]
        elif transcription == "srt":
            transcription = write_srt(result["segments"])
        else:
            transcription = write_vtt(result["segments"])

        if translate:
            translation = model.transcribe(
                str(audio), task="translate", temperature=temperature, **args
            )

        return ModelOutput(
            segments=result["segments"],
            detected_language=LANGUAGES[result["language"]],
            transcription=transcription,
            translation=translation["text"] if translate else None,
        )


def write_vtt(transcript):
    result = ""
    for segment in transcript:
        result += f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
        result += f"{segment['text'].strip().replace('-->', '->')}\n"
        result += "\n"
    return result


def write_srt(transcript):
    result = ""
    for i, segment in enumerate(transcript, start=1):
        result += f"{i}\n"
        result += f"{format_timestamp(segment['start'], always_include_hours=True, decimal_marker=',')} --> "
        result += f"{format_timestamp(segment['end'], always_include_hours=True, decimal_marker=',')}\n"
        result += f"{segment['text'].strip().replace('-->', '->')}\n"
        result += "\n"
    return result

--------------------------------------------------------------------------------
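
The checkpoint-loading pattern in `Predictor.setup()` above can also be reused as a standalone smoke test outside of Cog. The sketch below is a hypothetical example, not part of the repository: it assumes `weights/base.pt` has been downloaded as shown in the `predict.py` docstring, that a CUDA GPU is available, and that `audio.mp3` stands in for a real audio file. The temperature tuple matches the fallback schedule the predictor builds from its defaults (start at 0, increment by 0.2).

```python
# Standalone sketch mirroring Predictor.setup()/predict(); file paths are placeholders.
import torch

from whisper.model import Whisper, ModelDimensions

with open("weights/base.pt", "rb") as fp:
    checkpoint = torch.load(fp, map_location="cpu")

# rebuild the model from the checkpoint's stored dimensions and weights
model = Whisper(ModelDimensions(**checkpoint["dims"]))
model.load_state_dict(checkpoint["model_state_dict"])
model = model.to("cuda")

# same temperature fallback schedule as the predictor's defaults (0.0, 0.2, ..., 1.0)
result = model.transcribe("audio.mp3", temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
print(result["text"])
```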