├── README.md
├── approach.png
├── approach.png.1
├── cog.yaml
├── language-breakdown.svg
└── predict.py

/README.md:
--------------------------------------------------------------------------------

**NOTE**:
Some folks reported a significant slowdown in the latest version, which includes the `large-v2` checkpoint, so it has been temporarily removed from https://replicate.com/openai/whisper and is available at https://replicate.com/cjwbw/whisper instead if you want to access it.

I have personally tested both versions but did not observe the reported slowdown. The issue has been raised with the team to see how to proceed with merging `large-v2` back into the mainline model.

# Whisper

[[Blog]](https://openai.com/blog/whisper)
[[Paper]](https://cdn.openai.com/papers/whisper.pdf)
[[Model card]](model-card.md)
[[Colab example]](https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb)
[![Replicate](https://replicate.com/openai/whisper/badge)](https://replicate.com/openai/whisper)

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.


## Approach

![Approach](approach.png)

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many different stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.


## Setup

We used Python 3.9.9 and [PyTorch](https://pytorch.org/) 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.7 or later and recent PyTorch versions. The codebase also depends on a few Python packages, most notably [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) for their fast tokenizer implementation and [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for reading audio files. The following command will pull and install the latest commit from this repository, along with its Python dependencies:

    pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of this repository, please run:

    pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

It also requires the command-line tool [`ffmpeg`](https://ffmpeg.org/) to be installed on your system, which is available from most package managers:

```bash
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
```

You may need [`rust`](http://rust-lang.org) installed as well, in case [tokenizers](https://pypi.org/project/tokenizers/) does not provide a pre-built wheel for your platform. If you see installation errors during the `pip install` command above, please follow the [Getting started page](https://www.rust-lang.org/learn/get-started) to install the Rust development environment.
Additionally, you may need to configure the `PATH` environment variable, e.g. `export PATH="$HOME/.cargo/bin:$PATH"`. If the installation fails with `No module named 'setuptools_rust'`, you need to install `setuptools_rust`, e.g. by running:

```bash
pip install setuptools-rust
```


## Available models and languages

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and relative speed.

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|:------:|:----------:|:------------------:|:------------------:|:-------------:|:--------------:|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |

For English-only applications, the `.en` models tend to perform better, especially for the `tiny.en` and `base.en` models. We observed that the difference becomes less significant for the `small.en` and `medium.en` models.

Whisper's performance varies widely depending on the language. The figure below shows a WER breakdown by language on the Fleurs dataset, using the `large` model. More WER and BLEU scores corresponding to the other models and datasets can be found in Appendix D of [the paper](https://cdn.openai.com/papers/whisper.pdf).

![WER breakdown by language](language-breakdown.svg)


## More examples

Please use the [🙌 Show and tell](https://github.com/openai/whisper/discussions/categories/show-and-tell) category in Discussions for sharing more example usages of Whisper and third-party extensions such as web demos, integrations with other tools, ports for different platforms, etc.


## License

The code and the model weights of Whisper are released under the MIT License. See [LICENSE](LICENSE) for further details.
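
## Python usage

For a quick end-to-end check of the setup above, transcription can also be run from Python. The snippet below is a minimal sketch using the standard `whisper` API; `audio.mp3` is only a placeholder for a local audio file:

```python
import whisper

# load any of the checkpoints listed in the table above, e.g. "base"
model = whisper.load_model("base")

# transcribe a local audio file (placeholder path) and print the recognized text
result = model.transcribe("audio.mp3")
print(result["text"])
```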

--------------------------------------------------------------------------------
/approach.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/chenxwh/cog-whisper/7416f6e1c2dbb05aff8cfdea2e59615d8de0cf42/approach.png
--------------------------------------------------------------------------------
/approach.png.1:
--------------------------------------------------------------------------------
[Not an image: this file is a saved GitHub HTML page ("whisper/approach.png at main · openai/whisper · GitHub"). The page markup contains no content beyond the approach.png figure referenced above and is omitted here.]
--------------------------------------------------------------------------------
/cog.yaml:
--------------------------------------------------------------------------------
build:
  gpu: true
  python_version: "3.8"
  system_packages:
    - "ffmpeg"
  python_packages:
    - "tqdm==4.64.1"
    - "torch==1.13.0"
    - "torchaudio==0.13.0"
    - "numpy==1.23.5"
    - "transformers==4.25.1"
    - "ffmpeg-python==0.2.0"
    - "more_itertools==9.0.0"

  run:
    - pip install git+https://github.com/openai/whisper.git

predict: "predict.py:Predictor"

--------------------------------------------------------------------------------
/predict.py:
--------------------------------------------------------------------------------
"""
download the models to ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt -P ./weights
wget https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt -P ./weights
"""

import io
import os
from typing import Optional, Any
import torch
import numpy as np
from cog import BasePredictor, Input, Path, BaseModel

import whisper
from whisper.model import Whisper, ModelDimensions
from whisper.tokenizer import LANGUAGES, TO_LANGUAGE_CODE
from whisper.utils import format_timestamp


class ModelOutput(BaseModel):
    detected_language: str
    transcription: str
    segments: Any
    translation: Optional[str]
    txt_file: Optional[Path]
    srt_file: Optional[Path]


class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""

        self.models = {}
        for model in ["tiny", "base", "small", "medium", "large-v1", "large-v2"]:
            with open(f"weights/{model}.pt", "rb") as fp:
                checkpoint = torch.load(fp, map_location="cpu")
                dims = ModelDimensions(**checkpoint["dims"])
                self.models[model] = Whisper(dims)
                self.models[model].load_state_dict(checkpoint["model_state_dict"])

    def predict(
        self,
        audio: Path = Input(description="Audio file"),
        model: str = Input(
            default="base",
            choices=["tiny", "base", "small", "medium", "large-v1", "large-v2"],
            description="Choose a Whisper model.",
        ),
        transcription: str = Input(
            choices=["plain text", "srt", "vtt"],
            default="plain text",
            description="Choose the format for the transcription",
        ),
        translate: bool = Input(
            default=False,
            description="Translate the text to English when set to True",
        ),
        language: str = Input(
            choices=sorted(LANGUAGES.keys())
            + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]),
            default=None,
            description="language spoken in the audio, specify None to perform language detection",
        ),
        temperature: float = Input(
            default=0,
            description="temperature to use for sampling",
        ),
        patience: float = Input(
            default=None,
            description="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search",
        ),
        suppress_tokens: str = Input(
            default="-1",
            description="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations",
        ),
        initial_prompt: str = Input(
            default=None,
            description="optional text to provide as a prompt for the first window.",
        ),
        condition_on_previous_text: bool = Input(
            default=True,
            description="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop",
        ),
        temperature_increment_on_fallback: float = Input(
            default=0.2,
            description="temperature to increase when falling back when the decoding fails to meet either of the thresholds below",
        ),
        compression_ratio_threshold: float = Input(
            default=2.4,
            description="if the gzip compression ratio is higher than this value, treat the decoding as failed",
        ),
        logprob_threshold: float = Input(
            default=-1.0,
            description="if the average log probability is lower than this value, treat the decoding as failed",
        ),
        no_speech_threshold: float = Input(
            default=0.6,
            description="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence",
        ),
    ) -> ModelOutput:

        """Run a single prediction on the model"""
        print(f"Transcribe with {model} model")
        model = self.models[model].to("cuda")

        if temperature_increment_on_fallback is not None:
            temperature = tuple(
                np.arange(temperature, 1.0 + 1e-6, temperature_increment_on_fallback)
            )
        else:
            temperature = [temperature]

        args = {
            "language": language,
            "patience": patience,
            "suppress_tokens": suppress_tokens,
            "initial_prompt": initial_prompt,
            "condition_on_previous_text": condition_on_previous_text,
            "compression_ratio_threshold": compression_ratio_threshold,
            "logprob_threshold": logprob_threshold,
            "no_speech_threshold": no_speech_threshold,
        }

        result = model.transcribe(str(audio), temperature=temperature, **args)

        if transcription == "plain text":
            transcription = result["text"]
        elif transcription == "srt":
            transcription = write_srt(result["segments"])
        else:
            transcription = write_vtt(result["segments"])

        if translate:
            translation = model.transcribe(
                str(audio), task="translate", temperature=temperature, **args
            )

        return ModelOutput(
            segments=result["segments"],
            detected_language=LANGUAGES[result["language"]],
            transcription=transcription,
            translation=translation["text"] if translate else None,
        )


def write_vtt(transcript):
    result = ""
    for segment in transcript:
        result += f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
        result += f"{segment['text'].strip().replace('-->', '->')}\n"
        result += "\n"
    return result


def write_srt(transcript):
    result = ""
    for i, segment in enumerate(transcript, start=1):
        result += f"{i}\n"
        result += f"{format_timestamp(segment['start'], always_include_hours=True, decimal_marker=',')} --> "
        result += f"{format_timestamp(segment['end'], always_include_hours=True, decimal_marker=',')}\n"
        result += f"{segment['text'].strip().replace('-->', '->')}\n"
        result += "\n"
    return result

--------------------------------------------------------------------------------
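
The checkpoint-loading pattern in `Predictor.setup()` above can also be reused as a standalone smoke test outside of Cog. The sketch below is a hypothetical example, not part of the repository: it assumes `weights/base.pt` has been downloaded as shown in the `predict.py` docstring, that a CUDA GPU is available, and that `audio.mp3` stands in for a real audio file. The temperature tuple matches the fallback schedule the predictor builds from its defaults (start at 0, increment by 0.2).

```python
# Standalone sketch mirroring Predictor.setup()/predict(); file paths are placeholders.
import torch

from whisper.model import Whisper, ModelDimensions

with open("weights/base.pt", "rb") as fp:
    checkpoint = torch.load(fp, map_location="cpu")

# rebuild the model from the checkpoint's stored dimensions and weights
model = Whisper(ModelDimensions(**checkpoint["dims"]))
model.load_state_dict(checkpoint["model_state_dict"])
model = model.to("cuda")

# same temperature fallback schedule as the predictor's defaults (0.0, 0.2, ..., 1.0)
result = model.transcribe("audio.mp3", temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0))
print(result["text"])
```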