├── codec_bpe
│   ├── core
│   │   ├── __init__.py
│   │   ├── utils.py
│   │   ├── sentencepiece_bpe.py
│   │   ├── converter.py
│   │   └── trainer.py
│   ├── tools
│   │   ├── __init__.py
│   │   ├── extender.py
│   │   ├── lm_dataset_builder.py
│   │   ├── codec_utils.py
│   │   └── audio_encoder.py
│   ├── __init__.py
│   ├── extend_tokenizer.py
│   ├── lm_dataset_stats.py
│   ├── train_tokenizer.py
│   ├── prep_lm_dataset.py
│   └── audio_to_codes.py
├── requirements_neucodec.txt
├── requirements_magicodec.txt
├── requirements_wavtokenizer.txt
├── requirements_funcodec.txt
├── requirements_simvq.txt
├── requirements_xcodec2.txt
├── requirements.txt
├── img
│   └── codec_bpe.png
├── .github
│   └── workflows
│       └── pypi-release.yml
├── LICENSE
├── setup.py
├── .gitignore
├── CHANGELOG.md
└── README.md

/codec_bpe/core/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/codec_bpe/tools/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/requirements_neucodec.txt:
--------------------------------------------------------------------------------
1 | neucodec
--------------------------------------------------------------------------------
/requirements_magicodec.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
--------------------------------------------------------------------------------
/requirements_wavtokenizer.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
--------------------------------------------------------------------------------
/requirements_funcodec.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | funcodec
--------------------------------------------------------------------------------
/requirements_simvq.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | omegaconf
--------------------------------------------------------------------------------
/requirements_xcodec2.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | xcodec2
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | librosa
2 | numpy
3 | tokenizers>=0.19.0
4 | torch
5 | transformers>=4.45.0
--------------------------------------------------------------------------------
/img/codec_bpe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AbrahamSanders/codec-bpe/HEAD/img/codec_bpe.png
--------------------------------------------------------------------------------
/codec_bpe/__init__.py:
--------------------------------------------------------------------------------
1 | from .core.converter import (
2 |     codes_to_chars,
3 |     chars_to_codes,
4 |     UNICODE_OFFSET,
5 |     UNICODE_OFFSET_LARGE,
6 | )
7 | 
8 | __version__ = "1.4.1"
--------------------------------------------------------------------------------
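(Editor's illustration, not a repository file.) The converter exports above are the package's core API: `codes_to_chars` flattens a (num_codebooks, seq_length) grid of audio codes into a unicode string, timestep-major, with each codebook shifted into its own character range, and `chars_to_codes` inverts it. A minimal round trip, assuming a hypothetical 2-codebook codec with codebook size 1024:

import numpy as np
from codec_bpe import codes_to_chars, chars_to_codes

codes = np.array([[17,   5, 900],   # codebook 0
                  [ 3, 256,  42]])  # codebook 1
chars = codes_to_chars(codes, codebook_size=1024)
# 6 characters, timestep-major (t0/cb0, t0/cb1, t1/cb0, ...), where codebook i
# is mapped into [UNICODE_OFFSET + i*1024, UNICODE_OFFSET + (i+1)*1024)
assert chars_to_codes(chars, num_codebooks=2, codebook_size=1024) == codes.tolist()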
/.github/workflows/pypi-release.yml:
--------------------------------------------------------------------------------
1 | name: Publish Python package
2 | 
3 | on:
4 |   release:
5 |     types: [ published ]
6 | 
7 | permissions:
8 |   contents: read
9 | 
10 | jobs:
11 |   publish:
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |     - uses: actions/checkout@v4
15 |     - name: Set up Python
16 |       uses: actions/setup-python@v5
17 |       with:
18 |         python-version: "3.x"
19 |     - name: Install dependencies
20 |       run: |
21 |         python -m pip install --upgrade pip
22 |         pip install setuptools wheel
23 |     - name: Build a binary wheel
24 |       run: >-
25 |         python setup.py sdist bdist_wheel
26 |     - name: Publish to PyPI
27 |       uses: pypa/gh-action-pypi-publish@release/v1
28 |       with:
29 |         password: ${{ secrets.PYPI_API_TOKEN }}
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 Abraham Sanders
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | -------------------------------------------------------------------------------- /codec_bpe/extend_tokenizer.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from transformers import AutoTokenizer 3 | 4 | from .tools.extender import extend_existing_tokenizer 5 | 6 | if __name__ == "__main__": 7 | parser = argparse.ArgumentParser(description="Extend an existing Transformers tokenizer with codec BPE tokens") 8 | parser.add_argument("--existing_tokenizer", type=str, required=True) 9 | parser.add_argument("--codec_bpe_tokenizer", type=str, required=True) 10 | parser.add_argument("--additional_special_tokens", nargs="+", default=None) 11 | parser.add_argument("--save_path", type=str) 12 | args = parser.parse_args() 13 | 14 | if args.save_path is None: 15 | args.save_path = f"output/{args.existing_tokenizer}_extended" 16 | 17 | existing_tokenizer = AutoTokenizer.from_pretrained(args.existing_tokenizer) 18 | codec_bpe_tokenizer = AutoTokenizer.from_pretrained(args.codec_bpe_tokenizer) 19 | 20 | num_added = extend_existing_tokenizer(existing_tokenizer, codec_bpe_tokenizer, args.additional_special_tokens) 21 | print(f"Added {num_added} tokens to the existing tokenizer {args.existing_tokenizer} and saved it as {args.save_path}.") 22 | existing_tokenizer.save_pretrained(args.save_path) -------------------------------------------------------------------------------- /codec_bpe/tools/extender.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Union, List 2 | from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast 3 | from tqdm import trange 4 | 5 | def extend_existing_tokenizer( 6 | existing_tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 7 | codec_bpe_tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 8 | additional_special_tokens: Optional[List[str]] = None, 9 | ) -> int: 10 | target_tokens = [] 11 | skip_token_ids = set([ 12 | codec_bpe_tokenizer.bos_token_id, 13 | codec_bpe_tokenizer.eos_token_id, 14 | codec_bpe_tokenizer.unk_token_id, 15 | codec_bpe_tokenizer.pad_token_id, 16 | ]) 17 | for i in trange(len(codec_bpe_tokenizer)): 18 | if i in skip_token_ids: 19 | continue 20 | token = codec_bpe_tokenizer.convert_ids_to_tokens(i) 21 | target_tokens.append(token) 22 | 23 | num_added = 0 24 | if additional_special_tokens: 25 | num_added += existing_tokenizer.add_special_tokens( 26 | special_tokens_dict={"additional_special_tokens": additional_special_tokens}, 27 | replace_additional_special_tokens=False, 28 | ) 29 | num_added += existing_tokenizer.add_tokens(target_tokens, special_tokens=True) 30 | return num_added 31 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import io 2 | import os 3 | 4 | from setuptools import find_packages, setup 5 | 6 | for line in open("codec_bpe/__init__.py"): 7 | line = line.strip() 8 | if "__version__" in line: 9 | context = {} 10 | exec(line, context) 11 | VERSION = context["__version__"] 12 | 13 | 14 | def read(*paths, **kwargs): 15 | with io.open(os.path.join(os.path.dirname(__file__), *paths), encoding=kwargs.get("encoding", "utf8")) as open_file: 16 | content = open_file.read().strip() 17 | return content 18 | 19 | 20 | def read_requirements(path): 21 | return [line.strip() for line in read(path).split("\n") if not line.startswith(('"', "#", "-", "git+"))] 22 | 23 | 
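# (Editor's aside, illustrative -- not part of the repository source:) the
# extras_require mapping in setup() below exposes each optional codec backend
# as a pip extra, so a user can install, e.g.:
#   pip install codec-bpe              (core dependencies only)
#   pip install codec-bpe[xcodec2]     (adds the XCodec2 requirements)
# assuming the "codec-bpe" distribution name declared below.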
24 | setup(
25 |     name="codec-bpe",
26 |     version=VERSION,
27 |     author="Abraham Sanders",
28 |     author_email="abraham.sanders@gmail.com",
29 |     description="Implementation of Acoustic BPE (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs",
30 |     url="https://github.com/AbrahamSanders/codec-bpe",
31 |     long_description=read("README.md"),
32 |     long_description_content_type="text/markdown",
33 |     packages=find_packages(),
34 |     install_requires=read_requirements("requirements.txt"),
35 |     extras_require={
36 |         "funcodec": read_requirements("requirements_funcodec.txt"),
37 |         "xcodec2": read_requirements("requirements_xcodec2.txt"),
38 |         "wavtokenizer": read_requirements("requirements_wavtokenizer.txt"),
39 |         "simvq": read_requirements("requirements_simvq.txt"),
40 |         "magicodec": read_requirements("requirements_magicodec.txt"),
41 |         "neucodec": read_requirements("requirements_neucodec.txt"),
42 |     },
43 | )
--------------------------------------------------------------------------------
/codec_bpe/core/utils.py:
--------------------------------------------------------------------------------
1 | from typing import List, Optional, Union
2 | from argparse import Namespace
3 | import json
4 | import os
5 | 
6 | def get_codes_files(
7 |     codes_path: str,
8 |     codes_filter: Optional[Union[str, List[str]]] = None,
9 |     num_files: Optional[int] = None,
10 | ) -> List[str]:
11 |     return get_files(codes_path, ".npy", codes_filter, num_files)
12 | 
13 | def get_files(
14 |     path: str,
15 |     extension: str,
16 |     filter: Optional[Union[str, List[str]]] = None,
17 |     num_files: Optional[int] = None,
18 | ) -> List[str]:
19 |     if isinstance(filter, str):
20 |         filter = [filter]
21 |     result_files = []
22 |     for root, _, files in os.walk(path):
23 |         for file in files:
24 |             file_path = os.path.join(root, file)
25 |             if not file_path.endswith(extension):
26 |                 continue
27 |             if filter and not any([f in file_path for f in filter]):
28 |                 continue
29 |             result_files.append(file_path)
30 |     result_files.sort()
31 |     if num_files is not None:
32 |         result_files = result_files[:num_files]
33 |     return result_files
34 | 
35 | def get_codec_info(codes_path: str) -> dict:
36 |     codec_info_file = os.path.join(codes_path, "codec_info.json")
37 |     if not os.path.exists(codec_info_file):
38 |         return None
39 |     with open(codec_info_file, "r") as f:
40 |         codec_info = json.load(f)
41 |     return codec_info
42 | 
43 | def update_args_from_codec_info(args: Namespace, codec_info: dict) -> Namespace:
44 |     if codec_info is not None:
45 |         if "num_codebooks" in args and args.num_codebooks is None:
46 |             args.num_codebooks = codec_info["num_codebooks"]
47 |         if "codebook_size" in args and args.codebook_size is None:
48 |             args.codebook_size = codec_info["codebook_size"]
49 |         if "codec_framerate" in args and args.codec_framerate is None:
50 |             args.codec_framerate = codec_info["framerate"]
51 |     return args
--------------------------------------------------------------------------------
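(Editor's note, illustrative.) The `codec_info.json` consumed above is written into the codes directory by `codec_bpe.audio_to_codes`, so that downstream scripts can omit `--num_codebooks`, `--codebook_size`, and `--codec_framerate`. A sketch of its shape -- the field names follow the consumers above, while the values shown are an assumption, roughly what EnCodec 24 kHz at 1.5 kbps would produce (2 codebooks of 1024 codes at 75 Hz):

{
    "codec_type": "encodec",
    "num_codebooks": 2,
    "codebook_size": 1024,
    "framerate": 75.0
}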
/codec_bpe/lm_dataset_stats.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import numpy as np
3 | import argparse
4 | 
5 | from .core.utils import get_codec_info, update_args_from_codec_info
6 | 
7 | if __name__ == "__main__":
8 |     parser = argparse.ArgumentParser(description="Compute statistics for a plain-text codec BPE dataset")
9 |     parser.add_argument("--dataset_path", type=str, required=True)
10 |     parser.add_argument("--codes_path", type=str)
11 |     parser.add_argument("--num_codebooks", type=int, default=None)
12 |     parser.add_argument("--codec_framerate", type=float, default=None)
13 |     parser.add_argument("--audio_start_token", type=str)
14 |     parser.add_argument("--audio_end_token", type=str)
15 |     parser.add_argument("--num_examples", type=int, default=None)
16 |     args = parser.parse_args()
17 | 
18 |     if args.codes_path is not None:
19 |         codec_info = get_codec_info(args.codes_path)
20 |         update_args_from_codec_info(args, codec_info)
21 |     if args.num_codebooks is None or args.codec_framerate is None:
22 |         error_cause = "codec_info.json does not exist in --codes_path" if args.codes_path is not None else "--codes_path is not specified"
23 |         raise ValueError(f"{error_cause} so you must specify --num_codebooks and --codec_framerate manually.")
24 | 
25 |     lengths = []
26 |     with open(args.dataset_path, encoding="utf-8") as f:
27 |         for i, line in tqdm(enumerate(f), desc="Examples"):
28 |             if i == args.num_examples:
29 |                 break
30 |             line = line.rstrip()
31 |             if args.audio_start_token is not None:
32 |                 line = line.removeprefix(args.audio_start_token)  # removeprefix, not lstrip: lstrip treats the token as a character *set*
33 |             if args.audio_end_token is not None:
34 |                 line = line.removesuffix(args.audio_end_token)  # likewise, removesuffix rather than rstrip
35 |             if line[0] == "<":
36 |                 line = line.replace("<", "").replace(">", "")
37 |             num_units = len(line) / args.num_codebooks
38 |             num_seconds = num_units / args.codec_framerate
39 |             lengths.append(num_seconds)
40 |     total_seconds = np.sum(lengths)
41 | 
42 |     print(f"{len(lengths)} examples")
43 |     print(f"Total: {total_seconds:.2f} seconds ({(total_seconds / 3600):.2f} hours)")
44 |     print(f"Max: {np.max(lengths):.2f} seconds")
45 |     print(f"Min: {np.min(lengths):.2f} seconds")
46 |     print(f"Median: {np.median(lengths):.2f} seconds")
47 |     print(f"Mean: {np.mean(lengths):.2f} seconds")
48 |     print(f"Std: {np.std(lengths):.2f} seconds")
49 | 
--------------------------------------------------------------------------------
/codec_bpe/train_tokenizer.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import functools
3 | 
4 | from .core.trainer import Trainer
5 | from .core.utils import get_codec_info, update_args_from_codec_info
6 | from .
import UNICODE_OFFSET 7 | 8 | if __name__ == "__main__": 9 | parser = argparse.ArgumentParser(description="Train a codec BPE tokenizer from numpy files containing audio codes") 10 | parser.add_argument("--codes_path", type=str, required=True) 11 | parser.add_argument("--num_codebooks", type=int, default=None) 12 | parser.add_argument("--codebook_size", type=int, default=None) 13 | parser.add_argument("--codec_framerate", type=float, default=None) 14 | parser.add_argument("--chunk_size_secs", type=int, default=None) 15 | parser.add_argument("--vocab_size", type=int, default=30000) 16 | parser.add_argument("--min_frequency", type=int, default=2) 17 | parser.add_argument("--special_tokens", nargs="+", default=None) 18 | parser.add_argument("--bos_token", type=str) 19 | parser.add_argument("--eos_token", type=str) 20 | parser.add_argument("--unk_token", type=str) 21 | parser.add_argument("--pad_token", type=str) 22 | parser.add_argument("--max_token_codebook_ngrams", type=int, default=None) 23 | # handle hex values for unicode_offset with argparse: https://stackoverflow.com/a/25513044 24 | parser.add_argument("--unicode_offset", type=functools.partial(int, base=0), default=UNICODE_OFFSET) 25 | parser.add_argument("--save_path", type=str) 26 | parser.add_argument("--codes_filter", type=str, nargs="+") 27 | parser.add_argument("--num_files", type=int, default=None) 28 | args = parser.parse_args() 29 | 30 | codec_info = get_codec_info(args.codes_path) 31 | update_args_from_codec_info(args, codec_info) 32 | if args.num_codebooks is None or args.codebook_size is None: 33 | raise ValueError( 34 | "codec_info.json does not exist in --codes_path so you must specify --num_codebooks and --codebook_size manually." 35 | ) 36 | 37 | codec_type = codec_info["codec_type"] if codec_info is not None else "codec" 38 | if args.save_path is None: 39 | args.save_path = f"output/{codec_type}_bpe_{args.num_codebooks}cb_{round(args.vocab_size/1000)}k" 40 | 41 | trainer = Trainer( 42 | args.num_codebooks, 43 | args.codebook_size, 44 | args.codec_framerate, 45 | args.chunk_size_secs, 46 | args.vocab_size, 47 | args.min_frequency, 48 | args.special_tokens, 49 | args.bos_token, 50 | args.eos_token, 51 | args.unk_token, 52 | args.pad_token, 53 | args.max_token_codebook_ngrams, 54 | args.unicode_offset, 55 | ) 56 | tokenizer = trainer.train(args.codes_path, args.codes_filter, args.num_files) 57 | tokenizer.save_pretrained(args.save_path) 58 | -------------------------------------------------------------------------------- /codec_bpe/prep_lm_dataset.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import functools 4 | from tqdm import tqdm 5 | from transformers import AutoTokenizer 6 | 7 | from .tools.lm_dataset_builder import LMDatasetBuilder 8 | from .core.utils import get_codec_info, update_args_from_codec_info 9 | from . 
import UNICODE_OFFSET 10 | 11 | if __name__ == "__main__": 12 | parser = argparse.ArgumentParser( 13 | description="Use numpy files containing audio codes to construct a plain-text codec BPE dataset suitable for language modeling" 14 | ) 15 | parser.add_argument("--tokenizer", type=str, required=True) 16 | parser.add_argument("--codes_path", type=str, required=True) 17 | parser.add_argument("--num_codebooks", type=int, default=None) 18 | parser.add_argument("--codebook_size", type=int, default=None) 19 | parser.add_argument("--audio_start_token", type=str) 20 | parser.add_argument("--audio_end_token", type=str) 21 | # handle hex values for unicode_offset with argparse: https://stackoverflow.com/a/25513044 22 | parser.add_argument("--unicode_offset", type=functools.partial(int, base=0), default=UNICODE_OFFSET) 23 | parser.add_argument("--sequence_length", type=int, default=4096) 24 | parser.add_argument("--overlap_length", type=int, default=1024) 25 | parser.add_argument("--drop_last", action="store_true") 26 | parser.add_argument("--save_path", type=str, default="output/lm_dataset.txt") 27 | parser.add_argument("--codes_filter", type=str, nargs="+") 28 | parser.add_argument("--num_examples", type=int, default=None) 29 | args = parser.parse_args() 30 | 31 | codec_info = get_codec_info(args.codes_path) 32 | update_args_from_codec_info(args, codec_info) 33 | if args.num_codebooks is None or args.codebook_size is None: 34 | raise ValueError( 35 | "codec_info.json does not exist in --codes_path so you must specify --num_codebooks and --codebook_size manually." 36 | ) 37 | 38 | tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) 39 | 40 | lm_dataset_builder = LMDatasetBuilder( 41 | tokenizer=tokenizer, 42 | num_codebooks=args.num_codebooks, 43 | codebook_size=args.codebook_size, 44 | audio_start_token=args.audio_start_token, 45 | audio_end_token=args.audio_end_token, 46 | unicode_offset=args.unicode_offset, 47 | sequence_length=args.sequence_length, 48 | overlap_length=args.overlap_length, 49 | drop_last=args.drop_last, 50 | ) 51 | 52 | save_dir = os.path.dirname(args.save_path) 53 | if save_dir: 54 | os.makedirs(save_dir, exist_ok=True) 55 | 56 | with open(args.save_path, "w", encoding="utf-8") as f: 57 | for i, example in tqdm(enumerate(lm_dataset_builder.iterate_examples(args.codes_path, args.codes_filter)), desc="Examples"): 58 | if i == args.num_examples: 59 | break 60 | f.write(example) 61 | f.write("\n") 62 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 110 | .pdm.toml 111 | .pdm-python 112 | .pdm-build/ 113 | 114 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 115 | __pypackages__/ 116 | 117 | # Celery stuff 118 | celerybeat-schedule 119 | celerybeat.pid 120 | 121 | # SageMath parsed files 122 | *.sage.py 123 | 124 | # Environments 125 | .env 126 | .venv 127 | env/ 128 | venv/ 129 | ENV/ 130 | env.bak/ 131 | venv.bak/ 132 | 133 | # Spyder project settings 134 | .spyderproject 135 | .spyproject 136 | 137 | # Rope project settings 138 | .ropeproject 139 | 140 | # mkdocs documentation 141 | /site 142 | 143 | # mypy 144 | .mypy_cache/ 145 | .dmypy.json 146 | dmypy.json 147 | 148 | # Pyre type checker 149 | .pyre/ 150 | 151 | # pytype static type analyzer 152 | .pytype/ 153 | 154 | # Cython debug symbols 155 | cython_debug/ 156 | 157 | # PyCharm 158 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 159 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 160 | # and can be added to the global gitignore or merged into this file. For a more nuclear 161 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
162 | #.idea/
163 | 
164 | # VSCode
165 | .vscode/
166 | output/
167 | audio/
168 | codes/
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | **2025-12-01**
2 | - Added support for [NeuCodec](https://huggingface.co/neuphonic/neucodec), a new high-quality single-level codec with a 50 Hz framerate! NeuCodec extends XCodec2 with inference speedups and a commercially permissive license. Use `--codec_model neuphonic/neucodec` when encoding audio with `codec_bpe.audio_to_codes` to encode using the NeuCodec model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
3 | 
4 | **2025-06-22**
5 | - Added support for [MagiCodec](https://github.com/Ereboas/MagiCodec), a new **streaming** single-level codec with a 50 Hz framerate! Use `--codec_model MagiCodec-50Hz-Base` when encoding audio with `codec_bpe.audio_to_codes` to encode using the MagiCodec model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
6 | 
7 | **2025-06-19**
8 | - Added ability to encode audio into subsecond chunk sizes with a sliding window of prior audio as context. This helps support use-cases where the encoded audio should simulate a streaming setting. For example, many codecs will encode the same audio differently depending on the encoder's receptive field size - even with native streaming codecs like Mimi. So, when training a streaming speech-to-text audio LM, we want to encode the training audio in tiny chunks so that it resembles what will be received during live streaming. This helps prevent throwing the model out of distribution at inference time.
9 |   - Use the `--chunk_size_secs` and `--context_secs` parameters with `codec_bpe.audio_to_codes` to configure this.
10 |   - By default `--chunk_size_secs=30` and `--context_secs=0.0` for non-streaming usage.
11 |   - `--context_secs` controls the sliding window encoding size, which is useful to avoid codec degradation at tiny chunk sizes. For example, `--chunk_size_secs=0.08` with `--context_secs=0.4` will encode audio in chunks of 80ms, each chunk receiving the previous 320ms of audio as context to the encoder's receptive field (we encode 320 + 80 = 400ms of audio at a time but only keep the final 80ms of codes). See the sketch below.
12 | 
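*(Editor's illustration -- not part of CHANGELOG.md.)* A streaming-style run equivalent to `python -m codec_bpe.audio_to_codes --codec_model kyutai/mimi --chunk_size_secs 0.08 --context_secs 0.4` can also be sketched against the package's `AudioEncoder` directly; the choice of Mimi and the paths here are assumptions:

```python
from codec_bpe.tools.audio_encoder import AudioEncoder

encoder = AudioEncoder(
    "kyutai/mimi",         # any supported codec; Mimi is natively streaming
    chunk_size_secs=0.08,  # keep only 80ms of codes per encoding step
    context_secs=0.4,      # left-pad each chunk with the prior 320ms of audio
)
encoder.encode_audio("audio", "output/codes")  # audio dir in, .npy codes dir out
```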
13 | **2025-06-16**
14 | - Added support for [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) and [SimVQ](https://github.com/youngsheen/SimVQ)! Both are single-level codecs that share the same architecture but differ in their VQ strategy. WavTokenizer comes in 40Hz and 75Hz variants with a vocabulary size of 4096. SimVQ variants have a 75Hz framerate with vocabulary sizes ranging from 4096 to 262144 codes. SimVQ also features a causal encoder and partially causal decoder, making it suitable for streaming use cases.
15 |   - Use `--codec_model WavTokenizer-large-320-24k-4096` (or any other from the `Model` column on [this table](README.md#supported-codecs)) with `codec_bpe.audio_to_codes` to encode audio using WavTokenizer.
16 |   - Use `--codec_model simvq_4k` (or any other from the `Model` column on [this table](README.md#supported-codecs)) with `codec_bpe.audio_to_codes` to encode audio using SimVQ.
17 |   - See [here](README.md#train-a-tokenizer-from-audio-files) for usage examples.
18 | 
19 | **2025-04-07**
20 | - Added support for [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2), a high-quality multilingual single-level codec with a 50 Hz framerate! Use `--codec_model HKUSTAudio/xcodec2` when encoding audio with `codec_bpe.audio_to_codes` to encode using the XCodec2 model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
21 | 
22 | **2025-03-09**
23 | - Added support for [FunCodec](https://funcodec.github.io/) from Alibaba DAMO Speech Lab! Use `--codec_model alibaba-damo/...` when encoding audio with `codec_bpe.audio_to_codes` to encode using the FunCodec model. Model paths on the HuggingFace hub are listed [here](https://github.com/modelscope/FunCodec?tab=readme-ov-file#available-models). See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
24 | 
25 | **2024-09-20**
26 | - Added support for Kyutai Lab's [Mimi codec](https://huggingface.co/kyutai/mimi), an amazing new codec with a 12.5 Hz framerate! Use `--codec_model kyutai/mimi` when encoding audio with `codec_bpe.audio_to_codes` to encode using the Mimi model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
27 | 
28 | **2024-09-19**
29 | - Initial Release!
--------------------------------------------------------------------------------
/codec_bpe/core/sentencepiece_bpe.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, Iterator, List, Optional, Tuple, Union
2 | 
3 | from tokenizers import AddedToken, Tokenizer, decoders, pre_tokenizers, trainers
4 | from tokenizers.models import BPE
5 | 
6 | from tokenizers.implementations.base_tokenizer import BaseTokenizer
7 | 
8 | 
9 | class SentencePieceBPETokenizer(BaseTokenizer):
10 |     """SentencePiece BPE Tokenizer
11 | 
12 |     Represents the BPE algorithm, with the pretokenization used by SentencePiece
13 | 
14 |     ------------------------------------------------------------------------------------------------
15 |     Adapted from:
16 |     https://github.com/huggingface/tokenizers/blob/v0.19.1/bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
17 | 
18 |     Changes:
19 |     (1) Removed NFKC Unicode normalization: We're using unicode characters as a base alphabet and their content is arbitrary
20 |         for our purpose, so we don't need to normalize them.
21 |     (2) Added the max_token_length parameter for BpeTrainer to the `train` and `train_from_iterator` methods.
22 | ------------------------------------------------------------------------------------------------ 23 | """ 24 | 25 | def __init__( 26 | self, 27 | vocab: Optional[Union[str, Dict[str, int]]] = None, 28 | merges: Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] = None, 29 | unk_token: Union[str, AddedToken] = "", 30 | replacement: str = "▁", 31 | add_prefix_space: bool = True, 32 | dropout: Optional[float] = None, 33 | fuse_unk: Optional[bool] = False, 34 | ): 35 | if vocab is not None and merges is not None: 36 | tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) 37 | else: 38 | tokenizer = Tokenizer(BPE(dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) 39 | 40 | if tokenizer.token_to_id(str(unk_token)) is not None: 41 | tokenizer.add_special_tokens([str(unk_token)]) 42 | 43 | prepend_scheme = "always" if add_prefix_space else "never" 44 | tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme) 45 | tokenizer.decoder = decoders.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme) 46 | 47 | parameters = { 48 | "model": "SentencePieceBPE", 49 | "unk_token": unk_token, 50 | "replacement": replacement, 51 | "add_prefix_space": add_prefix_space, 52 | "dropout": dropout, 53 | } 54 | 55 | super().__init__(tokenizer, parameters) 56 | 57 | @staticmethod 58 | def from_file(vocab_filename: str, merges_filename: str, **kwargs): 59 | vocab, merges = BPE.read_file(vocab_filename, merges_filename) 60 | return SentencePieceBPETokenizer(vocab, merges, **kwargs) 61 | 62 | def train( 63 | self, 64 | files: Union[str, List[str]], 65 | vocab_size: int = 30000, 66 | min_frequency: int = 2, 67 | special_tokens: List[Union[str, AddedToken]] = [""], 68 | limit_alphabet: int = 1000, 69 | initial_alphabet: List[str] = [], 70 | max_token_length: Optional[int] = None, 71 | show_progress: bool = True, 72 | ): 73 | """Train the model using the given files""" 74 | 75 | trainer = trainers.BpeTrainer( 76 | vocab_size=vocab_size, 77 | min_frequency=min_frequency, 78 | special_tokens=special_tokens, 79 | limit_alphabet=limit_alphabet, 80 | initial_alphabet=initial_alphabet, 81 | max_token_length=max_token_length, 82 | show_progress=show_progress, 83 | ) 84 | if isinstance(files, str): 85 | files = [files] 86 | self._tokenizer.train(files, trainer=trainer) 87 | 88 | def train_from_iterator( 89 | self, 90 | iterator: Union[Iterator[str], Iterator[Iterator[str]]], 91 | vocab_size: int = 30000, 92 | min_frequency: int = 2, 93 | special_tokens: List[Union[str, AddedToken]] = [""], 94 | limit_alphabet: int = 1000, 95 | initial_alphabet: List[str] = [], 96 | max_token_length: Optional[int] = None, 97 | show_progress: bool = True, 98 | length: Optional[int] = None, 99 | ): 100 | """Train the model using the given iterator""" 101 | 102 | trainer = trainers.BpeTrainer( 103 | vocab_size=vocab_size, 104 | min_frequency=min_frequency, 105 | special_tokens=special_tokens, 106 | limit_alphabet=limit_alphabet, 107 | initial_alphabet=initial_alphabet, 108 | max_token_length=max_token_length, 109 | show_progress=show_progress, 110 | ) 111 | self._tokenizer.train_from_iterator( 112 | iterator, 113 | trainer=trainer, 114 | length=length, 115 | ) 116 | -------------------------------------------------------------------------------- /codec_bpe/tools/lm_dataset_builder.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Union, Iterator, List 2 | 
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast 3 | from tqdm import tqdm 4 | import numpy as np 5 | import re 6 | 7 | from ..core.converter import codes_to_chars, validate_unicode_offset, UNICODE_OFFSET 8 | from ..core.utils import get_codes_files 9 | 10 | class LMDatasetBuilder: 11 | def __init__( 12 | self, 13 | tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 14 | num_codebooks: int, 15 | codebook_size: int, 16 | audio_start_token: Optional[str] = None, 17 | audio_end_token: Optional[str] = None, 18 | unicode_offset: int = UNICODE_OFFSET, 19 | sequence_length: int = 4096, 20 | overlap_length: int = 1024, 21 | drop_last: bool = False, 22 | ): 23 | self.tokenizer = tokenizer 24 | self.num_codebooks = num_codebooks 25 | self.codebook_size = codebook_size 26 | self.unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 27 | self.sequence_length = sequence_length 28 | self.overlap_length = overlap_length 29 | self.drop_last = drop_last 30 | 31 | self.audio_start_token_id = None 32 | if audio_start_token is not None: 33 | self.audio_start_token_id = self.tokenizer.convert_tokens_to_ids(audio_start_token) 34 | if self.audio_start_token_id is None: 35 | raise ValueError(f"Token '{audio_start_token}' not found in tokenizer") 36 | self.audio_end_token_id = None 37 | if audio_end_token is not None: 38 | self.audio_end_token_id = self.tokenizer.convert_tokens_to_ids(audio_end_token) 39 | if self.audio_end_token_id is None: 40 | raise ValueError(f"Token '{audio_end_token}' not found in tokenizer") 41 | 42 | def _group_codes_files(self, codes_files: List[str]) -> List[List[str]]: 43 | grouped_codes_files = [] 44 | last_file_root = None 45 | for codes_file in codes_files: 46 | file_root = re.match(r"(.+)_c\d+[_.]", codes_file).group(1) 47 | if file_root != last_file_root: 48 | grouped_codes_files.append([]) 49 | last_file_root = file_root 50 | grouped_codes_files[-1].append(codes_file) 51 | return grouped_codes_files 52 | 53 | def iterate_examples(self, codes_path: str, codes_filter: Optional[Union[str, List[str]]] = None) -> Iterator[str]: 54 | codes_files = get_codes_files(codes_path, codes_filter) 55 | # group codes files by root filename (minus channel and starting timestamp) 56 | grouped_codes_files = self._group_codes_files(codes_files) 57 | for file_group in tqdm(grouped_codes_files, desc="Codes file groups"): 58 | # concatenate all codes files in each group 59 | codes = np.concatenate([np.load(file) for file in file_group], axis=-1) 60 | if len(codes.shape) == 4: 61 | codes = codes[0, 0] 62 | elif len(codes.shape) == 3: 63 | codes = codes[0] 64 | codes = codes[:self.num_codebooks] 65 | # convert to unicode string 66 | chars = codes_to_chars( 67 | codes, 68 | self.codebook_size, 69 | copy_before_conversion=False, 70 | unicode_offset=self.unicode_offset, 71 | ) 72 | # encode the unicode string with the tokenizer 73 | tokens = self.tokenizer.encode(chars, return_tensors="np")[0] 74 | sequence_length = self.sequence_length 75 | if self.tokenizer.bos_token_id is not None and tokens[0] == self.tokenizer.bos_token_id: 76 | tokens = tokens[1:] 77 | sequence_length -= 1 78 | if self.tokenizer.eos_token_id is not None and tokens[-1] == self.tokenizer.eos_token_id: 79 | tokens = tokens[:-1] 80 | sequence_length -= 1 81 | if self.audio_start_token_id is not None: 82 | sequence_length -= 1 83 | if self.audio_end_token_id is not None: 84 | sequence_length -= 1 85 | # yield examples from the sequence with the specified sequence length and 
overlap 86 | start = 0 87 | while True: 88 | end = start + sequence_length 89 | if self.drop_last and end > len(tokens): 90 | break 91 | example_tokens = tokens[start:end] 92 | # add audio start and end tokens if specified 93 | if self.audio_start_token_id is not None: 94 | example_tokens = np.concatenate([[self.audio_start_token_id], example_tokens]) 95 | if self.audio_end_token_id is not None: 96 | example_tokens = np.concatenate([example_tokens, [self.audio_end_token_id]]) 97 | example = self.tokenizer.decode(example_tokens) 98 | yield example 99 | if end >= len(tokens): 100 | break 101 | start = end - self.overlap_length -------------------------------------------------------------------------------- /codec_bpe/core/converter.py: -------------------------------------------------------------------------------- 1 | """ 2 | Converter utility for converting discrete codec codes to and from unicode characters used for BPE tokenization. 3 | """ 4 | from typing import List, Optional, Union, Tuple 5 | import logging 6 | import numpy as np 7 | import torch 8 | 9 | logger = logging.getLogger(__name__) 10 | 11 | UNICODE_OFFSET: int = 0x4E00 12 | """Original unicode offset from the Acoustic BPE paper (Shen et al., 2024)""" 13 | UNICODE_OFFSET_LARGE: int = 0xE000 14 | """For very large codebook size (e.g. > 32768), use this higher unicode offset to avoid running into surrogates 15 | which are not printable and won't work with BPE tokenization.""" 16 | 17 | def codes_to_chars( 18 | codes: Union[List[List[int]], np.ndarray, torch.Tensor], 19 | codebook_size: int, 20 | copy_before_conversion: bool = True, 21 | unicode_offset: int = UNICODE_OFFSET, 22 | ) -> str: 23 | if isinstance(codes, list): 24 | codes = np.array(codes) 25 | copy_before_conversion = False 26 | elif isinstance(codes, torch.Tensor): 27 | codes = codes.cpu().numpy() 28 | if len(codes.shape) != 2: 29 | raise ValueError("codes must be a 2D array of shape (num_codebooks, seq_length).") 30 | unicode_offset = validate_unicode_offset(unicode_offset, codes.shape[0], codebook_size) 31 | if copy_before_conversion: 32 | codes = codes.copy() 33 | for i in range(codes.shape[0]): 34 | codes[i] += unicode_offset + i*codebook_size 35 | codes = codes.T.reshape(-1) 36 | chars = "".join([chr(c) for c in codes]) 37 | return chars 38 | 39 | def chars_to_codes( 40 | chars: str, 41 | num_codebooks: int, 42 | codebook_size: int, 43 | drop_inconsistent_codes: bool = True, 44 | drop_hanging_codes: bool = True, 45 | return_hanging_codes_chars: bool = False, 46 | return_tensors: Optional[str] = None, 47 | unicode_offset: int = UNICODE_OFFSET, 48 | ) -> Union[List[List[int]], np.ndarray, torch.Tensor]: 49 | unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 50 | codes = np.array([ord(c) for c in chars]) 51 | if drop_inconsistent_codes: 52 | codes = _drop_inconsistent_codes(codes, num_codebooks, codebook_size, unicode_offset) 53 | if drop_hanging_codes: 54 | codes, begin_hanging, end_hanging = _drop_hanging_codes(codes, num_codebooks, codebook_size, unicode_offset) 55 | codes = codes.reshape(-1, num_codebooks).T 56 | for i in range(codes.shape[0]): 57 | codes[i] -= unicode_offset + i*codebook_size 58 | if return_tensors is None: 59 | codes = codes.tolist() 60 | elif return_tensors == "pt": 61 | codes = torch.tensor(codes) 62 | if return_hanging_codes_chars: 63 | begin_hanging = "".join([chr(c) for c in begin_hanging]) 64 | end_hanging = "".join([chr(c) for c in end_hanging]) 65 | return codes, begin_hanging, end_hanging 66 | 
return codes 67 | 68 | def validate_unicode_offset(unicode_offset: int, num_codebooks: int, codebook_size: int) -> int: 69 | # If the range [unicode_offset, unicode_offset+num_codebooks*codebook_size) intersects with the 70 | # surrogate range [0xD800, 0xDFFF], then we need to use the large unicode offset. 71 | lower = unicode_offset 72 | upper = unicode_offset + num_codebooks * codebook_size 73 | surrogate_lower = 0xD800 74 | surrogate_upper = 0xDFFF 75 | if lower < surrogate_upper and upper > surrogate_lower: 76 | raise ValueError( 77 | f"You are using unicode offset {hex(unicode_offset)}, however your base vocabulary size (num_codebooks x codebook_size) " 78 | f"is {num_codebooks*codebook_size} which will intersect with the non-printable surrogate range 0xD800-0xDFFF if starting from this offset.\n" 79 | f"To avoid this issue, use a unicode offset starting after the surrogate range, such as {hex(UNICODE_OFFSET_LARGE)}." 80 | ) 81 | return unicode_offset 82 | 83 | def _resolve_codebook(code: int, num_codebooks: int, codebook_size: int, unicode_offset: int) -> int: 84 | codebook = num_codebooks-1 85 | while codebook > -1 and code < unicode_offset + codebook*codebook_size: 86 | codebook -= 1 87 | return codebook 88 | 89 | def _drop_inconsistent_codes( 90 | codes: np.ndarray, 91 | num_codebooks: int, 92 | codebook_size: int, 93 | unicode_offset: int, 94 | ) -> np.ndarray: 95 | mask = np.ones_like(codes, dtype=bool) 96 | expected_codebook = _resolve_codebook(codes[0], num_codebooks, codebook_size, unicode_offset) 97 | if expected_codebook < 0: 98 | expected_codebook = 0 99 | for i in range(len(codes)): 100 | # figure out which codebook the character belongs to 101 | actual_codebook = _resolve_codebook(codes[i], num_codebooks, codebook_size, unicode_offset) 102 | # mark it to be dropped if it doesn't match the expected codebook 103 | if actual_codebook != expected_codebook: 104 | mask[i] = False 105 | logger.warning( 106 | f"Dropped inconsistent audio code at position {i}. " 107 | f"Expected codebook {expected_codebook} but got codebook {actual_codebook}." 
108 | ) 109 | else: 110 | expected_codebook = (expected_codebook + 1) % num_codebooks 111 | codes = codes[mask] 112 | return codes 113 | 114 | def _drop_hanging_codes( 115 | codes: np.ndarray, 116 | num_codebooks: int, 117 | codebook_size: int, 118 | unicode_offset: int, 119 | ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: 120 | # first check for hanging codes at the beginning 121 | begin_hanging = [] 122 | while len(codes) > 0: 123 | actual_codebook = _resolve_codebook(codes[0], num_codebooks, codebook_size, unicode_offset) 124 | if actual_codebook == 0: 125 | break 126 | begin_hanging.append(codes[0]) 127 | codes = codes[1:] 128 | logger.info(f"Dropped hanging audio code (codebook {actual_codebook}) at beginning of sequence.") 129 | # then check for hanging codes at the end 130 | end_hanging = [] 131 | while len(codes) > 0: 132 | actual_codebook = _resolve_codebook(codes[-1], num_codebooks, codebook_size, unicode_offset) 133 | if actual_codebook == num_codebooks-1: 134 | break 135 | end_hanging.append(codes[-1]) 136 | codes = codes[:-1] 137 | logger.info(f"Dropped hanging audio code (codebook {actual_codebook}) at end of sequence.") 138 | begin_hanging = np.array(begin_hanging) 139 | end_hanging = np.array(end_hanging)[::-1] 140 | return codes, begin_hanging, end_hanging 141 | 142 | 143 | -------------------------------------------------------------------------------- /codec_bpe/audio_to_codes.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | 5 | from .tools.audio_encoder import AudioEncoder, CodecTypes, SUPPORTED_EXTENSIONS 6 | 7 | if __name__ == "__main__": 8 | parser = argparse.ArgumentParser( 9 | description="Convert audio files to numpy files containing audio codes using a Codec" 10 | ) 11 | parser.add_argument( 12 | "--audio_path", 13 | type=str, 14 | default="audio", 15 | help="Directory containing the audio files", 16 | ) 17 | parser.add_argument( 18 | "--codes_path", 19 | type=str, 20 | default="output/codes", 21 | help="Directory to save the numpy codes files", 22 | ) 23 | parser.add_argument( 24 | "--chunk_size_secs", 25 | type=float, 26 | default=30.0, help="Chunk size in seconds", 27 | ) 28 | parser.add_argument( 29 | "--context_secs", 30 | type=float, 31 | default=0.0, 32 | help=( 33 | "Context size in seconds for encoding (default: 0.0, no context). " 34 | "If set, chunks will be left-padded with max(0, context_secs-chunk_size_secs) " 35 | "seconds of previous audio, while only chunk_size_secs worth of codes will be saved. " 36 | "This is useful for codecs that require context for better encoding quality at " 37 | "very small chunk sizes." 38 | ), 39 | ) 40 | parser.add_argument( 41 | "--batch_size", 42 | type=int, 43 | default=1, 44 | help="Number of audio chunks to process in a single batch", 45 | ) 46 | parser.add_argument( 47 | "--codec_type", 48 | type=str, 49 | choices=list(CodecTypes), 50 | default=None, 51 | help="Type of codec to use for encoding. None to infer the type from --codec_model.", 52 | ) 53 | parser.add_argument( 54 | "--codec_model", 55 | type=str, 56 | default="facebook/encodec_24khz", 57 | help="Codec model path on the HuggingFace Model Hub.", 58 | ) 59 | parser.add_argument( 60 | "--bandwidth", 61 | type=float, 62 | default=None, 63 | help=( 64 | "Bandwidth for encoding. Only applies if --codec_type is 'encodec' or 'funcodec'. " 65 | "Values may be provided in kbps (e.g. 1.5) or in bps (e.g. 1500)." 
66 | "For FunCodec, valid ranges for this parameter are listed in the 'Bitrate' column at " 67 | "https://github.com/modelscope/FunCodec?tab=readme-ov-file#available-models. " 68 | "For EnCodec, valid values are 1.5, 3.0, 6.0, 12.0, and 24.0 (kpbs). " 69 | "None uses the max bandwidth with FunCodec and the min bandwidth with EnCodec." 70 | ), 71 | ) 72 | parser.add_argument( 73 | "--n_quantizers", 74 | type=int, 75 | default=None, 76 | help=( 77 | "Number of quantizers (codebooks) to use for encoding. None to use all quantizers. " 78 | "Only applies if --codec_type is 'dac' or 'mimi'." 79 | ), 80 | ) 81 | parser.add_argument( 82 | "--stereo", 83 | action="store_true", 84 | help="Encode stereo audio channels separately instead of converting to mono", 85 | ) 86 | parser.add_argument( 87 | "--file_per_chunk", 88 | action="store_true", 89 | help=( 90 | "Save each audio chunk as a separate numpy file with the start timestamp (secs) in the filename " 91 | "instead of the default behavior of concatenating all chunks into a single numpy file corresponding " 92 | "to the original audio file." 93 | ), 94 | ) 95 | parser.add_argument( 96 | "--extensions", 97 | nargs="+", 98 | default=SUPPORTED_EXTENSIONS, 99 | help="Audio file extensions to convert. Formats must be supported by a librosa backend.", 100 | ) 101 | parser.add_argument( 102 | "--audio_filter", 103 | nargs="+", 104 | help=( 105 | "Audio file filters. If provided, file paths must match one of the filters to be converted." 106 | ) 107 | ) 108 | parser.add_argument( 109 | "--overwrite", 110 | action="store_true", 111 | help=( 112 | "Overwrite existing numpy codes directories. If not set, audio corresponding to existing " 113 | "numpy codes directories will be skipped." 114 | ), 115 | ) 116 | parser.add_argument( 117 | "--codec_info_only", 118 | action="store_true", 119 | help="Only write codec info and do not convert any audio files.", 120 | ) 121 | args = parser.parse_args() 122 | 123 | codec_name_for_path = args.codec_model.split("/")[-1] 124 | codec_setting_for_path = f"{args.chunk_size_secs}s_{args.context_secs}s" 125 | args.codes_path = os.path.join( 126 | args.codes_path, codec_name_for_path, codec_setting_for_path, "stereo" if args.stereo else "mono" 127 | ) 128 | 129 | audio_encoder = AudioEncoder( 130 | args.codec_model, 131 | codec_type=args.codec_type, 132 | chunk_size_secs=args.chunk_size_secs, 133 | context_secs=args.context_secs, 134 | batch_size=args.batch_size, 135 | bandwidth=args.bandwidth, 136 | n_quantizers=args.n_quantizers, 137 | stereo=args.stereo, 138 | file_per_chunk=args.file_per_chunk, 139 | ) 140 | 141 | codec_info = audio_encoder.get_codec_info() 142 | 143 | # iterate and convert 144 | if args.codec_info_only: 145 | os.makedirs(args.codes_path, exist_ok=True) 146 | else: 147 | result = audio_encoder.encode_audio( 148 | args.audio_path, 149 | args.codes_path, 150 | extensions=args.extensions, 151 | audio_filter=args.audio_filter, 152 | overwrite=args.overwrite, 153 | ) 154 | # Print summary 155 | print(f"Attempted to convert {result.num_audio_files} audio files:") 156 | print(f"{result.num_audio_files-len(result.errored_audio_files)} Succeeded.") 157 | print(f"{len(result.errored_audio_files)} Errored.") 158 | print(f"{result.num_numpy_files} numpy files created.") 159 | print(f"{result.num_skipped_dirs} directories skipped.") 160 | if result.errored_audio_files: 161 | print("\nErrored files:") 162 | for file in result.errored_audio_files: 163 | print(file) 164 | 165 | # write codec info to the base codes 
directory 166 | codec_info_path = os.path.join(args.codes_path, "codec_info.json") 167 | with open(codec_info_path, "w") as f: 168 | json.dump(codec_info, f, indent=4) 169 | print("\nCodec info written.") 170 | print("\nDone.") 171 | -------------------------------------------------------------------------------- /codec_bpe/core/trainer.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List, Union, Iterator 2 | import warnings 3 | import numpy as np 4 | from tokenizers import AddedToken 5 | from transformers import PreTrainedTokenizerFast 6 | 7 | from .sentencepiece_bpe import SentencePieceBPETokenizer 8 | from .converter import codes_to_chars, validate_unicode_offset, UNICODE_OFFSET 9 | from .utils import get_codes_files 10 | 11 | class Trainer: 12 | def __init__( 13 | self, 14 | num_codebooks: int, 15 | codebook_size: int, 16 | codec_framerate: Optional[float] = None, 17 | chunk_size_secs: Optional[int] = None, 18 | vocab_size: int = 30000, 19 | min_frequency: int = 2, 20 | special_tokens: Optional[List[Union[str, AddedToken]]] = None, 21 | bos_token: Optional[str] = None, 22 | eos_token: Optional[str] = None, 23 | unk_token: Optional[str] = None, 24 | pad_token: Optional[str] = None, 25 | max_token_codebook_ngrams: Optional[int] = None, 26 | unicode_offset: int = UNICODE_OFFSET, 27 | ): 28 | if chunk_size_secs is not None: 29 | if codec_framerate is None: 30 | raise ValueError("If chunk_size_secs is set, codec_framerate must also be set.") 31 | if chunk_size_secs < 1: 32 | raise ValueError("chunk_size_secs must be a positive integer >= 1.") 33 | if eos_token is None and pad_token is None: 34 | raise ValueError( 35 | "Either pad_token or eos_token should be set, otherwise padded batching will not work with this tokenizer." 36 | ) 37 | if max_token_codebook_ngrams is not None and max_token_codebook_ngrams < 0: 38 | raise ValueError("max_token_codebook_ngrams must be a non-negative integer (0 or greater).") 39 | 40 | self.num_codebooks = num_codebooks 41 | self.codebook_size = codebook_size 42 | self.codec_framerate = codec_framerate 43 | self.chunk_size_secs = chunk_size_secs 44 | self.vocab_size = vocab_size 45 | self.min_frequency = min_frequency 46 | self.special_tokens = special_tokens 47 | self.bos_token = bos_token 48 | self.eos_token = eos_token 49 | self.unk_token = unk_token 50 | self.pad_token = pad_token 51 | self.max_token_codebook_ngrams = max_token_codebook_ngrams 52 | self.unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 53 | 54 | if self.special_tokens is None: 55 | self.special_tokens = [] 56 | for special_token in [self.eos_token, self.bos_token, self.unk_token, self.pad_token]: 57 | if special_token is not None and special_token not in self.special_tokens: 58 | self.special_tokens.insert(0, special_token) 59 | 60 | min_vocab_size = self.num_codebooks*self.codebook_size + len(self.special_tokens) 61 | if self.vocab_size < min_vocab_size: 62 | raise ValueError( 63 | f"vocab_size is set to {self.vocab_size} but it must be at least {min_vocab_size} to accommodate " 64 | f"{self.num_codebooks} x {self.codebook_size} codes and {len(self.special_tokens)} special token(s).\n" 65 | f"Consider setting vocab_size to {min_vocab_size} + K, where K is the number of tokens you want to " 66 | "reserve for codebook ngrams (learned merges). K should be a sufficiently large number (e.g. 
>= 10,000) " 67 | "to allow for wide coverage of the most common codebook ngrams in your training data." 68 | ) 69 | 70 | def _iterate_and_convert(self, codes_files: List[str]) -> Iterator[str]: 71 | for codes_file in codes_files: 72 | codes = np.load(codes_file) 73 | if len(codes.shape) == 4: 74 | codes = codes[0, 0] 75 | elif len(codes.shape) == 3: 76 | codes = codes[0] 77 | codes = codes[:self.num_codebooks] 78 | chunk_size = int(self.chunk_size_secs * self.codec_framerate) if self.chunk_size_secs else codes.shape[1] 79 | for i in range(0, codes.shape[1], chunk_size): 80 | chars = codes_to_chars( 81 | codes[:, i:i+chunk_size], 82 | self.codebook_size, 83 | copy_before_conversion=False, 84 | unicode_offset=self.unicode_offset, 85 | ) 86 | yield chars 87 | 88 | def train( 89 | self, 90 | codes_path: str, 91 | codes_filter: Optional[Union[str, List[str]]] = None, 92 | num_files: Optional[int] = None, 93 | ) -> SentencePieceBPETokenizer: 94 | # Compute base alphabet. This should be num_codebooks * codebook_size so that we never split a codeword 95 | # into smaller units. 96 | initial_alphabet = [ 97 | chr(i) for i in range( 98 | self.unicode_offset, 99 | self.unicode_offset + self.num_codebooks * self.codebook_size 100 | ) 101 | ] 102 | 103 | # If max_token_codebook_ngrams is set, we need to limit the token length to avoid creating tokens that are larger than 104 | # that number of codebook ngrams. A codebook ngram is a sequence of length num_codebooks with one codeword taken from 105 | # each codebook, representing a complete acoustic unit. 106 | # For example if num_codebooks = 4 and max_token_codebook_ngrams = 5, the maximum token length would be 20. 107 | max_token_length = None 108 | if self.max_token_codebook_ngrams is not None: 109 | max_token_length = max(1, self.max_token_codebook_ngrams * self.num_codebooks) 110 | 111 | # Train tokenizer 112 | if max_token_length == 1: 113 | # We don't need to actually train the tokenizer here, just create one with the initial alphabet. 114 | codes_iterator = [] 115 | else: 116 | codes_files = get_codes_files(codes_path, codes_filter, num_files) 117 | if not self.chunk_size_secs and codes_files[0].split("_")[-1].startswith("c"): 118 | warnings.warn( 119 | "The codes files do not have start timestamps, indicating they represent full-length encoded audio files rather than chunks. " 120 | "It is recommended to set `--chunk_size_secs` to a small value (e.g. 30) to avoid the tokenizer training on very long sequences. " 121 | "Training on very long sequences of audio codes can lead to memory issues and poor BPE merges." 122 | ) 123 | codes_iterator = self._iterate_and_convert(codes_files) 124 | # the +1 is because max_token_length is exclusive (e.g., max_token_length of n yields an actual max token length of n-1). 125 | # not sure if this is a bug in Tokenizers or intended behavior. 
126 | max_token_length = max_token_length + 1 if max_token_length is not None else None 127 | 128 | tokenizer = SentencePieceBPETokenizer(unk_token=self.unk_token, add_prefix_space=False) 129 | tokenizer.train_from_iterator( 130 | codes_iterator, 131 | vocab_size=self.vocab_size, 132 | min_frequency=self.min_frequency, 133 | special_tokens=self.special_tokens, 134 | limit_alphabet=len(initial_alphabet), 135 | initial_alphabet=initial_alphabet, 136 | max_token_length=max_token_length, 137 | ) 138 | tokenizer = PreTrainedTokenizerFast( 139 | tokenizer_object=tokenizer, 140 | bos_token=self.bos_token, 141 | eos_token=self.eos_token, 142 | unk_token=self.unk_token, 143 | pad_token=self.pad_token, 144 | clean_up_tokenization_spaces=False, 145 | model_input_names=['input_ids', 'attention_mask'], 146 | ) 147 | return tokenizer 148 | -------------------------------------------------------------------------------- /codec_bpe/tools/codec_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import numpy as np 4 | import os 5 | from enum import Enum 6 | from typing import Tuple, Union, List 7 | from transformers.feature_extraction_utils import BatchFeature, FeatureExtractionMixin 8 | 9 | MAGICODEC_MODELS = { 10 | "magicodec-50hz-base": { 11 | "ckpt": { 12 | "repo_id": "Ereboas/MagiCodec_16k_50hz", 13 | "filename": "MagiCodec-50Hz-Base.ckpt", 14 | }, 15 | }, 16 | } 17 | 18 | WAVTOKENIZER_MODELS = { 19 | "wavtokenizer-small-600-24k-4096": { 20 | "config": { 21 | "repo_id": "novateur/WavTokenizer", 22 | "filename": "wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 23 | }, 24 | "ckpt": { 25 | "repo_id": "novateur/WavTokenizer", 26 | "filename": "WavTokenizer_small_600_24k_4096.ckpt" 27 | }, 28 | }, 29 | "wavtokenizer-small-320-24k-4096": { 30 | "config": { 31 | "repo_id": "novateur/WavTokenizer", 32 | "filename": "wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 33 | }, 34 | "ckpt": { 35 | "repo_id": "novateur/WavTokenizer", 36 | "filename": "WavTokenizer_small_320_24k_4096.ckpt" 37 | }, 38 | }, 39 | "wavtokenizer-medium-speech-320-24k-4096": { 40 | "config": { 41 | "repo_id": "novateur/WavTokenizer-medium-speech-75token", 42 | "filename": "wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 43 | }, 44 | "ckpt": { 45 | "repo_id": "novateur/WavTokenizer-medium-speech-75token", 46 | "filename": "wavtokenizer_medium_speech_320_24k_v2.ckpt" 47 | }, 48 | }, 49 | "wavtokenizer-medium-music-audio-320-24k-4096": { 50 | "config": { 51 | "repo_id": "novateur/WavTokenizer-medium-music-audio-75token", 52 | "filename": "wavtokenizer_mediumdata_music_audio_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 53 | }, 54 | "ckpt": { 55 | "repo_id": "novateur/WavTokenizer-medium-music-audio-75token", 56 | "filename": "wavtokenizer_medium_music_audio_320_24k_v2.ckpt" 57 | }, 58 | }, 59 | "wavtokenizer-large-600-24k-4096": { 60 | "config": { 61 | "repo_id": "novateur/WavTokenizer", 62 | "filename": "wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 63 | }, 64 | "ckpt": { 65 | "repo_id": "novateur/WavTokenizer-large-unify-40token", 66 | "filename": "wavtokenizer_large_unify_600_24k.ckpt" 67 | }, 68 | }, 69 | "wavtokenizer-large-320-24k-4096": { 70 | "config": { 71 | "repo_id": "novateur/WavTokenizer", 72 | "filename": "wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 73 | }, 74 | "ckpt": { 75 | 
"repo_id": "novateur/WavTokenizer-large-speech-75token", 76 | "filename": "wavtokenizer_large_speech_320_v2.ckpt" 77 | }, 78 | } 79 | } 80 | 81 | SIMVQ_MODELS = ["simvq_4k", "simvq_8k", "simvq_65k", "simvq_262k"] 82 | 83 | class CodecTypes(Enum): 84 | ENCODEC = "encodec" 85 | DAC = "dac" 86 | MIMI = "mimi" 87 | FUNCODEC = "funcodec" 88 | XCODEC2 = "xcodec2" 89 | WAVTOKENIZER = "wavtokenizer" 90 | SIMVQ = "simvq" 91 | MAGICODEC = "magicodec" 92 | NEUCODEC = "neucodec" 93 | 94 | @classmethod 95 | def try_get_codec_type(cls, codec_model): 96 | codec_model = codec_model.lower() 97 | if "audio_codec" in codec_model: 98 | return cls.FUNCODEC 99 | if "encodec" in codec_model: 100 | return cls.ENCODEC 101 | if "dac" in codec_model: 102 | return cls.DAC 103 | if "mimi" in codec_model: 104 | return cls.MIMI 105 | if "xcodec2" in codec_model: 106 | return cls.XCODEC2 107 | if "wavtokenizer" in codec_model: 108 | return cls.WAVTOKENIZER 109 | if "simvq" in codec_model: 110 | return cls.SIMVQ 111 | if "magicodec" in codec_model: 112 | return cls.MAGICODEC 113 | if "neucodec" in codec_model: 114 | return cls.NEUCODEC 115 | raise ValueError(f"Could not infer codec type from codec model: {codec_model}. Please specify --codec_type.") 116 | 117 | def __str__(self): 118 | return self.value 119 | def __eq__(self, value): 120 | return str(self) == value 121 | 122 | class DefaultProcessor: 123 | def __call__(self, raw_audio: Union[np.ndarray, List[np.ndarray]], sampling_rate: int, return_tensors: str = "pt") -> BatchFeature: 124 | if not isinstance(raw_audio, list): 125 | raw_audio = [raw_audio] 126 | # Process audio to get padded input tensor 127 | max_audio_len = max([audio.shape[-1] for audio in raw_audio]) 128 | batch_tensors = [F.pad(torch.from_numpy(audio), (0, max_audio_len-audio.shape[-1])) for audio in raw_audio] 129 | inputs = BatchFeature( 130 | data={"input_values": torch.stack(batch_tensors).unsqueeze(1).float()}, 131 | tensor_type=return_tensors, 132 | ) 133 | return inputs 134 | 135 | def load_funcodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 136 | from funcodec.bin.codec_inference import Speech2Token 137 | from huggingface_hub import snapshot_download 138 | cache_path = snapshot_download(codec_model) 139 | config_file = os.path.join(cache_path, "config.yaml") 140 | model_pth = os.path.join(cache_path, "model.pth") 141 | model = Speech2Token(config_file, model_pth, device=str(device)) 142 | model.eval() 143 | processor = DefaultProcessor() 144 | sr_enc = sr_dec = model.model_args.sampling_rate 145 | return model, processor, sr_enc, sr_dec 146 | 147 | def load_xcodec2_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 148 | from huggingface_hub import hf_hub_download 149 | from xcodec2.modeling_xcodec2 import XCodec2Model 150 | from xcodec2.configuration_bigcodec import BigCodecConfig 151 | from safetensors import safe_open 152 | ckpt_path = hf_hub_download(repo_id=codec_model, filename="model.safetensors") 153 | ckpt = {} 154 | with safe_open(ckpt_path, framework="pt", device="cpu") as f: 155 | for k in f.keys(): 156 | ckpt[k.replace(".beta", ".bias")] = f.get_tensor(k) 157 | codec_config = BigCodecConfig.from_pretrained(codec_model) 158 | model = XCodec2Model.from_pretrained(None, config=codec_config, state_dict=ckpt) 159 | model = model.eval().to(device) 160 | processor = DefaultProcessor() 161 | sr_enc = sr_dec = model.feature_extractor.sampling_rate 162 | 
return model, processor, sr_enc, sr_dec 163 | 164 | def load_wavtokenizer_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 165 | # add `WavTokenizer` directory to the import path. 166 | # TODO: get rid of this if a proper WavTokenizer package is ever released. 167 | if not os.path.exists("WavTokenizer"): 168 | raise ValueError( 169 | "WavTokenizer not found in your working directory. Please clone the WavTokenizer repository: " 170 | "`git clone https://github.com/jishengpeng/WavTokenizer.git`" 171 | ) 172 | import sys 173 | sys.path.append("WavTokenizer") 174 | from huggingface_hub import hf_hub_download 175 | from decoder.pretrained import WavTokenizer 176 | if codec_model.lower() not in WAVTOKENIZER_MODELS: 177 | raise ValueError(f"Unsupported wavtokenizer model: {codec_model}. Supported models: {list(WAVTOKENIZER_MODELS)}") 178 | model_info = WAVTOKENIZER_MODELS[codec_model.lower()] 179 | config_file = hf_hub_download(**model_info["config"]) 180 | model_ckpt = hf_hub_download(**model_info["ckpt"]) 181 | model = WavTokenizer.from_pretrained0802(config_file, model_ckpt).to(device) 182 | processor = DefaultProcessor() 183 | sr_enc = sr_dec = model.feature_extractor.encodec.sample_rate 184 | return model, processor, sr_enc, sr_dec 185 | 186 | def load_simvq_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 187 | # add `SimVQ` directory to the import path. 188 | # TODO: get rid of this if a proper SimVQ package is ever released. 189 | if not os.path.exists("SimVQ"): 190 | raise ValueError( 191 | "SimVQ not found in your working directory. Please clone the SimVQ repository: " 192 | "`git clone https://github.com/youngsheen/SimVQ.git`" 193 | ) 194 | import sys 195 | sys.path.append("SimVQ") 196 | import importlib 197 | from huggingface_hub import hf_hub_download 198 | from omegaconf import OmegaConf 199 | if codec_model.lower() not in SIMVQ_MODELS: 200 | raise ValueError(f"Unsupported SimVQ model: {codec_model}. Supported models: {SIMVQ_MODELS}") 201 | config_file = hf_hub_download(repo_id="youngsheen/SimVQ", filename=f"vq_audio_log/{codec_model.lower()}/1second/config.yaml") 202 | model_ckpt = hf_hub_download(repo_id="youngsheen/SimVQ", filename=f"vq_audio_log/{codec_model.lower()}/epoch=49-step=138600.ckpt") 203 | config = OmegaConf.load(config_file) 204 | module, cls = config.model.class_path.rsplit(".", 1) 205 | cls_init = getattr(importlib.import_module(module, package=None), cls) 206 | model = cls_init(**config.model.init_args) 207 | sd = torch.load(model_ckpt, map_location="cpu")["state_dict"] 208 | model.load_state_dict(sd, strict=False) 209 | model = model.eval().to(device) 210 | processor = DefaultProcessor() 211 | sr_enc = sr_dec = config.model.init_args.sample_rate 212 | return model, processor, sr_enc, sr_dec 213 | 214 | def load_magicodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 215 | # add `MagiCodec` directory to the import path. 216 | # TODO: get rid of this if a proper MagiCodec package is ever released. 217 | if not os.path.exists("MagiCodec"): 218 | raise ValueError( 219 | "MagiCodec not found in your working directory. 
Please clone the MagiCodec repository: " 220 | "`git clone https://github.com/Ereboas/MagiCodec.git`" 221 | ) 222 | import sys 223 | sys.path.append("MagiCodec") 224 | from huggingface_hub import hf_hub_download 225 | from codec.generator import Generator 226 | if codec_model.lower() not in MAGICODEC_MODELS: 227 | raise ValueError(f"Unsupported magicodec model: {codec_model}. Supported models: {list(MAGICODEC_MODELS)}") 228 | model_info = MAGICODEC_MODELS[codec_model.lower()] 229 | model_ckpt = hf_hub_download(**model_info["ckpt"]) 230 | model = Generator(token_hz=50) 231 | state_dict = torch.load(model_ckpt, map_location='cpu') 232 | model.load_state_dict(state_dict, strict=False) 233 | model = model.eval().to(device) 234 | processor = DefaultProcessor() 235 | sr_enc = sr_dec = model.sample_rate 236 | return model, processor, sr_enc, sr_dec 237 | 238 | def load_neucodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 239 | if "distill" in codec_model.lower(): 240 | from neucodec import DistillNeuCodec 241 | model = DistillNeuCodec.from_pretrained(codec_model) 242 | else: 243 | from neucodec import NeuCodec 244 | model = NeuCodec.from_pretrained(codec_model) 245 | model = model.eval().to(device) 246 | processor = DefaultProcessor() 247 | sr_enc = model.feature_extractor.sampling_rate 248 | sr_dec = model.sample_rate 249 | return model, processor, sr_enc, sr_dec 250 | 251 | def load_transformers_codec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, FeatureExtractionMixin, int, int]: 252 | from transformers import AutoModel, AutoProcessor 253 | model = AutoModel.from_pretrained(codec_model).to(device) 254 | processor = AutoProcessor.from_pretrained(codec_model) 255 | sr_enc = sr_dec = model.config.sampling_rate 256 | return model, processor, sr_enc, sr_dec 257 | 258 | def load_codec_model( 259 | codec_type: CodecTypes, 260 | codec_model: str, 261 | device: Union[str, torch.device], 262 | ) -> Tuple[torch.nn.Module, Union[DefaultProcessor, FeatureExtractionMixin], int, int]: 263 | if codec_type == CodecTypes.FUNCODEC: 264 | return load_funcodec_model(codec_model, device) 265 | elif codec_type == CodecTypes.XCODEC2: 266 | return load_xcodec2_model(codec_model, device) 267 | elif codec_type == CodecTypes.WAVTOKENIZER: 268 | return load_wavtokenizer_model(codec_model, device) 269 | elif codec_type == CodecTypes.SIMVQ: 270 | return load_simvq_model(codec_model, device) 271 | elif codec_type == CodecTypes.MAGICODEC: 272 | return load_magicodec_model(codec_model, device) 273 | elif codec_type == CodecTypes.NEUCODEC: 274 | return load_neucodec_model(codec_model, device) 275 | else: 276 | return load_transformers_codec_model(codec_model, device) -------------------------------------------------------------------------------- /codec_bpe/tools/audio_encoder.py: -------------------------------------------------------------------------------- 1 | import librosa 2 | import os 3 | import shutil 4 | import numpy as np 5 | import torch 6 | import math 7 | from typing import Optional, List, Union, Tuple, Dict 8 | from tqdm import tqdm 9 | 10 | from .codec_utils import CodecTypes, load_codec_model 11 | 12 | SUPPORTED_EXTENSIONS = [".mp3", ".wav", ".flac", ".opus"] 13 | 14 | class AudioEncodeResult: 15 | def __init__(self): 16 | self.num_audio_files = 0 17 | self.num_numpy_files = 0 18 | self.num_skipped_dirs = 0 19 | self.errored_audio_files = [] 20 | 21 | class AudioEncoder: 22 | def __init__( 23 | self, 24 | 
codec_model: str, 25 | codec_type: Optional[CodecTypes] = None, 26 | device: Optional[Union[str, torch.device]] = None, 27 | chunk_size_secs: float = 30.0, 28 | context_secs: float = 0.0, 29 | batch_size: int = 1, 30 | bandwidth: Optional[float] = None, 31 | n_quantizers: Optional[int] = None, 32 | stereo: bool = False, 33 | file_per_chunk: bool = False, 34 | ): 35 | self.codec_model = codec_model 36 | self.codec_type = codec_type 37 | if self.codec_type is None: 38 | self.codec_type = CodecTypes.try_get_codec_type(self.codec_model) 39 | self.device = device 40 | if self.device is None: 41 | self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 42 | elif isinstance(self.device, str): 43 | self.device = torch.device(self.device) 44 | self.chunk_size_secs = chunk_size_secs 45 | self.context_secs = context_secs 46 | self.batch_size = batch_size 47 | if self.batch_size > 1 and self.codec_type in [CodecTypes.XCODEC2, CodecTypes.NEUCODEC]: 48 | raise ValueError("XCodec2 and NeuCodec only support batch size 1 for now.") 49 | self.bandwidth = bandwidth 50 | # support bandwidth in kbps or bps 51 | if self.bandwidth is not None: 52 | if self.codec_type == CodecTypes.FUNCODEC and self.bandwidth <= 16.0: 53 | self.bandwidth *= 1000 54 | if self.codec_type == CodecTypes.ENCODEC and self.bandwidth > 24.0: 55 | self.bandwidth /= 1000 56 | self.n_quantizers = n_quantizers 57 | self.stereo = stereo 58 | self.file_per_chunk = file_per_chunk 59 | 60 | # load the codec model 61 | self.model, self.processor, self.sr_enc, self.sr_dec = load_codec_model(self.codec_type, self.codec_model, self.device) 62 | self.chunk_size_samples = int(self.chunk_size_secs * self.sr_enc) 63 | self.context_samples = int(max(0.0, self.context_secs-self.chunk_size_secs) * self.sr_enc) 64 | 65 | def _encode_batch(self, batch: List[np.ndarray]) -> Tuple[torch.Tensor, float]: 66 | # Process audio to get padded input tensor 67 | inputs = self.processor(raw_audio=batch, sampling_rate=self.sr_enc, return_tensors="pt") 68 | if self.codec_type != CodecTypes.NEUCODEC: 69 | inputs = inputs.to(self.device) 70 | input_values = inputs.input_values 71 | 72 | # Encode the batch 73 | with torch.no_grad(): 74 | if self.codec_type == CodecTypes.FUNCODEC: 75 | encoded_batch, _, _, _ = self.model( 76 | input_values, 77 | bit_width=int(self.bandwidth) if self.bandwidth is not None else None, 78 | run_mod="encode", 79 | ) 80 | # Permute dimensions to match expected format 81 | audio_codes = torch.permute(encoded_batch[0], (1, 0, 2)) 82 | elif self.codec_type == CodecTypes.XCODEC2: 83 | input_values = input_values.squeeze(1) 84 | audio_codes = self.model.encode_code(input_values, sample_rate=self.sr_enc) 85 | elif self.codec_type == CodecTypes.WAVTOKENIZER: 86 | input_values = input_values.squeeze(1) 87 | bandwidth_id = torch.tensor([0]).to(self.device) 88 | _, audio_codes = self.model.encode_infer(input_values, bandwidth_id=bandwidth_id) 89 | # Permute dimensions to match expected format 90 | audio_codes = torch.permute(audio_codes, (1, 0, 2)) 91 | elif self.codec_type == CodecTypes.SIMVQ: 92 | _, _, audio_codes, _ = self.model.encode(input_values) 93 | audio_codes = audio_codes.view(input_values.shape[0], 1, -1) 94 | elif self.codec_type == CodecTypes.MAGICODEC: 95 | with torch.autocast( 96 | device_type = "cuda", 97 | dtype = torch.bfloat16, 98 | enabled = self.device.type == "cuda" and torch.cuda.is_bf16_supported(), 99 | ): 100 | x = self.model.pad_audio(input_values) 101 | z_e = self.model.encoder(x) 102 | _, audio_codes = 
self.model.quantizer.inference(z_e)
103 |                     audio_codes = audio_codes.unsqueeze(1)
104 |             elif self.codec_type == CodecTypes.NEUCODEC:
105 |                 audio_codes = self.model.encode_code(input_values)
106 |             else:
107 |                 encode_kwargs = {}
108 |                 if self.codec_type == CodecTypes.DAC:
109 |                     encode_kwargs["n_quantizers"] = self.n_quantizers
110 |                 elif self.codec_type == CodecTypes.MIMI:
111 |                     encode_kwargs["num_quantizers"] = self.n_quantizers
112 |                 elif self.codec_type == CodecTypes.ENCODEC:
113 |                     encode_kwargs["bandwidth"] = self.bandwidth
114 |                 outputs = self.model.encode(**inputs, **encode_kwargs)
115 |                 audio_codes = outputs.audio_codes
116 | 
117 |         samples_per_frame = math.ceil(input_values.shape[-1] / audio_codes.shape[-1])
118 |         return audio_codes, samples_per_frame
119 | 
120 |     def _process_batch(
121 |         self, 
122 |         batch: List[np.ndarray], 
123 |         batch_info: List[Tuple[str, str, int, float, bool]], 
124 |         encoded_file_chunks: List[List[np.ndarray]],
125 |     ) -> Tuple[int, List[str]]:
126 |         errored_files = []
127 |         num_numpy_files = 0
128 |         if not batch:
129 |             return num_numpy_files, errored_files
130 | 
131 |         try:
132 |             audio_codes, samples_per_frame = self._encode_batch(batch)
133 | 
134 |             # Save the non-padded part of the encoded audio
135 |             batch_dim = 1 if self.codec_type == CodecTypes.ENCODEC else 0
136 |             for i, (file_path, numpy_root, channel, start_secs, end_of_file) in enumerate(batch_info):
137 |                 encoded_chunk = audio_codes.select(batch_dim, i).unsqueeze(batch_dim)
138 |                 context_len = math.ceil(self.context_samples / samples_per_frame)
139 |                 non_padded_len = math.ceil(batch[i].shape[-1] / samples_per_frame)
140 |                 encoded_chunk = encoded_chunk[..., context_len:non_padded_len]
141 | 
142 |                 # Save encoded chunk to numpy file
143 |                 if not self.file_per_chunk:
144 |                     encoded_file_chunks[channel].append(encoded_chunk.cpu().numpy())
145 |                 if self.file_per_chunk or end_of_file:
146 |                     file_name_noext = os.path.basename(os.path.splitext(file_path)[0])
147 |                     start_secs_whole = int(start_secs)
148 |                     start_secs_ms = round((start_secs - start_secs_whole) * 1000)
149 |                     timestamp_slot = f"_t{start_secs_whole:06d}_{start_secs_ms:03d}" if self.file_per_chunk else ""
150 |                     numpy_filepath = os.path.join(numpy_root, f"{file_name_noext}_c{channel}{timestamp_slot}.npy")
151 |                     os.makedirs(os.path.dirname(numpy_filepath), exist_ok=True)
152 |                     if self.file_per_chunk:
153 |                         np.save(numpy_filepath, encoded_chunk.cpu().numpy(), allow_pickle=False)
154 |                     else:
155 |                         encoded_file = np.concatenate(encoded_file_chunks[channel], axis=-1)
156 |                         np.save(numpy_filepath, encoded_file, allow_pickle=False)
157 |                         encoded_file_chunks[channel].clear()
158 |                     num_numpy_files += 1
159 | 
160 |         except Exception as e:
161 |             print(f"Error encoding batch: {e}")
162 |             errored_files.extend(set([info[0] for info in batch_info]))
163 | 
164 |         return num_numpy_files, errored_files
165 | 
166 |     def encode_audio(
167 |         self, 
168 |         audio_path: str, 
169 |         codes_path: str, 
170 |         extensions: List[str] = SUPPORTED_EXTENSIONS,
171 |         audio_filter: Optional[Union[str, List[str]]] = None,
172 |         overwrite: bool = False,
173 |     ) -> AudioEncodeResult:
174 |         # traverse the audio directory recursively and convert in each subdirectory containing
175 |         # audio files with the specified extensions
176 |         if isinstance(audio_filter, str):
177 |             audio_filter = [audio_filter]
178 |         result = AudioEncodeResult()
179 |         batch = []
180 |         batch_info = []
181 |         encoded_file_chunks = [[], []] if self.stereo else [[]]
182 |         for root, _, files in os.walk(audio_path):
183 |             files = 
sorted([os.path.join(root, f) for f in files if os.path.splitext(f)[1] in extensions]) 184 | if audio_filter: 185 | files = [f for f in files if any([filter_ in f for filter_ in audio_filter])] 186 | if len(files) == 0: 187 | continue 188 | numpy_root = root.replace(audio_path, codes_path) 189 | if os.path.exists(numpy_root): 190 | if overwrite: 191 | shutil.rmtree(numpy_root) 192 | else: 193 | print(f"Skipping {root} because {numpy_root} already exists.") 194 | result.num_skipped_dirs += 1 195 | continue 196 | print(f"Converting in {root}...") 197 | for file_path in tqdm(files, desc="Files"): 198 | result.num_audio_files += 1 199 | try: 200 | # Load the audio file 201 | audio, _ = librosa.load(file_path, sr=self.sr_enc, mono=not self.stereo) 202 | except Exception as e: 203 | print(f"Error loading {file_path}: {e}") 204 | result.errored_audio_files.append(file_path) 205 | continue 206 | 207 | # Encode it in chunks of size chunk_size_secs on each channel independently 208 | start = 0 209 | while True: 210 | end = start + self.chunk_size_samples 211 | end_of_file = end >= audio.shape[-1] 212 | start_with_context = start - self.context_samples 213 | audio_chunk = audio[..., max(0, start_with_context):end] 214 | if audio_chunk.ndim == 1: 215 | audio_chunk = np.expand_dims(audio_chunk, axis=0) 216 | if start_with_context < 0: 217 | # if we are at the beginning of the audio, pad with a silent context 218 | audio_chunk = np.pad(audio_chunk, ((0, 0), (-start_with_context, 0)), mode='constant') 219 | for channel in range(audio_chunk.shape[0]): 220 | batch.append(audio_chunk[channel]) 221 | batch_info.append((file_path, numpy_root, channel, start / self.sr_enc, end_of_file)) 222 | 223 | # Process batch if it reaches the specified size 224 | if len(batch) == self.batch_size: 225 | num_numpy_files, errored_files = self._process_batch(batch, batch_info, encoded_file_chunks) 226 | result.num_numpy_files += num_numpy_files 227 | result.errored_audio_files.extend(errored_files) 228 | batch.clear() 229 | batch_info.clear() 230 | 231 | if end_of_file: 232 | break 233 | start = end 234 | 235 | # Process any remaining chunks in the batch 236 | if batch: 237 | num_numpy_files, errored_files = self._process_batch(batch, batch_info, encoded_file_chunks) 238 | result.num_numpy_files += num_numpy_files 239 | result.errored_audio_files.extend(errored_files) 240 | 241 | result.errored_audio_files = sorted(set(result.errored_audio_files)) 242 | return result 243 | 244 | def get_codec_info(self) -> Dict[str, Union[str, int, float]]: 245 | # encode ten seconds of audio and get the number of codebooks and framerate 246 | dummy_audio = np.zeros(10 * self.sr_enc) 247 | audio_codes, samples_per_frame = self._encode_batch([dummy_audio]) 248 | # get stats 249 | if self.codec_type == CodecTypes.FUNCODEC: 250 | codebook_size = self.model.model_args.quantizer_conf["codebook_size"] 251 | elif self.codec_type == CodecTypes.XCODEC2: 252 | codebook_size = 65536 253 | elif self.codec_type == CodecTypes.WAVTOKENIZER: 254 | codebook_size = self.model.feature_extractor.encodec.quantizer.bins 255 | elif self.codec_type == CodecTypes.SIMVQ: 256 | codebook_size = self.model.quantize.n_e 257 | elif self.codec_type == CodecTypes.MAGICODEC: 258 | codebook_size = self.model.codebook_size 259 | elif self.codec_type == CodecTypes.NEUCODEC: 260 | codebook_size = 65536 261 | else: 262 | codebook_size = self.model.config.codebook_size 263 | 264 | # write codec info to json 265 | codec_info = { 266 | "codec_type": str(self.codec_type), 267 | 
"codec_model": self.codec_model, 268 | "sampling_rate_encoder": self.sr_enc, 269 | "sampling_rate_decoder": self.sr_dec, 270 | "num_codebooks": audio_codes.shape[-2], 271 | "codebook_size": codebook_size, 272 | "framerate": self.sr_enc / samples_per_frame, 273 | } 274 | return codec_info 275 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # codec-bpe 2 | ![codec_bpe.png](img/codec_bpe.png) 3 | 4 | Codec BPE is an implementation of [Acoustic BPE](https://arxiv.org/abs/2310.14580) (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs such as [EnCodec](https://github.com/facebookresearch/encodec) (Défossez et al., 2022), [DAC](https://github.com/descriptinc/descript-audio-codec) (Kumar et al., 2023), [Mimi](https://huggingface.co/kyutai/mimi) (Défossez et al., 2024), and [FunCodec](https://funcodec.github.io/) (Du et al., 2024). Built on top of the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library. 5 | 6 | Codec BPE flattens multi-level codes from Residual Vector Quantizers (RVQ) and converts them into unicode strings for tokenization into compressed token sequences. For example, a single Codec BPE token might represent a 4-gram of codes from 4 codebooks representing a single acoustic unit, a 6-gram comprising a whole acoustic unit and half of the next one, or even an 8-gram represnting two whole acoustic units. Depending on the codec, vocab size and type of audio, this can yield savings of 2-5x in sequence length compared to directly modeling the flattened codebooks. 7 | 8 | Codec BPE can also be used with single-level codecs such as [XCodec2](https://github.com/zhenye234/X-Codec-2.0) (Ye et al., 2025), [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) (Ji et al., 2024), [SimVQ](https://github.com/youngsheen/SimVQ) (Zhu et al., 2024), [MagiCodec](https://github.com/Ereboas/MagiCodec) (Song et al., 2025), and [NeuCodec](https://github.com/neuphonic/neucodec) (Julian et al., 2025). In this case, a single Codec BPE token could represent one or more codes where each code represents a whole acoustic unit. 9 | 10 | **Using Codec BPE allows efficient audio language modeling with multi-level codecs to be done with vanilla LLM architectures, meaning no custom architecture is needed to deal with modeling the RVQ. Your model will already be compatible with the full ecosystem of training and inference tools available for [HuggingFace Transformers](https://github.com/huggingface/transformers), such as [vLLM](https://github.com/vllm-project/vllm) and [Ollama](https://ollama.com/)!** 11 | 12 | ## 🚀 Updates 13 | **2025-12-01** 14 | - Added support for [NeuCodec](https://huggingface.co/neuphonic/neucodec), a new high-quality single-level codec with a 50 Hz framerate! NeuCodec extends XCodec2 with inference speedups, an upsampling decoder, and a commercially permissive license. Use `--codec_model neuphonic/neucodec` when encoding audio with `codec_bpe.audio_to_codes` to encode using the NeuCodec model. See [here](#train-a-tokenizer-from-audio-files) for a usage example. 15 | 16 | **2025-06-22** 17 | - Added support for [MagiCodec](https://github.com/Ereboas/MagiCodec), a new **streaming** single-level codec with a 50 Hz framerate! Use `--codec_model MagiCodec-50Hz-Base` when encoding audio with `codec_bpe.audio_to_codes` to encode using the MagiCodec model. See [here](#train-a-tokenizer-from-audio-files) for a usage example. 
18 | 19 | **Older updates** 20 | - See [CHANGELOG.md](CHANGELOG.md) for a complete list of updates. 21 | 22 | ## Setup 23 | ```bash 24 | pip install codec-bpe 25 | ``` 26 | If you want to use the `--codec_type funcodec` or `--codec_model alibaba-damo/...` options with `codec_bpe.audio_to_codes`, run: 27 | ```bash 28 | pip install codec-bpe[funcodec] 29 | ``` 30 | If you want to use the `--codec_type xcodec2` or `--codec_model HKUSTAudio/xcodec2` options with `codec_bpe.audio_to_codes`, run: 31 | ```bash 32 | pip install codec-bpe[xcodec2] 33 | ``` 34 | If you want to use the `--codec_type wavtokenizer` or `--codec_model wavtokenizer-*` options with `codec_bpe.audio_to_codes`, run: 35 | ```bash 36 | pip install codec-bpe[wavtokenizer] 37 | # WavTokenizer is not an installable package so you need to clone the repository into your working directory manually: 38 | cd your/working/dir 39 | git clone https://github.com/jishengpeng/WavTokenizer.git 40 | # Note: WavTokenizer requirements are all version pinned and include both training and inference dependencies. 41 | # I recommend either using a dedicated environment or cherry-picking the requirements you need for inference and installing them manually. 42 | # For example, I had no issue running inference with latest versions of torch, numpy, and transformers. 43 | pip install -r WavTokenizer/requirements.txt 44 | ``` 45 | If you want to use the `--codec_type simvq` or `--codec_model simvq_*` options with `codec_bpe.audio_to_codes`, run: 46 | ```bash 47 | pip install codec-bpe[simvq] 48 | # SimVQ is not an installable package so you need to clone the repository into your working directory manually: 49 | cd your/working/dir 50 | git clone https://github.com/youngsheen/SimVQ.git 51 | pip install -r SimVQ/requirements.txt 52 | ``` 53 | If you want to use the `--codec_type magicodec` or `--codec_model MagiCodec-50Hz-Base` options with `codec_bpe.audio_to_codes`, run: 54 | ```bash 55 | pip install codec-bpe[magicodec] 56 | # MagiCodec is not an installable package so you need to clone the repository into your working directory manually: 57 | cd your/working/dir 58 | git clone https://github.com/Ereboas/MagiCodec.git 59 | cd MagiCodec 60 | # Follow setup instructions for MagiCodec [here](https://github.com/Ereboas/MagiCodec#env-setup) 61 | ``` 62 | If you want to use the `--codec_type neucodec` or `--codec_model neuphonic/neucodec` options with `codec_bpe.audio_to_codes`, run: 63 | ```bash 64 | pip install codec-bpe[neucodec] 65 | ``` 66 | 67 | ## Supported Codecs 68 | | Model | Sample Rate (kHz)* | Framerate (Hz)* | Max Codebooks | Codebook Size | Max Bandwidth (kbps)* | Training Domain | 69 | |:--------------------------------------------------------------------|:------------------:|:--------------:|:--------------:|:-------------:|:-------------------------:|:---------------:| 70 | | [🤗 EnCodec 24khz](https://huggingface.co/facebook/encodec_24khz) | 24 | 75 | 32 | 1024 | 24 | General | 71 | | [🤗 DAC 44khz](https://huggingface.co/descript/dac_44khz) | 44.1 | 86.1328125 | 9 | 1024 | 7.8 | General | 72 | | [🤗 DAC 24khz](https://huggingface.co/descript/dac_24khz) | 24 | 75 | 32 | 1024 | 24 | General | 73 | | [🤗 DAC 16khz](https://huggingface.co/descript/dac_16khz) | 16 | 50 | 12 | 1024 | 6 | General | 74 | | [🤗 Mimi](https://huggingface.co/kyutai/mimi) | 24 | 12.5 | 32 | 2048 | 4.4 | Speech | 75 | | [🤗 XCodec2](https://huggingface.co/HKUSTAudio/xcodec2) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 76 | | [🤗 FunCodec 
zh_en-general-16k-nq32ds640](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch) | 16 | 25 | 32 | 1024 | 8 | General | 77 | | [🤗 FunCodec zh_en-general-16k-nq32ds320](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch) | 16 | 50 | 32 | 1024 | 16 | General | 78 | | [🤗 FunCodec en-libritts-16k-nq32ds640](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch) | 16 | 25 | 32 | 1024 | 8 | Audiobooks | 79 | | [🤗 FunCodec en-libritts-16k-nq32ds320](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch) | 16 | 50 | 32 | 1024 | 16 | Audiobooks | 80 | | [🤗 WavTokenizer-small-600-24k-4096](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_600_24k_4096.ckpt) | 24 | 40 | 1 | 4096 | 0.48 | Speech | 81 | | [🤗 WavTokenizer-small-320-24k-4096](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_320_24k_4096.ckpt) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 82 | | [🤗 WavTokenizer-medium-speech-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-medium-speech-75token) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 83 | | [🤗 WavTokenizer-medium-music-audio-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-medium-music-audio-75token) | 24 | 75 | 1 | 4096 | 0.9 | General | 84 | | [🤗 WavTokenizer-large-600-24k-4096](https://huggingface.co/novateur/WavTokenizer-large-unify-40token) | 24 | 40 | 1 | 4096 | 0.48 | General | 85 | | [🤗 WavTokenizer-large-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-large-speech-75token) | 24 | 75 | 1 | 4096 | 0.9 | General | 86 | | [🤗 simvq_4k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_4k) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 87 | | [🤗 simvq_8k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_8k) | 24 | 75 | 1 | 8192 | 0.975 | Speech | 88 | | [🤗 simvq_65k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_65k) | 24 | 75 | 1 | 65536 | 1.2 | Speech | 89 | | [🤗 simvq_262k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_262k) | 24 | 75 | 1 | 262144 | 1.35 | Speech | 90 | | [🤗 MagiCodec-50Hz-Base](https://huggingface.co/Ereboas/MagiCodec_16k_50hz) | 16 | 50 | 1 | 131072 | 0.85 | Audiobooks | 91 | | [🤗 NeuCodec](https://huggingface.co/neuphonic/neucodec) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 92 | | [🤗 Distill-NeuCodec](https://huggingface.co/neuphonic/distill-neucodec) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 93 | 94 | \* Sample Rate (kHz) is the sampling rate of the audio input to the codec. 95 | 96 | \* Framerate (Hz) is the number of timesteps (acoustic units of size `num_codebooks`) per second output by the codec. 97 | 98 | \* Bandwidth (kbps) = `framerate (Hz) x num_codebooks x log2(codebook_size) / 1000`. 99 | 100 | ## Usage 101 | 102 | ### Convert audio codes to and from unicode strings 103 | Use your codec of choice (e.g., EnCodec, DAC, Mimi, XCodec2, FunCodec, WavTokenizer, SimVQ) to encode your audio into a torch tensor or numpy array of codes of shape (num_codebooks, length), then use the provided converter methods to convert to and from unicode strings. 104 | 105 | **Note:** In the Acoustic BPE paper, a single-level codec was used (HuBERT + k-means), where each encoded timestep consisted of a single code which was converted to a single unicode character. Here, we support multi-level codecs based on Residual Vector Quantizers. 
If num_codebooks > 1, a flattening pattern is used to interleave all codebooks into a single level before mapping to unicode. For example, if 4 codebooks are used then each encoded timestep would consist of 4 codes (one from each codebook) and would be converted to a unicode 4-gram. 106 | 107 | Example: audio language modeling using EnCodec 24 kHz at 3 kbps (4 codebooks): 108 | ```python 109 | import torch 110 | import librosa 111 | import soundfile as sf 112 | from transformers import ( 113 | EncodecModel, 114 | AutoModelForCausalLM, 115 | AutoProcessor, 116 | AutoTokenizer, 117 | ) 118 | from codec_bpe import codes_to_chars, chars_to_codes 119 | 120 | # load a Codec BPE tokenizer and compatible language model 121 | device = "cuda" if torch.cuda.is_available() else "cpu" 122 | tokenizer = AutoTokenizer.from_pretrained("output/my_tokenizer") 123 | model = AutoModelForCausalLM.from_pretrained("output/my_model").to(device) 124 | 125 | # load the EnCodec model 126 | encodec_modelname = "facebook/encodec_24khz" 127 | encodec_model = EncodecModel.from_pretrained(encodec_modelname).to(device) 128 | encodec_processor = AutoProcessor.from_pretrained(encodec_modelname) 129 | 130 | # (1) encode audio using EnCodec 131 | audio, sr = librosa.load("some_audio.mp3", sr=encodec_model.config.sampling_rate, mono=True) 132 | inputs = encodec_processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt").to(device) 133 | with torch.no_grad(): 134 | encoded_audio = encodec_model.encode(**inputs, bandwidth=3.0).audio_codes[0, 0] 135 | 136 | # (2) convert the audio codes to a unicode string and tokenize it 137 | unicode_str = codes_to_chars(encoded_audio, codebook_size=encodec_model.config.codebook_size) 138 | inputs = tokenizer(unicode_str, return_tensors="pt").to(device) 139 | 140 | # (3) generate tokens from the model 141 | outputs = model.generate(**inputs, do_sample=True, max_new_tokens=300) 142 | 143 | # (4) detokenize the output back into a unicode string and convert it back to audio codes 144 | unicode_str_2 = tokenizer.decode(outputs[0], skip_special_tokens=False) 145 | encoded_audio_2 = chars_to_codes( 146 | unicode_str_2, 147 | num_codebooks=encoded_audio.shape[0], 148 | codebook_size=encodec_model.config.codebook_size, 149 | return_tensors="pt", 150 | ).to(device) 151 | 152 | # (5) decode the generated audio using EnCodec 153 | with torch.no_grad(): 154 | audio_2 = encodec_model.decode(encoded_audio_2.unsqueeze(0).unsqueeze(0), [None]).audio_values[0, 0] 155 | sf.write("some_audio_output.wav", audio_2.cpu().numpy(), sr) 156 | ``` 157 | 158 | ### Train a tokenizer from audio files 159 | To train a tokenizer from audio files: 160 | 161 | 1. 
Use your codec of choice (e.g., EnCodec, DAC, Mimi, XCodec2, FunCodec, WavTokenizer, SimVQ) to encode each audio file into a directory of numpy arrays (.npy files); a quick way to inspect the result is sketched after the commands below:
162 | ```bash
163 | # encode audio files using EnCodec 24 kHz at 3 kbps (4 codebooks)
164 | python -m codec_bpe.audio_to_codes \
165 |     --audio_path path/to/audio \
166 |     --codec_model facebook/encodec_24khz \
167 |     --bandwidth 3.0 \
168 |     --batch_size 8
169 | 
170 | # encode audio files using the first 4 codebooks of DAC 44kHz
171 | python -m codec_bpe.audio_to_codes \
172 |     --audio_path path/to/audio \
173 |     --codec_model descript/dac_44khz \
174 |     --n_quantizers 4 \
175 |     --batch_size 8
176 | 
177 | # encode audio files using the first 6 codebooks of Mimi (24kHz)
178 | python -m codec_bpe.audio_to_codes \
179 |     --audio_path path/to/audio \
180 |     --codec_model kyutai/mimi \
181 |     --n_quantizers 6 \
182 |     --batch_size 8
183 | 
184 | # encode audio files using XCodec2 (16kHz, there is only 1 codebook)
185 | python -m codec_bpe.audio_to_codes \
186 |     --audio_path path/to/audio \
187 |     --codec_model HKUSTAudio/xcodec2 \
188 |     --batch_size 1 # XCodec2 only supports batch size 1 for now.
189 | 
190 | # encode audio files using FunCodec (16kHz) at 1.5 kbps (6 codebooks)
191 | python -m codec_bpe.audio_to_codes \
192 |     --audio_path path/to/audio \
193 |     --codec_model alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch \
194 |     --bandwidth 1500 \
195 |     --batch_size 8
196 | 
197 | # encode audio files using WavTokenizer at 0.9 kbps (24kHz -> 75Hz, only 1 codebook of 4096 codes)
198 | python -m codec_bpe.audio_to_codes \
199 |     --audio_path path/to/audio \
200 |     --codec_model wavtokenizer-large-320-24k-4096 \
201 |     --batch_size 8
202 | 
203 | # encode audio files using SimVQ at 0.9 kbps (24kHz -> 75Hz, only 1 codebook of 4096 codes)
204 | python -m codec_bpe.audio_to_codes \
205 |     --audio_path path/to/audio \
206 |     --codec_model simvq_4k \
207 |     --batch_size 8
208 | 
209 | # encode audio files using SimVQ at 0.9 kbps in tiny chunks of 80ms with a 400ms context to simulate streaming encoding
210 | python -m codec_bpe.audio_to_codes \
211 |     --audio_path path/to/audio \
212 |     --codec_model simvq_4k \
213 |     --batch_size 128 \
214 |     --chunk_size_secs 0.08 \
215 |     --context_secs 0.4
216 | 
217 | # encode audio files using MagiCodec at 0.85 kbps (16kHz -> 50Hz, only 1 codebook of 131072 codes)
218 | python -m codec_bpe.audio_to_codes \
219 |     --audio_path path/to/audio \
220 |     --codec_model MagiCodec-50Hz-Base \
221 |     --batch_size 8
222 | 
223 | # encode audio files using MagiCodec at 0.85 kbps in tiny chunks of 80ms with a 1s context to simulate streaming encoding
224 | python -m codec_bpe.audio_to_codes \
225 |     --audio_path path/to/audio \
226 |     --codec_model MagiCodec-50Hz-Base \
227 |     --batch_size 128 \
228 |     --chunk_size_secs 0.08 \
229 |     --context_secs 1.0
230 | 
231 | # encode audio files using NeuCodec at 0.8 kbps (16kHz -> 50Hz, only 1 codebook of 65536 codes)
232 | python -m codec_bpe.audio_to_codes \
233 |     --audio_path path/to/audio \
234 |     --codec_model neuphonic/neucodec \
235 |     --batch_size 1 # NeuCodec only supports batch size 1 for now.
236 | ```
237 | 
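Each audio file yields one `.npy` array of codes per channel, and `codec_bpe.audio_to_codes` also writes a `codec_info.json` describing the codec used. Before training a tokenizer, it can be worth sanity-checking a few of the arrays. A minimal sketch, assuming the default EnCodec output layout used in step 2 below; the file name is hypothetical and the exact array shape depends on the codec:

```python
import numpy as np

# hypothetical path: one encoded mono channel of "some_audio.mp3"
codes = np.load("output/codes/encodec_24khz/30.0s_0.0s/mono/some_audio_c0.npy")

# depending on the codec, the saved array may carry leading singleton batch
# dimensions, e.g. (1, 1, num_codebooks, num_frames); squeeze down to 2D
while codes.ndim > 2:
    codes = codes[0]
print(codes.shape)  # (num_codebooks, num_frames), e.g. (4, 2250) for 30s at 75 Hz
```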
238 | 2. Suppose you want to use the first 4 codebooks of [EnCodec 24 kHz](https://huggingface.co/facebook/encodec_24khz). Run:
239 | ```bash
240 | python -m codec_bpe.train_tokenizer \
241 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
242 |     --chunk_size_secs 30 \
243 |     --vocab_size 30000 \
244 |     --pad_token "<pad>"
245 | ```
246 | Here:
247 | - `chunk_size_secs` specifies how many seconds' worth of timesteps get converted to unicode and returned to the underlying Tokenizers trainer at a time.
248 | - `vocab_size` specifies the number of tokens (including the base vocabulary of individual unicode characters) that you want your tokenizer to have. The base vocabulary size is `num_codebooks` x `codebook_size`. For example, the command above would yield a tokenizer with a base vocabulary of 4096 individual unicode character tokens, each representing a single code from a single codebook, and 25,904 merged "ngram" tokens.
249 | 
250 | By default, the following additional arguments are automatically initialized from the `codec_info.json` file output by `codec_bpe.audio_to_codes`:
251 | - `num_codebooks` specifies how many codebooks should be used (in a flattened pattern) when converting each timestep to unicode. For example, EnCodec 24kHz uses 2 codebooks at 1.5 kbps, 4 codebooks at 3 kbps, 8 codebooks at 6 kbps, etc. Note: when encoding the audio files, you should use at least as many codebooks as you plan to specify here.
252 | - `codebook_size` specifies the size of the codebook. EnCodec 24 kHz uses a codebook size of 1024.
253 | - `codec_framerate` specifies the framerate (number of timesteps per second) of the codec. EnCodec 24 kHz generates 75 timesteps per second.
254 | 
255 | You may also pass these arguments explicitly. For example:
256 | ```bash
257 | python -m codec_bpe.train_tokenizer \
258 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
259 |     --num_codebooks 4 \
260 |     --codebook_size 1024 \
261 |     --codec_framerate 75 \
262 |     --chunk_size_secs 30 \
263 |     --vocab_size 30000 \
264 |     --pad_token "<pad>"
265 | ```
266 | This is useful if you are using audio codes that you generated with a tool other than the `codec_bpe.audio_to_codes` script, or if you wish to use fewer codebooks
267 | for training the tokenizer than you used when encoding the audio files.
268 | 
269 | See [train_tokenizer.py](codec_bpe/train_tokenizer.py) for a complete list of supported arguments.
270 | 
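Once training completes, a quick smoke test is to load the trained tokenizer and check how well it compresses a unicode-converted code sequence. A minimal sketch, assuming the tokenizer was saved to a hypothetical `output/my_tokenizer` directory with `save_pretrained`:

```python
import torch
from transformers import AutoTokenizer
from codec_bpe import codes_to_chars

tokenizer = AutoTokenizer.from_pretrained("output/my_tokenizer")  # assumed save location

# stand-in codes with the right shape; real encoded audio will merge far better
# than random codes, since the BPE merges were learned on real data
num_codebooks, codebook_size = 4, 1024
codes = torch.randint(0, codebook_size, (num_codebooks, 750))  # ~10s at 75 Hz

unicode_str = codes_to_chars(codes, codebook_size=codebook_size)
token_ids = tokenizer(unicode_str).input_ids
print(f"{len(unicode_str)} chars -> {len(token_ids)} tokens "
      f"({len(unicode_str) / len(token_ids):.2f}x compression)")
```
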
271 | #### Controlling the granularity of Codec BPE tokens
272 | The `max_token_codebook_ngrams` argument can be used to control how many codes can be merged into a single Codec BPE token. This is useful to avoid repetitive patterns in the audio manifesting as redundant tokens in the vocabulary. For example, if long segments of silence exist in the training audio then you may end up with hundreds of tokens that just represent different lengths of silence.
273 | 
274 | To avoid this, you can set `max_token_codebook_ngrams` to the maximum number of codebook ngrams (whole acoustic units) you want to allow a single token to represent. For example, if you set `max_token_codebook_ngrams = 2` while `num_codebooks` is set to 4, then a single Codec BPE token may only hold up to 8 codes:
275 | ```bash
276 | python -m codec_bpe.train_tokenizer \
277 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
278 |     --chunk_size_secs 30 \
279 |     --vocab_size 30000 \
280 |     --pad_token "<pad>" \
281 |     --max_token_codebook_ngrams 2
282 | ```
283 | 
284 | **It is highly recommended to set this argument to a value <= 2 (or <= 4 if num_codebooks is 1) to ensure that your `vocab_size` budget gets distributed across diverse acoustic patterns in your training data.**
285 | 
286 | Setting `max_token_codebook_ngrams = 0` will skip tokenizer training and simply output a base vocabulary of `num_codebooks x codebook_size` tokens, each representing a single code from a single codebook. This is useful if you want to directly model individual codes from the flattened codebooks instead of combining them into n-grams.
287 | 
288 | #### Using a codec with a very large codebook size
289 | If you are using a codec with a very large codebook size (e.g. XCodec2, which has a codebook size of 65536), you may need to adjust the `unicode_offset` argument for `codec_bpe.train_tokenizer` to avoid the non-printable surrogate range 0xD800-0xDFFF:
290 | ```bash
291 | python -m codec_bpe.train_tokenizer \
292 |     --codes_path output/codes/xcodec2/30.0s_0.0s/mono \
293 |     --chunk_size_secs 30 \
294 |     --vocab_size 80000 \
295 |     --pad_token "<pad>" \
296 |     --max_token_codebook_ngrams 4 \
297 |     --unicode_offset 0xE000
298 | ```
299 | 
300 | ### Extend an existing Transformers PreTrainedTokenizer
301 | You may want to train a new Codec BPE tokenizer and then export its trained vocabulary to an existing Transformers tokenizer, for example extending the Llama, Mistral, or Qwen tokenizers for multimodal text-audio language modeling.
302 | 
303 | Suppose you have trained your Codec BPE tokenizer and saved it to `output/encodec_bpe_4cb_30k` and you want to extend the Mistral-7B-v0.1 tokenizer with its vocabulary. Run:
304 | ```bash
305 | python -m codec_bpe.extend_tokenizer \
306 |     --existing_tokenizer mistralai/Mistral-7B-v0.1 \
307 |     --codec_bpe_tokenizer output/encodec_bpe_4cb_30k \
308 |     --additional_special_tokens "" # optional
309 | ```
310 | This will simply add every token in `output/encodec_bpe_4cb_30k/tokenizer.json` to the `mistralai/Mistral-7B-v0.1` tokenizer as a special token and save a copy of the latter. Any additional tokens specified with `--additional_special_tokens` will be appended to the existing tokenizer's additional special token list.
311 | 
312 | #### Avoiding vocabulary conflicts
313 | If the added Codec BPE unicode tokens would conflict with existing tokens in the vocabulary, you can override the default unicode offset using the `unicode_offset` argument for `codec_bpe.train_tokenizer`. By default, unicode characters from the [CJK Unified Ideographs](https://symbl.cc/en/unicode-table/#cjk-unified-ideographs) block are used, following the Acoustic BPE paper. You can set `unicode_offset` to a different value (e.g. 0xE000) to start from a different unicode block that won't conflict with your existing vocabulary.
314 | 
--------------------------------------------------------------------------------