├── codec_bpe
│   ├── core
│   │   ├── __init__.py
│   │   ├── utils.py
│   │   ├── sentencepiece_bpe.py
│   │   ├── converter.py
│   │   └── trainer.py
│   ├── tools
│   │   ├── __init__.py
│   │   ├── extender.py
│   │   ├── lm_dataset_builder.py
│   │   ├── codec_utils.py
│   │   └── audio_encoder.py
│   ├── __init__.py
│   ├── extend_tokenizer.py
│   ├── lm_dataset_stats.py
│   ├── train_tokenizer.py
│   ├── prep_lm_dataset.py
│   └── audio_to_codes.py
├── requirements_neucodec.txt
├── requirements_magicodec.txt
├── requirements_wavtokenizer.txt
├── requirements_funcodec.txt
├── requirements_simvq.txt
├── requirements_xcodec2.txt
├── requirements.txt
├── img
│   └── codec_bpe.png
├── .github
│   └── workflows
│       └── pypi-release.yml
├── LICENSE
├── setup.py
├── .gitignore
├── CHANGELOG.md
└── README.md

/codec_bpe/core/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/codec_bpe/tools/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/requirements_neucodec.txt:
--------------------------------------------------------------------------------
1 | neucodec
--------------------------------------------------------------------------------
/requirements_magicodec.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
--------------------------------------------------------------------------------
/requirements_wavtokenizer.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
--------------------------------------------------------------------------------
/requirements_funcodec.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | funcodec
--------------------------------------------------------------------------------
/requirements_simvq.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | omegaconf
--------------------------------------------------------------------------------
/requirements_xcodec2.txt:
--------------------------------------------------------------------------------
1 | huggingface-hub
2 | xcodec2
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | librosa
2 | numpy
3 | tokenizers>=0.19.0
4 | torch
5 | transformers>=4.45.0
--------------------------------------------------------------------------------
/img/codec_bpe.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AbrahamSanders/codec-bpe/HEAD/img/codec_bpe.png
--------------------------------------------------------------------------------
/codec_bpe/__init__.py:
--------------------------------------------------------------------------------
1 | from .core.converter import (
2 |     codes_to_chars,
3 |     chars_to_codes,
4 |     UNICODE_OFFSET,
5 |     UNICODE_OFFSET_LARGE,
6 | )
7 | 
8 | __version__ = "1.4.1"
--------------------------------------------------------------------------------
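(Editor's illustration, not a repository file.) The converter exports above are the package's core API: `codes_to_chars` flattens a (num_codebooks, seq_length) grid of audio codes into a unicode string, timestep-major, with each codebook shifted into its own character range, and `chars_to_codes` inverts it. A minimal round trip, assuming a hypothetical 2-codebook codec with codebook size 1024:

import numpy as np
from codec_bpe import codes_to_chars, chars_to_codes

codes = np.array([[17,   5, 900],   # codebook 0
                  [ 3, 256,  42]])  # codebook 1
chars = codes_to_chars(codes, codebook_size=1024)
# 6 characters, timestep-major (t0/cb0, t0/cb1, t1/cb0, ...), where codebook i
# is mapped into [UNICODE_OFFSET + i*1024, UNICODE_OFFSET + (i+1)*1024)
assert chars_to_codes(chars, num_codebooks=2, codebook_size=1024) == codes.tolist()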
/.github/workflows/pypi-release.yml:
--------------------------------------------------------------------------------
1 | name: Publish Python package
2 | 
3 | on:
4 |   release:
5 |     types: [ published ]
6 | 
7 | permissions:
8 |   contents: read
9 | 
10 | jobs:
11 |   publish:
12 |     runs-on: ubuntu-latest
13 |     steps:
14 |     - uses: actions/checkout@v4
15 |     - name: Set up Python
16 |       uses: actions/setup-python@v5
17 |       with:
18 |         python-version: "3.x"
19 |     - name: Install dependencies
20 |       run: |
21 |         python -m pip install --upgrade pip
22 |         pip install setuptools wheel
23 |     - name: Build a binary wheel
24 |       run: >-
25 |         python setup.py sdist bdist_wheel
26 |     - name: Publish to PyPI
27 |       uses: pypa/gh-action-pypi-publish@release/v1
28 |       with:
29 |         password: ${{ secrets.PYPI_API_TOKEN }}
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 Abraham Sanders
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | -------------------------------------------------------------------------------- /codec_bpe/extend_tokenizer.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from transformers import AutoTokenizer 3 | 4 | from .tools.extender import extend_existing_tokenizer 5 | 6 | if __name__ == "__main__": 7 | parser = argparse.ArgumentParser(description="Extend an existing Transformers tokenizer with codec BPE tokens") 8 | parser.add_argument("--existing_tokenizer", type=str, required=True) 9 | parser.add_argument("--codec_bpe_tokenizer", type=str, required=True) 10 | parser.add_argument("--additional_special_tokens", nargs="+", default=None) 11 | parser.add_argument("--save_path", type=str) 12 | args = parser.parse_args() 13 | 14 | if args.save_path is None: 15 | args.save_path = f"output/{args.existing_tokenizer}_extended" 16 | 17 | existing_tokenizer = AutoTokenizer.from_pretrained(args.existing_tokenizer) 18 | codec_bpe_tokenizer = AutoTokenizer.from_pretrained(args.codec_bpe_tokenizer) 19 | 20 | num_added = extend_existing_tokenizer(existing_tokenizer, codec_bpe_tokenizer, args.additional_special_tokens) 21 | print(f"Added {num_added} tokens to the existing tokenizer {args.existing_tokenizer} and saved it as {args.save_path}.") 22 | existing_tokenizer.save_pretrained(args.save_path) -------------------------------------------------------------------------------- /codec_bpe/tools/extender.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Union, List 2 | from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast 3 | from tqdm import trange 4 | 5 | def extend_existing_tokenizer( 6 | existing_tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 7 | codec_bpe_tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 8 | additional_special_tokens: Optional[List[str]] = None, 9 | ) -> int: 10 | target_tokens = [] 11 | skip_token_ids = set([ 12 | codec_bpe_tokenizer.bos_token_id, 13 | codec_bpe_tokenizer.eos_token_id, 14 | codec_bpe_tokenizer.unk_token_id, 15 | codec_bpe_tokenizer.pad_token_id, 16 | ]) 17 | for i in trange(len(codec_bpe_tokenizer)): 18 | if i in skip_token_ids: 19 | continue 20 | token = codec_bpe_tokenizer.convert_ids_to_tokens(i) 21 | target_tokens.append(token) 22 | 23 | num_added = 0 24 | if additional_special_tokens: 25 | num_added += existing_tokenizer.add_special_tokens( 26 | special_tokens_dict={"additional_special_tokens": additional_special_tokens}, 27 | replace_additional_special_tokens=False, 28 | ) 29 | num_added += existing_tokenizer.add_tokens(target_tokens, special_tokens=True) 30 | return num_added 31 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import io 2 | import os 3 | 4 | from setuptools import find_packages, setup 5 | 6 | for line in open("codec_bpe/__init__.py"): 7 | line = line.strip() 8 | if "__version__" in line: 9 | context = {} 10 | exec(line, context) 11 | VERSION = context["__version__"] 12 | 13 | 14 | def read(*paths, **kwargs): 15 | with io.open(os.path.join(os.path.dirname(__file__), *paths), encoding=kwargs.get("encoding", "utf8")) as open_file: 16 | content = open_file.read().strip() 17 | return content 18 | 19 | 20 | def read_requirements(path): 21 | return [line.strip() for line in read(path).split("\n") if not line.startswith(('"', "#", "-", "git+"))] 22 | 23 | 
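# (Editor's aside, illustrative -- not part of the repository source:) the
# extras_require mapping in setup() below exposes each optional codec backend
# as a pip extra, so a user can install, e.g.:
#   pip install codec-bpe              (core dependencies only)
#   pip install codec-bpe[xcodec2]     (adds the XCodec2 requirements)
# assuming the "codec-bpe" distribution name declared below.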
24 | setup(
25 |     name="codec-bpe",
26 |     version=VERSION,
27 |     author="Abraham Sanders",
28 |     author_email="abraham.sanders@gmail.com",
29 |     description="Implementation of Acoustic BPE (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs",
30 |     url="https://github.com/AbrahamSanders/codec-bpe",
31 |     long_description=read("README.md"),
32 |     long_description_content_type="text/markdown",
33 |     packages=find_packages(),
34 |     install_requires=read_requirements("requirements.txt"),
35 |     extras_require={
36 |         "funcodec": read_requirements("requirements_funcodec.txt"),
37 |         "xcodec2": read_requirements("requirements_xcodec2.txt"),
38 |         "wavtokenizer": read_requirements("requirements_wavtokenizer.txt"),
39 |         "simvq": read_requirements("requirements_simvq.txt"),
40 |         "magicodec": read_requirements("requirements_magicodec.txt"),
41 |         "neucodec": read_requirements("requirements_neucodec.txt"),
42 |     },
43 | )
--------------------------------------------------------------------------------
/codec_bpe/core/utils.py:
--------------------------------------------------------------------------------
1 | from typing import List, Optional, Union
2 | from argparse import Namespace
3 | import json
4 | import os
5 | 
6 | def get_codes_files(
7 |     codes_path: str,
8 |     codes_filter: Optional[Union[str, List[str]]] = None,
9 |     num_files: Optional[int] = None,
10 | ) -> List[str]:
11 |     return get_files(codes_path, ".npy", codes_filter, num_files)
12 | 
13 | def get_files(
14 |     path: str,
15 |     extension: str,
16 |     filter: Optional[Union[str, List[str]]] = None,
17 |     num_files: Optional[int] = None,
18 | ) -> List[str]:
19 |     if isinstance(filter, str):
20 |         filter = [filter]
21 |     result_files = []
22 |     for root, _, files in os.walk(path):
23 |         for file in files:
24 |             file_path = os.path.join(root, file)
25 |             if not file_path.endswith(extension):
26 |                 continue
27 |             if filter and not any([f in file_path for f in filter]):
28 |                 continue
29 |             result_files.append(file_path)
30 |     result_files.sort()
31 |     if num_files is not None:
32 |         result_files = result_files[:num_files]
33 |     return result_files
34 | 
35 | def get_codec_info(codes_path: str) -> dict:
36 |     codec_info_file = os.path.join(codes_path, "codec_info.json")
37 |     if not os.path.exists(codec_info_file):
38 |         return None
39 |     with open(codec_info_file, "r") as f:
40 |         codec_info = json.load(f)
41 |     return codec_info
42 | 
43 | def update_args_from_codec_info(args: Namespace, codec_info: dict) -> Namespace:
44 |     if codec_info is not None:
45 |         if "num_codebooks" in args and args.num_codebooks is None:
46 |             args.num_codebooks = codec_info["num_codebooks"]
47 |         if "codebook_size" in args and args.codebook_size is None:
48 |             args.codebook_size = codec_info["codebook_size"]
49 |         if "codec_framerate" in args and args.codec_framerate is None:
50 |             args.codec_framerate = codec_info["framerate"]
51 |     return args
--------------------------------------------------------------------------------
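(Editor's note, illustrative.) The `codec_info.json` consumed above is written into the codes directory by `codec_bpe.audio_to_codes`, so that downstream scripts can omit `--num_codebooks`, `--codebook_size`, and `--codec_framerate`. A sketch of its shape -- the field names follow the consumers above, while the values shown are an assumption, roughly what EnCodec 24 kHz at 1.5 kbps would produce (2 codebooks of 1024 codes at 75 Hz):

{
    "codec_type": "encodec",
    "num_codebooks": 2,
    "codebook_size": 1024,
    "framerate": 75.0
}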
/codec_bpe/lm_dataset_stats.py:
--------------------------------------------------------------------------------
1 | from tqdm import tqdm
2 | import numpy as np
3 | import argparse
4 | 
5 | from .core.utils import get_codec_info, update_args_from_codec_info
6 | 
7 | if __name__ == "__main__":
8 |     parser = argparse.ArgumentParser(description="Compute statistics for a plain-text codec BPE dataset")
9 |     parser.add_argument("--dataset_path", type=str, required=True)
10 |     parser.add_argument("--codes_path", type=str)
11 |     parser.add_argument("--num_codebooks", type=int, default=None)
12 |     parser.add_argument("--codec_framerate", type=float, default=None)
13 |     parser.add_argument("--audio_start_token", type=str)
14 |     parser.add_argument("--audio_end_token", type=str)
15 |     parser.add_argument("--num_examples", type=int, default=None)
16 |     args = parser.parse_args()
17 | 
18 |     if args.codes_path is not None:
19 |         codec_info = get_codec_info(args.codes_path)
20 |         update_args_from_codec_info(args, codec_info)
21 |     if args.num_codebooks is None or args.codec_framerate is None:
22 |         error_cause = "codec_info.json does not exist in --codes_path" if args.codes_path is not None else "--codes_path is not specified"
23 |         raise ValueError(f"{error_cause} so you must specify --num_codebooks and --codec_framerate manually.")
24 | 
25 |     lengths = []
26 |     with open(args.dataset_path, encoding="utf-8") as f:
27 |         for i, line in tqdm(enumerate(f), desc="Examples"):
28 |             if i == args.num_examples:
29 |                 break
30 |             line = line.rstrip()
31 |             if args.audio_start_token is not None:
32 |                 line = line.removeprefix(args.audio_start_token)  # removeprefix, not lstrip: lstrip treats the token as a character *set*
33 |             if args.audio_end_token is not None:
34 |                 line = line.removesuffix(args.audio_end_token)  # likewise, removesuffix rather than rstrip
35 |             if line[0] == "<":
36 |                 line = line.replace("<", "").replace(">", "")
37 |             num_units = len(line) / args.num_codebooks
38 |             num_seconds = num_units / args.codec_framerate
39 |             lengths.append(num_seconds)
40 |     total_seconds = np.sum(lengths)
41 | 
42 |     print(f"{len(lengths)} examples")
43 |     print(f"Total: {total_seconds:.2f} seconds ({(total_seconds / 3600):.2f} hours)")
44 |     print(f"Max: {np.max(lengths):.2f} seconds")
45 |     print(f"Min: {np.min(lengths):.2f} seconds")
46 |     print(f"Median: {np.median(lengths):.2f} seconds")
47 |     print(f"Mean: {np.mean(lengths):.2f} seconds")
48 |     print(f"Std: {np.std(lengths):.2f} seconds")
49 | 
--------------------------------------------------------------------------------
/codec_bpe/train_tokenizer.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import functools
3 | 
4 | from .core.trainer import Trainer
5 | from .core.utils import get_codec_info, update_args_from_codec_info
6 | from .
import UNICODE_OFFSET 7 | 8 | if __name__ == "__main__": 9 | parser = argparse.ArgumentParser(description="Train a codec BPE tokenizer from numpy files containing audio codes") 10 | parser.add_argument("--codes_path", type=str, required=True) 11 | parser.add_argument("--num_codebooks", type=int, default=None) 12 | parser.add_argument("--codebook_size", type=int, default=None) 13 | parser.add_argument("--codec_framerate", type=float, default=None) 14 | parser.add_argument("--chunk_size_secs", type=int, default=None) 15 | parser.add_argument("--vocab_size", type=int, default=30000) 16 | parser.add_argument("--min_frequency", type=int, default=2) 17 | parser.add_argument("--special_tokens", nargs="+", default=None) 18 | parser.add_argument("--bos_token", type=str) 19 | parser.add_argument("--eos_token", type=str) 20 | parser.add_argument("--unk_token", type=str) 21 | parser.add_argument("--pad_token", type=str) 22 | parser.add_argument("--max_token_codebook_ngrams", type=int, default=None) 23 | # handle hex values for unicode_offset with argparse: https://stackoverflow.com/a/25513044 24 | parser.add_argument("--unicode_offset", type=functools.partial(int, base=0), default=UNICODE_OFFSET) 25 | parser.add_argument("--save_path", type=str) 26 | parser.add_argument("--codes_filter", type=str, nargs="+") 27 | parser.add_argument("--num_files", type=int, default=None) 28 | args = parser.parse_args() 29 | 30 | codec_info = get_codec_info(args.codes_path) 31 | update_args_from_codec_info(args, codec_info) 32 | if args.num_codebooks is None or args.codebook_size is None: 33 | raise ValueError( 34 | "codec_info.json does not exist in --codes_path so you must specify --num_codebooks and --codebook_size manually." 35 | ) 36 | 37 | codec_type = codec_info["codec_type"] if codec_info is not None else "codec" 38 | if args.save_path is None: 39 | args.save_path = f"output/{codec_type}_bpe_{args.num_codebooks}cb_{round(args.vocab_size/1000)}k" 40 | 41 | trainer = Trainer( 42 | args.num_codebooks, 43 | args.codebook_size, 44 | args.codec_framerate, 45 | args.chunk_size_secs, 46 | args.vocab_size, 47 | args.min_frequency, 48 | args.special_tokens, 49 | args.bos_token, 50 | args.eos_token, 51 | args.unk_token, 52 | args.pad_token, 53 | args.max_token_codebook_ngrams, 54 | args.unicode_offset, 55 | ) 56 | tokenizer = trainer.train(args.codes_path, args.codes_filter, args.num_files) 57 | tokenizer.save_pretrained(args.save_path) 58 | -------------------------------------------------------------------------------- /codec_bpe/prep_lm_dataset.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import functools 4 | from tqdm import tqdm 5 | from transformers import AutoTokenizer 6 | 7 | from .tools.lm_dataset_builder import LMDatasetBuilder 8 | from .core.utils import get_codec_info, update_args_from_codec_info 9 | from . 
import UNICODE_OFFSET 10 | 11 | if __name__ == "__main__": 12 | parser = argparse.ArgumentParser( 13 | description="Use numpy files containing audio codes to construct a plain-text codec BPE dataset suitable for language modeling" 14 | ) 15 | parser.add_argument("--tokenizer", type=str, required=True) 16 | parser.add_argument("--codes_path", type=str, required=True) 17 | parser.add_argument("--num_codebooks", type=int, default=None) 18 | parser.add_argument("--codebook_size", type=int, default=None) 19 | parser.add_argument("--audio_start_token", type=str) 20 | parser.add_argument("--audio_end_token", type=str) 21 | # handle hex values for unicode_offset with argparse: https://stackoverflow.com/a/25513044 22 | parser.add_argument("--unicode_offset", type=functools.partial(int, base=0), default=UNICODE_OFFSET) 23 | parser.add_argument("--sequence_length", type=int, default=4096) 24 | parser.add_argument("--overlap_length", type=int, default=1024) 25 | parser.add_argument("--drop_last", action="store_true") 26 | parser.add_argument("--save_path", type=str, default="output/lm_dataset.txt") 27 | parser.add_argument("--codes_filter", type=str, nargs="+") 28 | parser.add_argument("--num_examples", type=int, default=None) 29 | args = parser.parse_args() 30 | 31 | codec_info = get_codec_info(args.codes_path) 32 | update_args_from_codec_info(args, codec_info) 33 | if args.num_codebooks is None or args.codebook_size is None: 34 | raise ValueError( 35 | "codec_info.json does not exist in --codes_path so you must specify --num_codebooks and --codebook_size manually." 36 | ) 37 | 38 | tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) 39 | 40 | lm_dataset_builder = LMDatasetBuilder( 41 | tokenizer=tokenizer, 42 | num_codebooks=args.num_codebooks, 43 | codebook_size=args.codebook_size, 44 | audio_start_token=args.audio_start_token, 45 | audio_end_token=args.audio_end_token, 46 | unicode_offset=args.unicode_offset, 47 | sequence_length=args.sequence_length, 48 | overlap_length=args.overlap_length, 49 | drop_last=args.drop_last, 50 | ) 51 | 52 | save_dir = os.path.dirname(args.save_path) 53 | if save_dir: 54 | os.makedirs(save_dir, exist_ok=True) 55 | 56 | with open(args.save_path, "w", encoding="utf-8") as f: 57 | for i, example in tqdm(enumerate(lm_dataset_builder.iterate_examples(args.codes_path, args.codes_filter)), desc="Examples"): 58 | if i == args.num_examples: 59 | break 60 | f.write(example) 61 | f.write("\n") 62 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 110 | .pdm.toml 111 | .pdm-python 112 | .pdm-build/ 113 | 114 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 115 | __pypackages__/ 116 | 117 | # Celery stuff 118 | celerybeat-schedule 119 | celerybeat.pid 120 | 121 | # SageMath parsed files 122 | *.sage.py 123 | 124 | # Environments 125 | .env 126 | .venv 127 | env/ 128 | venv/ 129 | ENV/ 130 | env.bak/ 131 | venv.bak/ 132 | 133 | # Spyder project settings 134 | .spyderproject 135 | .spyproject 136 | 137 | # Rope project settings 138 | .ropeproject 139 | 140 | # mkdocs documentation 141 | /site 142 | 143 | # mypy 144 | .mypy_cache/ 145 | .dmypy.json 146 | dmypy.json 147 | 148 | # Pyre type checker 149 | .pyre/ 150 | 151 | # pytype static type analyzer 152 | .pytype/ 153 | 154 | # Cython debug symbols 155 | cython_debug/ 156 | 157 | # PyCharm 158 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 159 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 160 | # and can be added to the global gitignore or merged into this file. For a more nuclear 161 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 
162 | #.idea/
163 | 
164 | # VSCode
165 | .vscode/
166 | output/
167 | audio/
168 | codes/
--------------------------------------------------------------------------------
/CHANGELOG.md:
--------------------------------------------------------------------------------
1 | **2025-12-01**
2 | - Added support for [NeuCodec](https://huggingface.co/neuphonic/neucodec), a new high-quality single-level codec with a 50 Hz framerate! NeuCodec extends XCodec2 with inference speedups and a commercially permissive license. Use `--codec_model neuphonic/neucodec` when encoding audio with `codec_bpe.audio_to_codes` to encode using the NeuCodec model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
3 | 
4 | **2025-06-22**
5 | - Added support for [MagiCodec](https://github.com/Ereboas/MagiCodec), a new **streaming** single-level codec with a 50 Hz framerate! Use `--codec_model MagiCodec-50Hz-Base` when encoding audio with `codec_bpe.audio_to_codes` to encode using the MagiCodec model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
6 | 
7 | **2025-06-19**
8 | - Added ability to encode audio into subsecond chunk sizes with a sliding window of prior audio as context. This helps support use-cases where the encoded audio should simulate a streaming setting. For example, many codecs will encode the same audio differently depending on the encoder's receptive field size - even with native streaming codecs like Mimi. So, when training a streaming speech-to-text audio LM, we want to encode the training audio in tiny chunks so that it resembles what will be received during live streaming. This helps prevent throwing the model out of distribution at inference time.
9 |   - Use the `--chunk_size_secs` and `--context_secs` parameters with `codec_bpe.audio_to_codes` to configure this.
10 |   - By default `--chunk_size_secs=30` and `--context_secs=0.0` for non-streaming usage.
11 |   - `--context_secs` controls the sliding window encoding size, which is useful to avoid codec degradation at tiny chunk sizes. For example, `--chunk_size_secs=0.08` with `--context_secs=0.4` will encode audio in chunks of 80ms, each chunk receiving the previous 320ms of audio as context to the encoder's receptive field (we encode 320 + 80 = 400ms of audio at a time but only keep the final 80ms of codes). See the sketch below.
12 | 
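*(Editor's illustration -- not part of CHANGELOG.md.)* A streaming-style run equivalent to `python -m codec_bpe.audio_to_codes --codec_model kyutai/mimi --chunk_size_secs 0.08 --context_secs 0.4` can also be sketched against the package's `AudioEncoder` directly; the choice of Mimi and the paths here are assumptions:

```python
from codec_bpe.tools.audio_encoder import AudioEncoder

encoder = AudioEncoder(
    "kyutai/mimi",         # any supported codec; Mimi is natively streaming
    chunk_size_secs=0.08,  # keep only 80ms of codes per encoding step
    context_secs=0.4,      # left-pad each chunk with the prior 320ms of audio
)
encoder.encode_audio("audio", "output/codes")  # audio dir in, .npy codes dir out
```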
13 | **2025-06-16**
14 | - Added support for [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) and [SimVQ](https://github.com/youngsheen/SimVQ)! Both are single-level codecs that share the same architecture but differ in their VQ strategy. WavTokenizer comes in 40Hz and 75Hz variants with a vocabulary size of 4096. SimVQ variants have a 75Hz framerate with vocabulary sizes ranging from 4096 to 262144 codes. SimVQ also features a causal encoder and partially causal decoder, making it suitable for streaming use cases.
15 |   - Use `--codec_model WavTokenizer-large-320-24k-4096` (or any other from the `Model` column on [this table](README.md#supported-codecs)) with `codec_bpe.audio_to_codes` to encode audio using WavTokenizer.
16 |   - Use `--codec_model simvq_4k` (or any other from the `Model` column on [this table](README.md#supported-codecs)) with `codec_bpe.audio_to_codes` to encode audio using SimVQ.
17 |   - See [here](README.md#train-a-tokenizer-from-audio-files) for usage examples.
18 | 
19 | **2025-04-07**
20 | - Added support for [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2), a high-quality multilingual single-level codec with a 50 Hz framerate! Use `--codec_model HKUSTAudio/xcodec2` when encoding audio with `codec_bpe.audio_to_codes` to encode using the XCodec2 model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
21 | 
22 | **2025-03-09**
23 | - Added support for [FunCodec](https://funcodec.github.io/) from Alibaba DAMO Speech Lab! Use `--codec_model alibaba-damo/...` when encoding audio with `codec_bpe.audio_to_codes` to encode using the FunCodec model. Model paths on the HuggingFace hub are listed [here](https://github.com/modelscope/FunCodec?tab=readme-ov-file#available-models). See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
24 | 
25 | **2024-09-20**
26 | - Added support for Kyutai Lab's [Mimi codec](https://huggingface.co/kyutai/mimi), an amazing new codec with a 12.5 Hz framerate! Use `--codec_model kyutai/mimi` when encoding audio with `codec_bpe.audio_to_codes` to encode using the Mimi model. See [here](README.md#train-a-tokenizer-from-audio-files) for a usage example.
27 | 
28 | **2024-09-19**
29 | - Initial Release!
--------------------------------------------------------------------------------
/codec_bpe/core/sentencepiece_bpe.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, Iterator, List, Optional, Tuple, Union
2 | 
3 | from tokenizers import AddedToken, Tokenizer, decoders, pre_tokenizers, trainers
4 | from tokenizers.models import BPE
5 | 
6 | from tokenizers.implementations.base_tokenizer import BaseTokenizer
7 | 
8 | 
9 | class SentencePieceBPETokenizer(BaseTokenizer):
10 |     """SentencePiece BPE Tokenizer
11 | 
12 |     Represents the BPE algorithm, with the pretokenization used by SentencePiece
13 | 
14 |     ------------------------------------------------------------------------------------------------
15 |     Adapted from:
16 |     https://github.com/huggingface/tokenizers/blob/v0.19.1/bindings/python/py_src/tokenizers/implementations/sentencepiece_bpe.py
17 | 
18 |     Changes:
19 |     (1) Removed NFKC Unicode normalization: We're using unicode characters as a base alphabet and their content is arbitrary
20 |         for our purpose, so we don't need to normalize them.
21 |     (2) Added the max_token_length parameter for BpeTrainer to the `train` and `train_from_iterator` methods.
22 | ------------------------------------------------------------------------------------------------ 23 | """ 24 | 25 | def __init__( 26 | self, 27 | vocab: Optional[Union[str, Dict[str, int]]] = None, 28 | merges: Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] = None, 29 | unk_token: Union[str, AddedToken] = "", 30 | replacement: str = "▁", 31 | add_prefix_space: bool = True, 32 | dropout: Optional[float] = None, 33 | fuse_unk: Optional[bool] = False, 34 | ): 35 | if vocab is not None and merges is not None: 36 | tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) 37 | else: 38 | tokenizer = Tokenizer(BPE(dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk)) 39 | 40 | if tokenizer.token_to_id(str(unk_token)) is not None: 41 | tokenizer.add_special_tokens([str(unk_token)]) 42 | 43 | prepend_scheme = "always" if add_prefix_space else "never" 44 | tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme) 45 | tokenizer.decoder = decoders.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme) 46 | 47 | parameters = { 48 | "model": "SentencePieceBPE", 49 | "unk_token": unk_token, 50 | "replacement": replacement, 51 | "add_prefix_space": add_prefix_space, 52 | "dropout": dropout, 53 | } 54 | 55 | super().__init__(tokenizer, parameters) 56 | 57 | @staticmethod 58 | def from_file(vocab_filename: str, merges_filename: str, **kwargs): 59 | vocab, merges = BPE.read_file(vocab_filename, merges_filename) 60 | return SentencePieceBPETokenizer(vocab, merges, **kwargs) 61 | 62 | def train( 63 | self, 64 | files: Union[str, List[str]], 65 | vocab_size: int = 30000, 66 | min_frequency: int = 2, 67 | special_tokens: List[Union[str, AddedToken]] = [""], 68 | limit_alphabet: int = 1000, 69 | initial_alphabet: List[str] = [], 70 | max_token_length: Optional[int] = None, 71 | show_progress: bool = True, 72 | ): 73 | """Train the model using the given files""" 74 | 75 | trainer = trainers.BpeTrainer( 76 | vocab_size=vocab_size, 77 | min_frequency=min_frequency, 78 | special_tokens=special_tokens, 79 | limit_alphabet=limit_alphabet, 80 | initial_alphabet=initial_alphabet, 81 | max_token_length=max_token_length, 82 | show_progress=show_progress, 83 | ) 84 | if isinstance(files, str): 85 | files = [files] 86 | self._tokenizer.train(files, trainer=trainer) 87 | 88 | def train_from_iterator( 89 | self, 90 | iterator: Union[Iterator[str], Iterator[Iterator[str]]], 91 | vocab_size: int = 30000, 92 | min_frequency: int = 2, 93 | special_tokens: List[Union[str, AddedToken]] = [""], 94 | limit_alphabet: int = 1000, 95 | initial_alphabet: List[str] = [], 96 | max_token_length: Optional[int] = None, 97 | show_progress: bool = True, 98 | length: Optional[int] = None, 99 | ): 100 | """Train the model using the given iterator""" 101 | 102 | trainer = trainers.BpeTrainer( 103 | vocab_size=vocab_size, 104 | min_frequency=min_frequency, 105 | special_tokens=special_tokens, 106 | limit_alphabet=limit_alphabet, 107 | initial_alphabet=initial_alphabet, 108 | max_token_length=max_token_length, 109 | show_progress=show_progress, 110 | ) 111 | self._tokenizer.train_from_iterator( 112 | iterator, 113 | trainer=trainer, 114 | length=length, 115 | ) 116 | -------------------------------------------------------------------------------- /codec_bpe/tools/lm_dataset_builder.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, Union, Iterator, List 2 | 
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast 3 | from tqdm import tqdm 4 | import numpy as np 5 | import re 6 | 7 | from ..core.converter import codes_to_chars, validate_unicode_offset, UNICODE_OFFSET 8 | from ..core.utils import get_codes_files 9 | 10 | class LMDatasetBuilder: 11 | def __init__( 12 | self, 13 | tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], 14 | num_codebooks: int, 15 | codebook_size: int, 16 | audio_start_token: Optional[str] = None, 17 | audio_end_token: Optional[str] = None, 18 | unicode_offset: int = UNICODE_OFFSET, 19 | sequence_length: int = 4096, 20 | overlap_length: int = 1024, 21 | drop_last: bool = False, 22 | ): 23 | self.tokenizer = tokenizer 24 | self.num_codebooks = num_codebooks 25 | self.codebook_size = codebook_size 26 | self.unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 27 | self.sequence_length = sequence_length 28 | self.overlap_length = overlap_length 29 | self.drop_last = drop_last 30 | 31 | self.audio_start_token_id = None 32 | if audio_start_token is not None: 33 | self.audio_start_token_id = self.tokenizer.convert_tokens_to_ids(audio_start_token) 34 | if self.audio_start_token_id is None: 35 | raise ValueError(f"Token '{audio_start_token}' not found in tokenizer") 36 | self.audio_end_token_id = None 37 | if audio_end_token is not None: 38 | self.audio_end_token_id = self.tokenizer.convert_tokens_to_ids(audio_end_token) 39 | if self.audio_end_token_id is None: 40 | raise ValueError(f"Token '{audio_end_token}' not found in tokenizer") 41 | 42 | def _group_codes_files(self, codes_files: List[str]) -> List[List[str]]: 43 | grouped_codes_files = [] 44 | last_file_root = None 45 | for codes_file in codes_files: 46 | file_root = re.match(r"(.+)_c\d+[_.]", codes_file).group(1) 47 | if file_root != last_file_root: 48 | grouped_codes_files.append([]) 49 | last_file_root = file_root 50 | grouped_codes_files[-1].append(codes_file) 51 | return grouped_codes_files 52 | 53 | def iterate_examples(self, codes_path: str, codes_filter: Optional[Union[str, List[str]]] = None) -> Iterator[str]: 54 | codes_files = get_codes_files(codes_path, codes_filter) 55 | # group codes files by root filename (minus channel and starting timestamp) 56 | grouped_codes_files = self._group_codes_files(codes_files) 57 | for file_group in tqdm(grouped_codes_files, desc="Codes file groups"): 58 | # concatenate all codes files in each group 59 | codes = np.concatenate([np.load(file) for file in file_group], axis=-1) 60 | if len(codes.shape) == 4: 61 | codes = codes[0, 0] 62 | elif len(codes.shape) == 3: 63 | codes = codes[0] 64 | codes = codes[:self.num_codebooks] 65 | # convert to unicode string 66 | chars = codes_to_chars( 67 | codes, 68 | self.codebook_size, 69 | copy_before_conversion=False, 70 | unicode_offset=self.unicode_offset, 71 | ) 72 | # encode the unicode string with the tokenizer 73 | tokens = self.tokenizer.encode(chars, return_tensors="np")[0] 74 | sequence_length = self.sequence_length 75 | if self.tokenizer.bos_token_id is not None and tokens[0] == self.tokenizer.bos_token_id: 76 | tokens = tokens[1:] 77 | sequence_length -= 1 78 | if self.tokenizer.eos_token_id is not None and tokens[-1] == self.tokenizer.eos_token_id: 79 | tokens = tokens[:-1] 80 | sequence_length -= 1 81 | if self.audio_start_token_id is not None: 82 | sequence_length -= 1 83 | if self.audio_end_token_id is not None: 84 | sequence_length -= 1 85 | # yield examples from the sequence with the specified sequence length and 
overlap 86 | start = 0 87 | while True: 88 | end = start + sequence_length 89 | if self.drop_last and end > len(tokens): 90 | break 91 | example_tokens = tokens[start:end] 92 | # add audio start and end tokens if specified 93 | if self.audio_start_token_id is not None: 94 | example_tokens = np.concatenate([[self.audio_start_token_id], example_tokens]) 95 | if self.audio_end_token_id is not None: 96 | example_tokens = np.concatenate([example_tokens, [self.audio_end_token_id]]) 97 | example = self.tokenizer.decode(example_tokens) 98 | yield example 99 | if end >= len(tokens): 100 | break 101 | start = end - self.overlap_length -------------------------------------------------------------------------------- /codec_bpe/core/converter.py: -------------------------------------------------------------------------------- 1 | """ 2 | Converter utility for converting discrete codec codes to and from unicode characters used for BPE tokenization. 3 | """ 4 | from typing import List, Optional, Union, Tuple 5 | import logging 6 | import numpy as np 7 | import torch 8 | 9 | logger = logging.getLogger(__name__) 10 | 11 | UNICODE_OFFSET: int = 0x4E00 12 | """Original unicode offset from the Acoustic BPE paper (Shen et al., 2024)""" 13 | UNICODE_OFFSET_LARGE: int = 0xE000 14 | """For very large codebook size (e.g. > 32768), use this higher unicode offset to avoid running into surrogates 15 | which are not printable and won't work with BPE tokenization.""" 16 | 17 | def codes_to_chars( 18 | codes: Union[List[List[int]], np.ndarray, torch.Tensor], 19 | codebook_size: int, 20 | copy_before_conversion: bool = True, 21 | unicode_offset: int = UNICODE_OFFSET, 22 | ) -> str: 23 | if isinstance(codes, list): 24 | codes = np.array(codes) 25 | copy_before_conversion = False 26 | elif isinstance(codes, torch.Tensor): 27 | codes = codes.cpu().numpy() 28 | if len(codes.shape) != 2: 29 | raise ValueError("codes must be a 2D array of shape (num_codebooks, seq_length).") 30 | unicode_offset = validate_unicode_offset(unicode_offset, codes.shape[0], codebook_size) 31 | if copy_before_conversion: 32 | codes = codes.copy() 33 | for i in range(codes.shape[0]): 34 | codes[i] += unicode_offset + i*codebook_size 35 | codes = codes.T.reshape(-1) 36 | chars = "".join([chr(c) for c in codes]) 37 | return chars 38 | 39 | def chars_to_codes( 40 | chars: str, 41 | num_codebooks: int, 42 | codebook_size: int, 43 | drop_inconsistent_codes: bool = True, 44 | drop_hanging_codes: bool = True, 45 | return_hanging_codes_chars: bool = False, 46 | return_tensors: Optional[str] = None, 47 | unicode_offset: int = UNICODE_OFFSET, 48 | ) -> Union[List[List[int]], np.ndarray, torch.Tensor]: 49 | unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 50 | codes = np.array([ord(c) for c in chars]) 51 | if drop_inconsistent_codes: 52 | codes = _drop_inconsistent_codes(codes, num_codebooks, codebook_size, unicode_offset) 53 | if drop_hanging_codes: 54 | codes, begin_hanging, end_hanging = _drop_hanging_codes(codes, num_codebooks, codebook_size, unicode_offset) 55 | codes = codes.reshape(-1, num_codebooks).T 56 | for i in range(codes.shape[0]): 57 | codes[i] -= unicode_offset + i*codebook_size 58 | if return_tensors is None: 59 | codes = codes.tolist() 60 | elif return_tensors == "pt": 61 | codes = torch.tensor(codes) 62 | if return_hanging_codes_chars: 63 | begin_hanging = "".join([chr(c) for c in begin_hanging]) 64 | end_hanging = "".join([chr(c) for c in end_hanging]) 65 | return codes, begin_hanging, end_hanging 66 | 
return codes 67 | 68 | def validate_unicode_offset(unicode_offset: int, num_codebooks: int, codebook_size: int) -> int: 69 | # If the range [unicode_offset, unicode_offset+num_codebooks*codebook_size) intersects with the 70 | # surrogate range [0xD800, 0xDFFF], then we need to use the large unicode offset. 71 | lower = unicode_offset 72 | upper = unicode_offset + num_codebooks * codebook_size 73 | surrogate_lower = 0xD800 74 | surrogate_upper = 0xDFFF 75 | if lower < surrogate_upper and upper > surrogate_lower: 76 | raise ValueError( 77 | f"You are using unicode offset {hex(unicode_offset)}, however your base vocabulary size (num_codebooks x codebook_size) " 78 | f"is {num_codebooks*codebook_size} which will intersect with the non-printable surrogate range 0xD800-0xDFFF if starting from this offset.\n" 79 | f"To avoid this issue, use a unicode offset starting after the surrogate range, such as {hex(UNICODE_OFFSET_LARGE)}." 80 | ) 81 | return unicode_offset 82 | 83 | def _resolve_codebook(code: int, num_codebooks: int, codebook_size: int, unicode_offset: int) -> int: 84 | codebook = num_codebooks-1 85 | while codebook > -1 and code < unicode_offset + codebook*codebook_size: 86 | codebook -= 1 87 | return codebook 88 | 89 | def _drop_inconsistent_codes( 90 | codes: np.ndarray, 91 | num_codebooks: int, 92 | codebook_size: int, 93 | unicode_offset: int, 94 | ) -> np.ndarray: 95 | mask = np.ones_like(codes, dtype=bool) 96 | expected_codebook = _resolve_codebook(codes[0], num_codebooks, codebook_size, unicode_offset) 97 | if expected_codebook < 0: 98 | expected_codebook = 0 99 | for i in range(len(codes)): 100 | # figure out which codebook the character belongs to 101 | actual_codebook = _resolve_codebook(codes[i], num_codebooks, codebook_size, unicode_offset) 102 | # mark it to be dropped if it doesn't match the expected codebook 103 | if actual_codebook != expected_codebook: 104 | mask[i] = False 105 | logger.warning( 106 | f"Dropped inconsistent audio code at position {i}. " 107 | f"Expected codebook {expected_codebook} but got codebook {actual_codebook}." 
108 | ) 109 | else: 110 | expected_codebook = (expected_codebook + 1) % num_codebooks 111 | codes = codes[mask] 112 | return codes 113 | 114 | def _drop_hanging_codes( 115 | codes: np.ndarray, 116 | num_codebooks: int, 117 | codebook_size: int, 118 | unicode_offset: int, 119 | ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: 120 | # first check for hanging codes at the beginning 121 | begin_hanging = [] 122 | while len(codes) > 0: 123 | actual_codebook = _resolve_codebook(codes[0], num_codebooks, codebook_size, unicode_offset) 124 | if actual_codebook == 0: 125 | break 126 | begin_hanging.append(codes[0]) 127 | codes = codes[1:] 128 | logger.info(f"Dropped hanging audio code (codebook {actual_codebook}) at beginning of sequence.") 129 | # then check for hanging codes at the end 130 | end_hanging = [] 131 | while len(codes) > 0: 132 | actual_codebook = _resolve_codebook(codes[-1], num_codebooks, codebook_size, unicode_offset) 133 | if actual_codebook == num_codebooks-1: 134 | break 135 | end_hanging.append(codes[-1]) 136 | codes = codes[:-1] 137 | logger.info(f"Dropped hanging audio code (codebook {actual_codebook}) at end of sequence.") 138 | begin_hanging = np.array(begin_hanging) 139 | end_hanging = np.array(end_hanging)[::-1] 140 | return codes, begin_hanging, end_hanging 141 | 142 | 143 | -------------------------------------------------------------------------------- /codec_bpe/audio_to_codes.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import json 3 | import os 4 | 5 | from .tools.audio_encoder import AudioEncoder, CodecTypes, SUPPORTED_EXTENSIONS 6 | 7 | if __name__ == "__main__": 8 | parser = argparse.ArgumentParser( 9 | description="Convert audio files to numpy files containing audio codes using a Codec" 10 | ) 11 | parser.add_argument( 12 | "--audio_path", 13 | type=str, 14 | default="audio", 15 | help="Directory containing the audio files", 16 | ) 17 | parser.add_argument( 18 | "--codes_path", 19 | type=str, 20 | default="output/codes", 21 | help="Directory to save the numpy codes files", 22 | ) 23 | parser.add_argument( 24 | "--chunk_size_secs", 25 | type=float, 26 | default=30.0, help="Chunk size in seconds", 27 | ) 28 | parser.add_argument( 29 | "--context_secs", 30 | type=float, 31 | default=0.0, 32 | help=( 33 | "Context size in seconds for encoding (default: 0.0, no context). " 34 | "If set, chunks will be left-padded with max(0, context_secs-chunk_size_secs) " 35 | "seconds of previous audio, while only chunk_size_secs worth of codes will be saved. " 36 | "This is useful for codecs that require context for better encoding quality at " 37 | "very small chunk sizes." 38 | ), 39 | ) 40 | parser.add_argument( 41 | "--batch_size", 42 | type=int, 43 | default=1, 44 | help="Number of audio chunks to process in a single batch", 45 | ) 46 | parser.add_argument( 47 | "--codec_type", 48 | type=str, 49 | choices=list(CodecTypes), 50 | default=None, 51 | help="Type of codec to use for encoding. None to infer the type from --codec_model.", 52 | ) 53 | parser.add_argument( 54 | "--codec_model", 55 | type=str, 56 | default="facebook/encodec_24khz", 57 | help="Codec model path on the HuggingFace Model Hub.", 58 | ) 59 | parser.add_argument( 60 | "--bandwidth", 61 | type=float, 62 | default=None, 63 | help=( 64 | "Bandwidth for encoding. Only applies if --codec_type is 'encodec' or 'funcodec'. " 65 | "Values may be provided in kbps (e.g. 1.5) or in bps (e.g. 1500)." 
66 | "For FunCodec, valid ranges for this parameter are listed in the 'Bitrate' column at " 67 | "https://github.com/modelscope/FunCodec?tab=readme-ov-file#available-models. " 68 | "For EnCodec, valid values are 1.5, 3.0, 6.0, 12.0, and 24.0 (kpbs). " 69 | "None uses the max bandwidth with FunCodec and the min bandwidth with EnCodec." 70 | ), 71 | ) 72 | parser.add_argument( 73 | "--n_quantizers", 74 | type=int, 75 | default=None, 76 | help=( 77 | "Number of quantizers (codebooks) to use for encoding. None to use all quantizers. " 78 | "Only applies if --codec_type is 'dac' or 'mimi'." 79 | ), 80 | ) 81 | parser.add_argument( 82 | "--stereo", 83 | action="store_true", 84 | help="Encode stereo audio channels separately instead of converting to mono", 85 | ) 86 | parser.add_argument( 87 | "--file_per_chunk", 88 | action="store_true", 89 | help=( 90 | "Save each audio chunk as a separate numpy file with the start timestamp (secs) in the filename " 91 | "instead of the default behavior of concatenating all chunks into a single numpy file corresponding " 92 | "to the original audio file." 93 | ), 94 | ) 95 | parser.add_argument( 96 | "--extensions", 97 | nargs="+", 98 | default=SUPPORTED_EXTENSIONS, 99 | help="Audio file extensions to convert. Formats must be supported by a librosa backend.", 100 | ) 101 | parser.add_argument( 102 | "--audio_filter", 103 | nargs="+", 104 | help=( 105 | "Audio file filters. If provided, file paths must match one of the filters to be converted." 106 | ) 107 | ) 108 | parser.add_argument( 109 | "--overwrite", 110 | action="store_true", 111 | help=( 112 | "Overwrite existing numpy codes directories. If not set, audio corresponding to existing " 113 | "numpy codes directories will be skipped." 114 | ), 115 | ) 116 | parser.add_argument( 117 | "--codec_info_only", 118 | action="store_true", 119 | help="Only write codec info and do not convert any audio files.", 120 | ) 121 | args = parser.parse_args() 122 | 123 | codec_name_for_path = args.codec_model.split("/")[-1] 124 | codec_setting_for_path = f"{args.chunk_size_secs}s_{args.context_secs}s" 125 | args.codes_path = os.path.join( 126 | args.codes_path, codec_name_for_path, codec_setting_for_path, "stereo" if args.stereo else "mono" 127 | ) 128 | 129 | audio_encoder = AudioEncoder( 130 | args.codec_model, 131 | codec_type=args.codec_type, 132 | chunk_size_secs=args.chunk_size_secs, 133 | context_secs=args.context_secs, 134 | batch_size=args.batch_size, 135 | bandwidth=args.bandwidth, 136 | n_quantizers=args.n_quantizers, 137 | stereo=args.stereo, 138 | file_per_chunk=args.file_per_chunk, 139 | ) 140 | 141 | codec_info = audio_encoder.get_codec_info() 142 | 143 | # iterate and convert 144 | if args.codec_info_only: 145 | os.makedirs(args.codes_path, exist_ok=True) 146 | else: 147 | result = audio_encoder.encode_audio( 148 | args.audio_path, 149 | args.codes_path, 150 | extensions=args.extensions, 151 | audio_filter=args.audio_filter, 152 | overwrite=args.overwrite, 153 | ) 154 | # Print summary 155 | print(f"Attempted to convert {result.num_audio_files} audio files:") 156 | print(f"{result.num_audio_files-len(result.errored_audio_files)} Succeeded.") 157 | print(f"{len(result.errored_audio_files)} Errored.") 158 | print(f"{result.num_numpy_files} numpy files created.") 159 | print(f"{result.num_skipped_dirs} directories skipped.") 160 | if result.errored_audio_files: 161 | print("\nErrored files:") 162 | for file in result.errored_audio_files: 163 | print(file) 164 | 165 | # write codec info to the base codes 
directory 166 | codec_info_path = os.path.join(args.codes_path, "codec_info.json") 167 | with open(codec_info_path, "w") as f: 168 | json.dump(codec_info, f, indent=4) 169 | print("\nCodec info written.") 170 | print("\nDone.") 171 | -------------------------------------------------------------------------------- /codec_bpe/core/trainer.py: -------------------------------------------------------------------------------- 1 | from typing import Optional, List, Union, Iterator 2 | import warnings 3 | import numpy as np 4 | from tokenizers import AddedToken 5 | from transformers import PreTrainedTokenizerFast 6 | 7 | from .sentencepiece_bpe import SentencePieceBPETokenizer 8 | from .converter import codes_to_chars, validate_unicode_offset, UNICODE_OFFSET 9 | from .utils import get_codes_files 10 | 11 | class Trainer: 12 | def __init__( 13 | self, 14 | num_codebooks: int, 15 | codebook_size: int, 16 | codec_framerate: Optional[float] = None, 17 | chunk_size_secs: Optional[int] = None, 18 | vocab_size: int = 30000, 19 | min_frequency: int = 2, 20 | special_tokens: Optional[List[Union[str, AddedToken]]] = None, 21 | bos_token: Optional[str] = None, 22 | eos_token: Optional[str] = None, 23 | unk_token: Optional[str] = None, 24 | pad_token: Optional[str] = None, 25 | max_token_codebook_ngrams: Optional[int] = None, 26 | unicode_offset: int = UNICODE_OFFSET, 27 | ): 28 | if chunk_size_secs is not None: 29 | if codec_framerate is None: 30 | raise ValueError("If chunk_size_secs is set, codec_framerate must also be set.") 31 | if chunk_size_secs < 1: 32 | raise ValueError("chunk_size_secs must be a positive integer >= 1.") 33 | if eos_token is None and pad_token is None: 34 | raise ValueError( 35 | "Either pad_token or eos_token should be set, otherwise padded batching will not work with this tokenizer." 36 | ) 37 | if max_token_codebook_ngrams is not None and max_token_codebook_ngrams < 0: 38 | raise ValueError("max_token_codebook_ngrams must be a non-negative integer (0 or greater).") 39 | 40 | self.num_codebooks = num_codebooks 41 | self.codebook_size = codebook_size 42 | self.codec_framerate = codec_framerate 43 | self.chunk_size_secs = chunk_size_secs 44 | self.vocab_size = vocab_size 45 | self.min_frequency = min_frequency 46 | self.special_tokens = special_tokens 47 | self.bos_token = bos_token 48 | self.eos_token = eos_token 49 | self.unk_token = unk_token 50 | self.pad_token = pad_token 51 | self.max_token_codebook_ngrams = max_token_codebook_ngrams 52 | self.unicode_offset = validate_unicode_offset(unicode_offset, num_codebooks, codebook_size) 53 | 54 | if self.special_tokens is None: 55 | self.special_tokens = [] 56 | for special_token in [self.eos_token, self.bos_token, self.unk_token, self.pad_token]: 57 | if special_token is not None and special_token not in self.special_tokens: 58 | self.special_tokens.insert(0, special_token) 59 | 60 | min_vocab_size = self.num_codebooks*self.codebook_size + len(self.special_tokens) 61 | if self.vocab_size < min_vocab_size: 62 | raise ValueError( 63 | f"vocab_size is set to {self.vocab_size} but it must be at least {min_vocab_size} to accommodate " 64 | f"{self.num_codebooks} x {self.codebook_size} codes and {len(self.special_tokens)} special token(s).\n" 65 | f"Consider setting vocab_size to {min_vocab_size} + K, where K is the number of tokens you want to " 66 | "reserve for codebook ngrams (learned merges). K should be a sufficiently large number (e.g. 
>= 10,000) " 67 | "to allow for wide coverage of the most common codebook ngrams in your training data." 68 | ) 69 | 70 | def _iterate_and_convert(self, codes_files: List[str]) -> Iterator[str]: 71 | for codes_file in codes_files: 72 | codes = np.load(codes_file) 73 | if len(codes.shape) == 4: 74 | codes = codes[0, 0] 75 | elif len(codes.shape) == 3: 76 | codes = codes[0] 77 | codes = codes[:self.num_codebooks] 78 | chunk_size = int(self.chunk_size_secs * self.codec_framerate) if self.chunk_size_secs else codes.shape[1] 79 | for i in range(0, codes.shape[1], chunk_size): 80 | chars = codes_to_chars( 81 | codes[:, i:i+chunk_size], 82 | self.codebook_size, 83 | copy_before_conversion=False, 84 | unicode_offset=self.unicode_offset, 85 | ) 86 | yield chars 87 | 88 | def train( 89 | self, 90 | codes_path: str, 91 | codes_filter: Optional[Union[str, List[str]]] = None, 92 | num_files: Optional[int] = None, 93 | ) -> SentencePieceBPETokenizer: 94 | # Compute base alphabet. This should be num_codebooks * codebook_size so that we never split a codeword 95 | # into smaller units. 96 | initial_alphabet = [ 97 | chr(i) for i in range( 98 | self.unicode_offset, 99 | self.unicode_offset + self.num_codebooks * self.codebook_size 100 | ) 101 | ] 102 | 103 | # If max_token_codebook_ngrams is set, we need to limit the token length to avoid creating tokens that are larger than 104 | # that number of codebook ngrams. A codebook ngram is a sequence of length num_codebooks with one codeword taken from 105 | # each codebook, representing a complete acoustic unit. 106 | # For example if num_codebooks = 4 and max_token_codebook_ngrams = 5, the maximum token length would be 20. 107 | max_token_length = None 108 | if self.max_token_codebook_ngrams is not None: 109 | max_token_length = max(1, self.max_token_codebook_ngrams * self.num_codebooks) 110 | 111 | # Train tokenizer 112 | if max_token_length == 1: 113 | # We don't need to actually train the tokenizer here, just create one with the initial alphabet. 114 | codes_iterator = [] 115 | else: 116 | codes_files = get_codes_files(codes_path, codes_filter, num_files) 117 | if not self.chunk_size_secs and codes_files[0].split("_")[-1].startswith("c"): 118 | warnings.warn( 119 | "The codes files do not have start timestamps, indicating they represent full-length encoded audio files rather than chunks. " 120 | "It is recommended to set `--chunk_size_secs` to a small value (e.g. 30) to avoid the tokenizer training on very long sequences. " 121 | "Training on very long sequences of audio codes can lead to memory issues and poor BPE merges." 122 | ) 123 | codes_iterator = self._iterate_and_convert(codes_files) 124 | # the +1 is because max_token_length is exclusive (e.g., max_token_length of n yields an actual max token length of n-1). 125 | # not sure if this is a bug in Tokenizers or intended behavior. 
126 | max_token_length = max_token_length + 1 if max_token_length is not None else None 127 | 128 | tokenizer = SentencePieceBPETokenizer(unk_token=self.unk_token, add_prefix_space=False) 129 | tokenizer.train_from_iterator( 130 | codes_iterator, 131 | vocab_size=self.vocab_size, 132 | min_frequency=self.min_frequency, 133 | special_tokens=self.special_tokens, 134 | limit_alphabet=len(initial_alphabet), 135 | initial_alphabet=initial_alphabet, 136 | max_token_length=max_token_length, 137 | ) 138 | tokenizer = PreTrainedTokenizerFast( 139 | tokenizer_object=tokenizer, 140 | bos_token=self.bos_token, 141 | eos_token=self.eos_token, 142 | unk_token=self.unk_token, 143 | pad_token=self.pad_token, 144 | clean_up_tokenization_spaces=False, 145 | model_input_names=['input_ids', 'attention_mask'], 146 | ) 147 | return tokenizer 148 | -------------------------------------------------------------------------------- /codec_bpe/tools/codec_utils.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn.functional as F 3 | import numpy as np 4 | import os 5 | from enum import Enum 6 | from typing import Tuple, Union, List 7 | from transformers.feature_extraction_utils import BatchFeature, FeatureExtractionMixin 8 | 9 | MAGICODEC_MODELS = { 10 | "magicodec-50hz-base": { 11 | "ckpt": { 12 | "repo_id": "Ereboas/MagiCodec_16k_50hz", 13 | "filename": "MagiCodec-50Hz-Base.ckpt", 14 | }, 15 | }, 16 | } 17 | 18 | WAVTOKENIZER_MODELS = { 19 | "wavtokenizer-small-600-24k-4096": { 20 | "config": { 21 | "repo_id": "novateur/WavTokenizer", 22 | "filename": "wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 23 | }, 24 | "ckpt": { 25 | "repo_id": "novateur/WavTokenizer", 26 | "filename": "WavTokenizer_small_600_24k_4096.ckpt" 27 | }, 28 | }, 29 | "wavtokenizer-small-320-24k-4096": { 30 | "config": { 31 | "repo_id": "novateur/WavTokenizer", 32 | "filename": "wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 33 | }, 34 | "ckpt": { 35 | "repo_id": "novateur/WavTokenizer", 36 | "filename": "WavTokenizer_small_320_24k_4096.ckpt" 37 | }, 38 | }, 39 | "wavtokenizer-medium-speech-320-24k-4096": { 40 | "config": { 41 | "repo_id": "novateur/WavTokenizer-medium-speech-75token", 42 | "filename": "wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 43 | }, 44 | "ckpt": { 45 | "repo_id": "novateur/WavTokenizer-medium-speech-75token", 46 | "filename": "wavtokenizer_medium_speech_320_24k_v2.ckpt" 47 | }, 48 | }, 49 | "wavtokenizer-medium-music-audio-320-24k-4096": { 50 | "config": { 51 | "repo_id": "novateur/WavTokenizer-medium-music-audio-75token", 52 | "filename": "wavtokenizer_mediumdata_music_audio_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 53 | }, 54 | "ckpt": { 55 | "repo_id": "novateur/WavTokenizer-medium-music-audio-75token", 56 | "filename": "wavtokenizer_medium_music_audio_320_24k_v2.ckpt" 57 | }, 58 | }, 59 | "wavtokenizer-large-600-24k-4096": { 60 | "config": { 61 | "repo_id": "novateur/WavTokenizer", 62 | "filename": "wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 63 | }, 64 | "ckpt": { 65 | "repo_id": "novateur/WavTokenizer-large-unify-40token", 66 | "filename": "wavtokenizer_large_unify_600_24k.ckpt" 67 | }, 68 | }, 69 | "wavtokenizer-large-320-24k-4096": { 70 | "config": { 71 | "repo_id": "novateur/WavTokenizer", 72 | "filename": "wavtokenizer_smalldata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml", 73 | }, 74 | "ckpt": { 75 | 
"repo_id": "novateur/WavTokenizer-large-speech-75token", 76 | "filename": "wavtokenizer_large_speech_320_v2.ckpt" 77 | }, 78 | } 79 | } 80 | 81 | SIMVQ_MODELS = ["simvq_4k", "simvq_8k", "simvq_65k", "simvq_262k"] 82 | 83 | class CodecTypes(Enum): 84 | ENCODEC = "encodec" 85 | DAC = "dac" 86 | MIMI = "mimi" 87 | FUNCODEC = "funcodec" 88 | XCODEC2 = "xcodec2" 89 | WAVTOKENIZER = "wavtokenizer" 90 | SIMVQ = "simvq" 91 | MAGICODEC = "magicodec" 92 | NEUCODEC = "neucodec" 93 | 94 | @classmethod 95 | def try_get_codec_type(cls, codec_model): 96 | codec_model = codec_model.lower() 97 | if "audio_codec" in codec_model: 98 | return cls.FUNCODEC 99 | if "encodec" in codec_model: 100 | return cls.ENCODEC 101 | if "dac" in codec_model: 102 | return cls.DAC 103 | if "mimi" in codec_model: 104 | return cls.MIMI 105 | if "xcodec2" in codec_model: 106 | return cls.XCODEC2 107 | if "wavtokenizer" in codec_model: 108 | return cls.WAVTOKENIZER 109 | if "simvq" in codec_model: 110 | return cls.SIMVQ 111 | if "magicodec" in codec_model: 112 | return cls.MAGICODEC 113 | if "neucodec" in codec_model: 114 | return cls.NEUCODEC 115 | raise ValueError(f"Could not infer codec type from codec model: {codec_model}. Please specify --codec_type.") 116 | 117 | def __str__(self): 118 | return self.value 119 | def __eq__(self, value): 120 | return str(self) == value 121 | 122 | class DefaultProcessor: 123 | def __call__(self, raw_audio: Union[np.ndarray, List[np.ndarray]], sampling_rate: int, return_tensors: str = "pt") -> BatchFeature: 124 | if not isinstance(raw_audio, list): 125 | raw_audio = [raw_audio] 126 | # Process audio to get padded input tensor 127 | max_audio_len = max([audio.shape[-1] for audio in raw_audio]) 128 | batch_tensors = [F.pad(torch.from_numpy(audio), (0, max_audio_len-audio.shape[-1])) for audio in raw_audio] 129 | inputs = BatchFeature( 130 | data={"input_values": torch.stack(batch_tensors).unsqueeze(1).float()}, 131 | tensor_type=return_tensors, 132 | ) 133 | return inputs 134 | 135 | def load_funcodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 136 | from funcodec.bin.codec_inference import Speech2Token 137 | from huggingface_hub import snapshot_download 138 | cache_path = snapshot_download(codec_model) 139 | config_file = os.path.join(cache_path, "config.yaml") 140 | model_pth = os.path.join(cache_path, "model.pth") 141 | model = Speech2Token(config_file, model_pth, device=str(device)) 142 | model.eval() 143 | processor = DefaultProcessor() 144 | sr_enc = sr_dec = model.model_args.sampling_rate 145 | return model, processor, sr_enc, sr_dec 146 | 147 | def load_xcodec2_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 148 | from huggingface_hub import hf_hub_download 149 | from xcodec2.modeling_xcodec2 import XCodec2Model 150 | from xcodec2.configuration_bigcodec import BigCodecConfig 151 | from safetensors import safe_open 152 | ckpt_path = hf_hub_download(repo_id=codec_model, filename="model.safetensors") 153 | ckpt = {} 154 | with safe_open(ckpt_path, framework="pt", device="cpu") as f: 155 | for k in f.keys(): 156 | ckpt[k.replace(".beta", ".bias")] = f.get_tensor(k) 157 | codec_config = BigCodecConfig.from_pretrained(codec_model) 158 | model = XCodec2Model.from_pretrained(None, config=codec_config, state_dict=ckpt) 159 | model = model.eval().to(device) 160 | processor = DefaultProcessor() 161 | sr_enc = sr_dec = model.feature_extractor.sampling_rate 162 | 
return model, processor, sr_enc, sr_dec 163 | 164 | def load_wavtokenizer_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 165 | # add `WavTokenizer` directory to the import path. 166 | # TODO: get rid of this if a proper WavTokenizer package is ever released. 167 | if not os.path.exists("WavTokenizer"): 168 | raise ValueError( 169 | "WavTokenizer not found in your working directory. Please clone the WavTokenizer repository: " 170 | "`git clone https://github.com/jishengpeng/WavTokenizer.git`" 171 | ) 172 | import sys 173 | sys.path.append("WavTokenizer") 174 | from huggingface_hub import hf_hub_download 175 | from decoder.pretrained import WavTokenizer 176 | if codec_model.lower() not in WAVTOKENIZER_MODELS: 177 | raise ValueError(f"Unsupported wavtokenizer model: {codec_model}. Supported models: {list(WAVTOKENIZER_MODELS)}") 178 | model_info = WAVTOKENIZER_MODELS[codec_model.lower()] 179 | config_file = hf_hub_download(**model_info["config"]) 180 | model_ckpt = hf_hub_download(**model_info["ckpt"]) 181 | model = WavTokenizer.from_pretrained0802(config_file, model_ckpt).to(device) 182 | processor = DefaultProcessor() 183 | sr_enc = sr_dec = model.feature_extractor.encodec.sample_rate 184 | return model, processor, sr_enc, sr_dec 185 | 186 | def load_simvq_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 187 | # add `SimVQ` directory to the import path. 188 | # TODO: get rid of this if a proper SimVQ package is ever released. 189 | if not os.path.exists("SimVQ"): 190 | raise ValueError( 191 | "SimVQ not found in your working directory. Please clone the SimVQ repository: " 192 | "`git clone https://github.com/youngsheen/SimVQ.git`" 193 | ) 194 | import sys 195 | sys.path.append("SimVQ") 196 | import importlib 197 | from huggingface_hub import hf_hub_download 198 | from omegaconf import OmegaConf 199 | if codec_model.lower() not in SIMVQ_MODELS: 200 | raise ValueError(f"Unsupported SimVQ model: {codec_model}. Supported models: {SIMVQ_MODELS}") 201 | config_file = hf_hub_download(repo_id="youngsheen/SimVQ", filename=f"vq_audio_log/{codec_model.lower()}/1second/config.yaml") 202 | model_ckpt = hf_hub_download(repo_id="youngsheen/SimVQ", filename=f"vq_audio_log/{codec_model.lower()}/epoch=49-step=138600.ckpt") 203 | config = OmegaConf.load(config_file) 204 | module, cls = config.model.class_path.rsplit(".", 1) 205 | cls_init = getattr(importlib.import_module(module, package=None), cls) 206 | model = cls_init(**config.model.init_args) 207 | sd = torch.load(model_ckpt, map_location="cpu")["state_dict"] 208 | model.load_state_dict(sd, strict=False) 209 | model = model.eval().to(device) 210 | processor = DefaultProcessor() 211 | sr_enc = sr_dec = config.model.init_args.sample_rate 212 | return model, processor, sr_enc, sr_dec 213 | 214 | def load_magicodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 215 | # add `MagiCodec` directory to the import path. 216 | # TODO: get rid of this if a proper MagiCodec package is ever released. 217 | if not os.path.exists("MagiCodec"): 218 | raise ValueError( 219 | "MagiCodec not found in your working directory. 
Please clone the MagiCodec repository: " 220 | "`git clone https://github.com/Ereboas/MagiCodec.git`" 221 | ) 222 | import sys 223 | sys.path.append("MagiCodec") 224 | from huggingface_hub import hf_hub_download 225 | from codec.generator import Generator 226 | if codec_model.lower() not in MAGICODEC_MODELS: 227 | raise ValueError(f"Unsupported magicodec model: {codec_model}. Supported models: {list(MAGICODEC_MODELS)}") 228 | model_info = MAGICODEC_MODELS[codec_model.lower()] 229 | model_ckpt = hf_hub_download(**model_info["ckpt"]) 230 | model = Generator(token_hz=50) 231 | state_dict = torch.load(model_ckpt, map_location='cpu') 232 | model.load_state_dict(state_dict, strict=False) 233 | model = model.eval().to(device) 234 | processor = DefaultProcessor() 235 | sr_enc = sr_dec = model.sample_rate 236 | return model, processor, sr_enc, sr_dec 237 | 238 | def load_neucodec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, DefaultProcessor, int, int]: 239 | if "distill" in codec_model.lower(): 240 | from neucodec import DistillNeuCodec 241 | model = DistillNeuCodec.from_pretrained(codec_model) 242 | else: 243 | from neucodec import NeuCodec 244 | model = NeuCodec.from_pretrained(codec_model) 245 | model = model.eval().to(device) 246 | processor = DefaultProcessor() 247 | sr_enc = model.feature_extractor.sampling_rate 248 | sr_dec = model.sample_rate 249 | return model, processor, sr_enc, sr_dec 250 | 251 | def load_transformers_codec_model(codec_model: str, device: Union[str, torch.device]) -> Tuple[torch.nn.Module, FeatureExtractionMixin, int, int]: 252 | from transformers import AutoModel, AutoProcessor 253 | model = AutoModel.from_pretrained(codec_model).to(device) 254 | processor = AutoProcessor.from_pretrained(codec_model) 255 | sr_enc = sr_dec = model.config.sampling_rate 256 | return model, processor, sr_enc, sr_dec 257 | 258 | def load_codec_model( 259 | codec_type: CodecTypes, 260 | codec_model: str, 261 | device: Union[str, torch.device], 262 | ) -> Tuple[torch.nn.Module, Union[DefaultProcessor, FeatureExtractionMixin], int, int]: 263 | if codec_type == CodecTypes.FUNCODEC: 264 | return load_funcodec_model(codec_model, device) 265 | elif codec_type == CodecTypes.XCODEC2: 266 | return load_xcodec2_model(codec_model, device) 267 | elif codec_type == CodecTypes.WAVTOKENIZER: 268 | return load_wavtokenizer_model(codec_model, device) 269 | elif codec_type == CodecTypes.SIMVQ: 270 | return load_simvq_model(codec_model, device) 271 | elif codec_type == CodecTypes.MAGICODEC: 272 | return load_magicodec_model(codec_model, device) 273 | elif codec_type == CodecTypes.NEUCODEC: 274 | return load_neucodec_model(codec_model, device) 275 | else: 276 | return load_transformers_codec_model(codec_model, device) -------------------------------------------------------------------------------- /codec_bpe/tools/audio_encoder.py: -------------------------------------------------------------------------------- 1 | import librosa 2 | import os 3 | import shutil 4 | import numpy as np 5 | import torch 6 | import math 7 | from typing import Optional, List, Union, Tuple, Dict 8 | from tqdm import tqdm 9 | 10 | from .codec_utils import CodecTypes, load_codec_model 11 | 12 | SUPPORTED_EXTENSIONS = [".mp3", ".wav", ".flac", ".opus"] 13 | 14 | class AudioEncodeResult: 15 | def __init__(self): 16 | self.num_audio_files = 0 17 | self.num_numpy_files = 0 18 | self.num_skipped_dirs = 0 19 | self.errored_audio_files = [] 20 | 21 | class AudioEncoder: 22 | def __init__( 23 | self, 24 | 
codec_model: str, 25 | codec_type: Optional[CodecTypes] = None, 26 | device: Optional[Union[str, torch.device]] = None, 27 | chunk_size_secs: float = 30.0, 28 | context_secs: float = 0.0, 29 | batch_size: int = 1, 30 | bandwidth: Optional[float] = None, 31 | n_quantizers: Optional[int] = None, 32 | stereo: bool = False, 33 | file_per_chunk: bool = False, 34 | ): 35 | self.codec_model = codec_model 36 | self.codec_type = codec_type 37 | if self.codec_type is None: 38 | self.codec_type = CodecTypes.try_get_codec_type(self.codec_model) 39 | self.device = device 40 | if self.device is None: 41 | self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 42 | elif isinstance(self.device, str): 43 | self.device = torch.device(self.device) 44 | self.chunk_size_secs = chunk_size_secs 45 | self.context_secs = context_secs 46 | self.batch_size = batch_size 47 | if self.batch_size > 1 and self.codec_type in [CodecTypes.XCODEC2, CodecTypes.NEUCODEC]: 48 | raise ValueError("XCodec2 and NeuCodec only support batch size 1 for now.") 49 | self.bandwidth = bandwidth 50 | # support bandwidth in kbps or bps 51 | if self.bandwidth is not None: 52 | if self.codec_type == CodecTypes.FUNCODEC and self.bandwidth <= 16.0: 53 | self.bandwidth *= 1000 54 | if self.codec_type == CodecTypes.ENCODEC and self.bandwidth > 24.0: 55 | self.bandwidth /= 1000 56 | self.n_quantizers = n_quantizers 57 | self.stereo = stereo 58 | self.file_per_chunk = file_per_chunk 59 | 60 | # load the codec model 61 | self.model, self.processor, self.sr_enc, self.sr_dec = load_codec_model(self.codec_type, self.codec_model, self.device) 62 | self.chunk_size_samples = int(self.chunk_size_secs * self.sr_enc) 63 | self.context_samples = int(max(0.0, self.context_secs-self.chunk_size_secs) * self.sr_enc) 64 | 65 | def _encode_batch(self, batch: List[np.ndarray]) -> Tuple[torch.Tensor, float]: 66 | # Process audio to get padded input tensor 67 | inputs = self.processor(raw_audio=batch, sampling_rate=self.sr_enc, return_tensors="pt") 68 | if self.codec_type != CodecTypes.NEUCODEC: 69 | inputs = inputs.to(self.device) 70 | input_values = inputs.input_values 71 | 72 | # Encode the batch 73 | with torch.no_grad(): 74 | if self.codec_type == CodecTypes.FUNCODEC: 75 | encoded_batch, _, _, _ = self.model( 76 | input_values, 77 | bit_width=int(self.bandwidth) if self.bandwidth is not None else None, 78 | run_mod="encode", 79 | ) 80 | # Permute dimensions to match expected format 81 | audio_codes = torch.permute(encoded_batch[0], (1, 0, 2)) 82 | elif self.codec_type == CodecTypes.XCODEC2: 83 | input_values = input_values.squeeze(1) 84 | audio_codes = self.model.encode_code(input_values, sample_rate=self.sr_enc) 85 | elif self.codec_type == CodecTypes.WAVTOKENIZER: 86 | input_values = input_values.squeeze(1) 87 | bandwidth_id = torch.tensor([0]).to(self.device) 88 | _, audio_codes = self.model.encode_infer(input_values, bandwidth_id=bandwidth_id) 89 | # Permute dimensions to match expected format 90 | audio_codes = torch.permute(audio_codes, (1, 0, 2)) 91 | elif self.codec_type == CodecTypes.SIMVQ: 92 | _, _, audio_codes, _ = self.model.encode(input_values) 93 | audio_codes = audio_codes.view(input_values.shape[0], 1, -1) 94 | elif self.codec_type == CodecTypes.MAGICODEC: 95 | with torch.autocast( 96 | device_type = "cuda", 97 | dtype = torch.bfloat16, 98 | enabled = self.device.type == "cuda" and torch.cuda.is_bf16_supported(), 99 | ): 100 | x = self.model.pad_audio(input_values) 101 | z_e = self.model.encoder(x) 102 | _, audio_codes = 
self.model.quantizer.inference(z_e)
103 |                     audio_codes = audio_codes.unsqueeze(1)
104 |             elif self.codec_type == CodecTypes.NEUCODEC:
105 |                 audio_codes = self.model.encode_code(input_values)
106 |             else:
107 |                 encode_kwargs = {}
108 |                 if self.codec_type == CodecTypes.DAC:
109 |                     encode_kwargs["n_quantizers"] = self.n_quantizers
110 |                 elif self.codec_type == CodecTypes.MIMI:
111 |                     encode_kwargs["num_quantizers"] = self.n_quantizers
112 |                 elif self.codec_type == CodecTypes.ENCODEC:
113 |                     encode_kwargs["bandwidth"] = self.bandwidth
114 |                 outputs = self.model.encode(**inputs, **encode_kwargs)
115 |                 audio_codes = outputs.audio_codes
116 | 
117 |         samples_per_frame = math.ceil(input_values.shape[-1] / audio_codes.shape[-1])
118 |         return audio_codes, samples_per_frame
119 | 
120 |     def _process_batch(
121 |         self, 
122 |         batch: List[np.ndarray], 
123 |         batch_info: List[Tuple[str, str, int, float, bool]], 
124 |         encoded_file_chunks: List[List[np.ndarray]],
125 |     ) -> Tuple[int, List[str]]:
126 |         errored_files = []
127 |         num_numpy_files = 0
128 |         if not batch:
129 |             return num_numpy_files, errored_files
130 | 
131 |         try:
132 |             audio_codes, samples_per_frame = self._encode_batch(batch)
133 | 
134 |             # Save the non-padded part of the encoded audio
135 |             batch_dim = 1 if self.codec_type == CodecTypes.ENCODEC else 0
136 |             for i, (file_path, numpy_root, channel, start_secs, end_of_file) in enumerate(batch_info):
137 |                 encoded_chunk = audio_codes.select(batch_dim, i).unsqueeze(batch_dim)
138 |                 context_len = math.ceil(self.context_samples / samples_per_frame)
139 |                 non_padded_len = math.ceil(batch[i].shape[-1] / samples_per_frame)
140 |                 encoded_chunk = encoded_chunk[..., context_len:non_padded_len]
141 | 
142 |                 # Save encoded chunk to numpy file
143 |                 if not self.file_per_chunk:
144 |                     encoded_file_chunks[channel].append(encoded_chunk.cpu().numpy())
145 |                 if self.file_per_chunk or end_of_file:
146 |                     file_name_noext = os.path.basename(os.path.splitext(file_path)[0])
147 |                     start_secs_whole = int(start_secs)
148 |                     start_secs_ms = round((start_secs - start_secs_whole) * 1000)
149 |                     timestamp_slot = f"_t{start_secs_whole:06d}_{start_secs_ms:03d}" if self.file_per_chunk else ""
150 |                     numpy_filepath = os.path.join(numpy_root, f"{file_name_noext}_c{channel}{timestamp_slot}.npy")
151 |                     os.makedirs(os.path.dirname(numpy_filepath), exist_ok=True)
152 |                     if self.file_per_chunk:
153 |                         np.save(numpy_filepath, encoded_chunk.cpu().numpy(), allow_pickle=False)
154 |                     else:
155 |                         encoded_file = np.concatenate(encoded_file_chunks[channel], axis=-1)
156 |                         np.save(numpy_filepath, encoded_file, allow_pickle=False)
157 |                         encoded_file_chunks[channel].clear()
158 |                     num_numpy_files += 1
159 | 
160 |         except Exception as e:
161 |             print(f"Error encoding batch: {e}")
162 |             errored_files.extend(set([info[0] for info in batch_info]))
163 | 
164 |         return num_numpy_files, errored_files
165 | 
166 |     def encode_audio(
167 |         self, 
168 |         audio_path: str, 
169 |         codes_path: str, 
170 |         extensions: List[str] = SUPPORTED_EXTENSIONS,
171 |         audio_filter: Optional[Union[str, List[str]]] = None,
172 |         overwrite: bool = False,
173 |     ) -> AudioEncodeResult:
174 |         # traverse the audio directory recursively and convert in each subdirectory containing
175 |         # audio files with the specified extensions
176 |         if isinstance(audio_filter, str):
177 |             audio_filter = [audio_filter]
178 |         result = AudioEncodeResult()
179 |         batch = []
180 |         batch_info = []
181 |         encoded_file_chunks = [[], []] if self.stereo else [[]]
182 |         for root, _, files in os.walk(audio_path):
183 |             files = 
sorted([os.path.join(root, f) for f in files if os.path.splitext(f)[1] in extensions]) 184 | if audio_filter: 185 | files = [f for f in files if any([filter_ in f for filter_ in audio_filter])] 186 | if len(files) == 0: 187 | continue 188 | numpy_root = root.replace(audio_path, codes_path) 189 | if os.path.exists(numpy_root): 190 | if overwrite: 191 | shutil.rmtree(numpy_root) 192 | else: 193 | print(f"Skipping {root} because {numpy_root} already exists.") 194 | result.num_skipped_dirs += 1 195 | continue 196 | print(f"Converting in {root}...") 197 | for file_path in tqdm(files, desc="Files"): 198 | result.num_audio_files += 1 199 | try: 200 | # Load the audio file 201 | audio, _ = librosa.load(file_path, sr=self.sr_enc, mono=not self.stereo) 202 | except Exception as e: 203 | print(f"Error loading {file_path}: {e}") 204 | result.errored_audio_files.append(file_path) 205 | continue 206 | 207 | # Encode it in chunks of size chunk_size_secs on each channel independently 208 | start = 0 209 | while True: 210 | end = start + self.chunk_size_samples 211 | end_of_file = end >= audio.shape[-1] 212 | start_with_context = start - self.context_samples 213 | audio_chunk = audio[..., max(0, start_with_context):end] 214 | if audio_chunk.ndim == 1: 215 | audio_chunk = np.expand_dims(audio_chunk, axis=0) 216 | if start_with_context < 0: 217 | # if we are at the beginning of the audio, pad with a silent context 218 | audio_chunk = np.pad(audio_chunk, ((0, 0), (-start_with_context, 0)), mode='constant') 219 | for channel in range(audio_chunk.shape[0]): 220 | batch.append(audio_chunk[channel]) 221 | batch_info.append((file_path, numpy_root, channel, start / self.sr_enc, end_of_file)) 222 | 223 | # Process batch if it reaches the specified size 224 | if len(batch) == self.batch_size: 225 | num_numpy_files, errored_files = self._process_batch(batch, batch_info, encoded_file_chunks) 226 | result.num_numpy_files += num_numpy_files 227 | result.errored_audio_files.extend(errored_files) 228 | batch.clear() 229 | batch_info.clear() 230 | 231 | if end_of_file: 232 | break 233 | start = end 234 | 235 | # Process any remaining chunks in the batch 236 | if batch: 237 | num_numpy_files, errored_files = self._process_batch(batch, batch_info, encoded_file_chunks) 238 | result.num_numpy_files += num_numpy_files 239 | result.errored_audio_files.extend(errored_files) 240 | 241 | result.errored_audio_files = sorted(set(result.errored_audio_files)) 242 | return result 243 | 244 | def get_codec_info(self) -> Dict[str, Union[str, int, float]]: 245 | # encode ten seconds of audio and get the number of codebooks and framerate 246 | dummy_audio = np.zeros(10 * self.sr_enc) 247 | audio_codes, samples_per_frame = self._encode_batch([dummy_audio]) 248 | # get stats 249 | if self.codec_type == CodecTypes.FUNCODEC: 250 | codebook_size = self.model.model_args.quantizer_conf["codebook_size"] 251 | elif self.codec_type == CodecTypes.XCODEC2: 252 | codebook_size = 65536 253 | elif self.codec_type == CodecTypes.WAVTOKENIZER: 254 | codebook_size = self.model.feature_extractor.encodec.quantizer.bins 255 | elif self.codec_type == CodecTypes.SIMVQ: 256 | codebook_size = self.model.quantize.n_e 257 | elif self.codec_type == CodecTypes.MAGICODEC: 258 | codebook_size = self.model.codebook_size 259 | elif self.codec_type == CodecTypes.NEUCODEC: 260 | codebook_size = 65536 261 | else: 262 | codebook_size = self.model.config.codebook_size 263 | 264 | # write codec info to json 265 | codec_info = { 266 | "codec_type": str(self.codec_type), 267 | 
"codec_model": self.codec_model, 268 | "sampling_rate_encoder": self.sr_enc, 269 | "sampling_rate_decoder": self.sr_dec, 270 | "num_codebooks": audio_codes.shape[-2], 271 | "codebook_size": codebook_size, 272 | "framerate": self.sr_enc / samples_per_frame, 273 | } 274 | return codec_info 275 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # codec-bpe 2 | ![codec_bpe.png](img/codec_bpe.png) 3 | 4 | Codec BPE is an implementation of [Acoustic BPE](https://arxiv.org/abs/2310.14580) (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs such as [EnCodec](https://github.com/facebookresearch/encodec) (Défossez et al., 2022), [DAC](https://github.com/descriptinc/descript-audio-codec) (Kumar et al., 2023), [Mimi](https://huggingface.co/kyutai/mimi) (Défossez et al., 2024), and [FunCodec](https://funcodec.github.io/) (Du et al., 2024). Built on top of the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library. 5 | 6 | Codec BPE flattens multi-level codes from Residual Vector Quantizers (RVQ) and converts them into unicode strings for tokenization into compressed token sequences. For example, a single Codec BPE token might represent a 4-gram of codes from 4 codebooks representing a single acoustic unit, a 6-gram comprising a whole acoustic unit and half of the next one, or even an 8-gram represnting two whole acoustic units. Depending on the codec, vocab size and type of audio, this can yield savings of 2-5x in sequence length compared to directly modeling the flattened codebooks. 7 | 8 | Codec BPE can also be used with single-level codecs such as [XCodec2](https://github.com/zhenye234/X-Codec-2.0) (Ye et al., 2025), [WavTokenizer](https://github.com/jishengpeng/WavTokenizer) (Ji et al., 2024), [SimVQ](https://github.com/youngsheen/SimVQ) (Zhu et al., 2024), [MagiCodec](https://github.com/Ereboas/MagiCodec) (Song et al., 2025), and [NeuCodec](https://github.com/neuphonic/neucodec) (Julian et al., 2025). In this case, a single Codec BPE token could represent one or more codes where each code represents a whole acoustic unit. 9 | 10 | **Using Codec BPE allows efficient audio language modeling with multi-level codecs to be done with vanilla LLM architectures, meaning no custom architecture is needed to deal with modeling the RVQ. Your model will already be compatible with the full ecosystem of training and inference tools available for [HuggingFace Transformers](https://github.com/huggingface/transformers), such as [vLLM](https://github.com/vllm-project/vllm) and [Ollama](https://ollama.com/)!** 11 | 12 | ## 🚀 Updates 13 | **2025-12-01** 14 | - Added support for [NeuCodec](https://huggingface.co/neuphonic/neucodec), a new high-quality single-level codec with a 50 Hz framerate! NeuCodec extends XCodec2 with inference speedups, an upsampling decoder, and a commercially permissive license. Use `--codec_model neuphonic/neucodec` when encoding audio with `codec_bpe.audio_to_codes` to encode using the NeuCodec model. See [here](#train-a-tokenizer-from-audio-files) for a usage example. 15 | 16 | **2025-06-22** 17 | - Added support for [MagiCodec](https://github.com/Ereboas/MagiCodec), a new **streaming** single-level codec with a 50 Hz framerate! Use `--codec_model MagiCodec-50Hz-Base` when encoding audio with `codec_bpe.audio_to_codes` to encode using the MagiCodec model. See [here](#train-a-tokenizer-from-audio-files) for a usage example. 
18 | 19 | **Older updates** 20 | - See [CHANGELOG.md](CHANGELOG.md) for a complete list of updates. 21 | 22 | ## Setup 23 | ```bash 24 | pip install codec-bpe 25 | ``` 26 | If you want to use the `--codec_type funcodec` or `--codec_model alibaba-damo/...` options with `codec_bpe.audio_to_codes`, run: 27 | ```bash 28 | pip install codec-bpe[funcodec] 29 | ``` 30 | If you want to use the `--codec_type xcodec2` or `--codec_model HKUSTAudio/xcodec2` options with `codec_bpe.audio_to_codes`, run: 31 | ```bash 32 | pip install codec-bpe[xcodec2] 33 | ``` 34 | If you want to use the `--codec_type wavtokenizer` or `--codec_model wavtokenizer-*` options with `codec_bpe.audio_to_codes`, run: 35 | ```bash 36 | pip install codec-bpe[wavtokenizer] 37 | # WavTokenizer is not an installable package so you need to clone the repository into your working directory manually: 38 | cd your/working/dir 39 | git clone https://github.com/jishengpeng/WavTokenizer.git 40 | # Note: WavTokenizer requirements are all version pinned and include both training and inference dependencies. 41 | # I recommend either using a dedicated environment or cherry-picking the requirements you need for inference and installing them manually. 42 | # For example, I had no issue running inference with latest versions of torch, numpy, and transformers. 43 | pip install -r WavTokenizer/requirements.txt 44 | ``` 45 | If you want to use the `--codec_type simvq` or `--codec_model simvq_*` options with `codec_bpe.audio_to_codes`, run: 46 | ```bash 47 | pip install codec-bpe[simvq] 48 | # SimVQ is not an installable package so you need to clone the repository into your working directory manually: 49 | cd your/working/dir 50 | git clone https://github.com/youngsheen/SimVQ.git 51 | pip install -r SimVQ/requirements.txt 52 | ``` 53 | If you want to use the `--codec_type magicodec` or `--codec_model MagiCodec-50Hz-Base` options with `codec_bpe.audio_to_codes`, run: 54 | ```bash 55 | pip install codec-bpe[magicodec] 56 | # MagiCodec is not an installable package so you need to clone the repository into your working directory manually: 57 | cd your/working/dir 58 | git clone https://github.com/Ereboas/MagiCodec.git 59 | cd MagiCodec 60 | # Follow setup instructions for MagiCodec [here](https://github.com/Ereboas/MagiCodec#env-setup) 61 | ``` 62 | If you want to use the `--codec_type neucodec` or `--codec_model neuphonic/neucodec` options with `codec_bpe.audio_to_codes`, run: 63 | ```bash 64 | pip install codec-bpe[neucodec] 65 | ``` 66 | 67 | ## Supported Codecs 68 | | Model | Sample Rate (kHz)* | Framerate (Hz)* | Max Codebooks | Codebook Size | Max Bandwidth (kbps)* | Training Domain | 69 | |:--------------------------------------------------------------------|:------------------:|:--------------:|:--------------:|:-------------:|:-------------------------:|:---------------:| 70 | | [🤗 EnCodec 24khz](https://huggingface.co/facebook/encodec_24khz) | 24 | 75 | 32 | 1024 | 24 | General | 71 | | [🤗 DAC 44khz](https://huggingface.co/descript/dac_44khz) | 44.1 | 86.1328125 | 9 | 1024 | 7.8 | General | 72 | | [🤗 DAC 24khz](https://huggingface.co/descript/dac_24khz) | 24 | 75 | 32 | 1024 | 24 | General | 73 | | [🤗 DAC 16khz](https://huggingface.co/descript/dac_16khz) | 16 | 50 | 12 | 1024 | 6 | General | 74 | | [🤗 Mimi](https://huggingface.co/kyutai/mimi) | 24 | 12.5 | 32 | 2048 | 4.4 | Speech | 75 | | [🤗 XCodec2](https://huggingface.co/HKUSTAudio/xcodec2) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 76 | | [🤗 FunCodec 
zh_en-general-16k-nq32ds640](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch) | 16 | 25 | 32 | 1024 | 8 | General | 77 | | [🤗 FunCodec zh_en-general-16k-nq32ds320](https://huggingface.co/alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds320-pytorch) | 16 | 50 | 32 | 1024 | 16 | General | 78 | | [🤗 FunCodec en-libritts-16k-nq32ds640](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch) | 16 | 25 | 32 | 1024 | 8 | Audiobooks | 79 | | [🤗 FunCodec en-libritts-16k-nq32ds320](https://huggingface.co/alibaba-damo/audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch) | 16 | 50 | 32 | 1024 | 16 | Audiobooks | 80 | | [🤗 WavTokenizer-small-600-24k-4096](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_600_24k_4096.ckpt) | 24 | 40 | 1 | 4096 | 0.48 | Speech | 81 | | [🤗 WavTokenizer-small-320-24k-4096](https://huggingface.co/novateur/WavTokenizer/blob/main/WavTokenizer_small_320_24k_4096.ckpt) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 82 | | [🤗 WavTokenizer-medium-speech-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-medium-speech-75token) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 83 | | [🤗 WavTokenizer-medium-music-audio-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-medium-music-audio-75token) | 24 | 75 | 1 | 4096 | 0.9 | General | 84 | | [🤗 WavTokenizer-large-600-24k-4096](https://huggingface.co/novateur/WavTokenizer-large-unify-40token) | 24 | 40 | 1 | 4096 | 0.48 | General | 85 | | [🤗 WavTokenizer-large-320-24k-4096](https://huggingface.co/novateur/WavTokenizer-large-speech-75token) | 24 | 75 | 1 | 4096 | 0.9 | General | 86 | | [🤗 simvq_4k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_4k) | 24 | 75 | 1 | 4096 | 0.9 | Speech | 87 | | [🤗 simvq_8k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_8k) | 24 | 75 | 1 | 8192 | 0.975 | Speech | 88 | | [🤗 simvq_65k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_65k) | 24 | 75 | 1 | 65536 | 1.2 | Speech | 89 | | [🤗 simvq_262k](https://huggingface.co/youngsheen/SimVQ/tree/main/vq_audio_log/simvq_262k) | 24 | 75 | 1 | 262144 | 1.35 | Speech | 90 | | [🤗 MagiCodec-50Hz-Base](https://huggingface.co/Ereboas/MagiCodec_16k_50hz) | 16 | 50 | 1 | 131072 | 0.85 | Audiobooks | 91 | | [🤗 NeuCodec](https://huggingface.co/neuphonic/neucodec) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 92 | | [🤗 Distill-NeuCodec](https://huggingface.co/neuphonic/distill-neucodec) | 16 | 50 | 1 | 65536 | 0.8 | Speech | 93 | 94 | \* Sample Rate (kHz) is the sampling rate of the audio input to the codec. 95 | 96 | \* Framerate (Hz) is the number of timesteps (acoustic units of size `num_codebooks`) per second output by the codec. 97 | 98 | \* Bandwidth (kbps) = `framerate (Hz) x num_codebooks x log2(codebook_size) / 1000`. 99 | 100 | ## Usage 101 | 102 | ### Convert audio codes to and from unicode strings 103 | Use your codec of choice (e.g., EnCodec, DAC, Mimi, XCodec2, FunCodec, WavTokenizer, SimVQ) to encode your audio into a torch tensor or numpy array of codes of shape (num_codebooks, length), then use the provided converter methods to convert to and from unicode strings. 104 | 105 | **Note:** In the Acoustic BPE paper, a single-level codec was used (HuBERT + k-means), where each encoded timestep consisted of a single code which was converted to a single unicode character. Here, we support multi-level codecs based on Residual Vector Quantizers. 
If num_codebooks > 1, a flattening pattern is used to interleave all codebooks into a single level before mapping to unicode. For example, if 4 codebooks are used then each encoded timestep would consist of 4 codes (one from each codebook) and would be converted to a unicode 4-gram. 106 | 107 | Example: audio language modeling using EnCodec 24 kHz at 3 kbps (4 codebooks): 108 | ```python 109 | import torch 110 | import librosa 111 | import soundfile as sf 112 | from transformers import ( 113 | EncodecModel, 114 | AutoModelForCausalLM, 115 | AutoProcessor, 116 | AutoTokenizer, 117 | ) 118 | from codec_bpe import codes_to_chars, chars_to_codes 119 | 120 | # load a Codec BPE tokenizer and compatible language model 121 | device = "cuda" if torch.cuda.is_available() else "cpu" 122 | tokenizer = AutoTokenizer.from_pretrained("output/my_tokenizer") 123 | model = AutoModelForCausalLM.from_pretrained("output/my_model").to(device) 124 | 125 | # load the EnCodec model 126 | encodec_modelname = "facebook/encodec_24khz" 127 | encodec_model = EncodecModel.from_pretrained(encodec_modelname).to(device) 128 | encodec_processor = AutoProcessor.from_pretrained(encodec_modelname) 129 | 130 | # (1) encode audio using EnCodec 131 | audio, sr = librosa.load("some_audio.mp3", sr=encodec_model.config.sampling_rate, mono=True) 132 | inputs = encodec_processor(raw_audio=audio, sampling_rate=sr, return_tensors="pt").to(device) 133 | with torch.no_grad(): 134 | encoded_audio = encodec_model.encode(**inputs, bandwidth=3.0).audio_codes[0, 0] 135 | 136 | # (2) convert the audio codes to a unicode string and tokenize it 137 | unicode_str = codes_to_chars(encoded_audio, codebook_size=encodec_model.config.codebook_size) 138 | inputs = tokenizer(unicode_str, return_tensors="pt").to(device) 139 | 140 | # (3) generate tokens from the model 141 | outputs = model.generate(**inputs, do_sample=True, max_new_tokens=300) 142 | 143 | # (4) detokenize the output back into a unicode string and convert it back to audio codes 144 | unicode_str_2 = tokenizer.decode(outputs[0], skip_special_tokens=False) 145 | encoded_audio_2 = chars_to_codes( 146 | unicode_str_2, 147 | num_codebooks=encoded_audio.shape[0], 148 | codebook_size=encodec_model.config.codebook_size, 149 | return_tensors="pt", 150 | ).to(device) 151 | 152 | # (5) decode the generated audio using EnCodec 153 | with torch.no_grad(): 154 | audio_2 = encodec_model.decode(encoded_audio_2.unsqueeze(0).unsqueeze(0), [None]).audio_values[0, 0] 155 | sf.write("some_audio_output.wav", audio_2.cpu().numpy(), sr) 156 | ``` 157 | 158 | ### Train a tokenizer from audio files 159 | To train a tokenizer from audio files: 160 | 161 | 1. 
Use your codec of choice (e.g., EnCodec, DAC, Mimi, XCodec2, FunCodec, WavTokenizer, SimVQ) to encode each audio file into a directory of numpy arrays (.npy files); a quick way to inspect the result is sketched after the commands below:
162 | ```bash
163 | # encode audio files using EnCodec 24 kHz at 3 kbps (4 codebooks)
164 | python -m codec_bpe.audio_to_codes \
165 |     --audio_path path/to/audio \
166 |     --codec_model facebook/encodec_24khz \
167 |     --bandwidth 3.0 \
168 |     --batch_size 8
169 | 
170 | # encode audio files using the first 4 codebooks of DAC 44kHz
171 | python -m codec_bpe.audio_to_codes \
172 |     --audio_path path/to/audio \
173 |     --codec_model descript/dac_44khz \
174 |     --n_quantizers 4 \
175 |     --batch_size 8
176 | 
177 | # encode audio files using the first 6 codebooks of Mimi (24kHz)
178 | python -m codec_bpe.audio_to_codes \
179 |     --audio_path path/to/audio \
180 |     --codec_model kyutai/mimi \
181 |     --n_quantizers 6 \
182 |     --batch_size 8
183 | 
184 | # encode audio files using XCodec2 (16kHz, there is only 1 codebook)
185 | python -m codec_bpe.audio_to_codes \
186 |     --audio_path path/to/audio \
187 |     --codec_model HKUSTAudio/xcodec2 \
188 |     --batch_size 1 # XCodec2 only supports batch size 1 for now.
189 | 
190 | # encode audio files using FunCodec (16kHz) at 1.5 kbps (6 codebooks)
191 | python -m codec_bpe.audio_to_codes \
192 |     --audio_path path/to/audio \
193 |     --codec_model alibaba-damo/audio_codec-encodec-zh_en-general-16k-nq32ds640-pytorch \
194 |     --bandwidth 1500 \
195 |     --batch_size 8
196 | 
197 | # encode audio files using WavTokenizer at 0.9 kbps (24kHz -> 75Hz, only 1 codebook of 4096 codes)
198 | python -m codec_bpe.audio_to_codes \
199 |     --audio_path path/to/audio \
200 |     --codec_model wavtokenizer-large-320-24k-4096 \
201 |     --batch_size 8
202 | 
203 | # encode audio files using SimVQ at 0.9 kbps (24kHz -> 75Hz, only 1 codebook of 4096 codes)
204 | python -m codec_bpe.audio_to_codes \
205 |     --audio_path path/to/audio \
206 |     --codec_model simvq_4k \
207 |     --batch_size 8
208 | 
209 | # encode audio files using SimVQ at 0.9 kbps in tiny chunks of 80ms with a 400ms context to simulate streaming encoding
210 | python -m codec_bpe.audio_to_codes \
211 |     --audio_path path/to/audio \
212 |     --codec_model simvq_4k \
213 |     --batch_size 128 \
214 |     --chunk_size_secs 0.08 \
215 |     --context_secs 0.4
216 | 
217 | # encode audio files using MagiCodec at 0.85 kbps (16kHz -> 50Hz, only 1 codebook of 131072 codes)
218 | python -m codec_bpe.audio_to_codes \
219 |     --audio_path path/to/audio \
220 |     --codec_model MagiCodec-50Hz-Base \
221 |     --batch_size 8
222 | 
223 | # encode audio files using MagiCodec at 0.85 kbps in tiny chunks of 80ms with a 1s context to simulate streaming encoding
224 | python -m codec_bpe.audio_to_codes \
225 |     --audio_path path/to/audio \
226 |     --codec_model MagiCodec-50Hz-Base \
227 |     --batch_size 128 \
228 |     --chunk_size_secs 0.08 \
229 |     --context_secs 1.0
230 | 
231 | # encode audio files using NeuCodec at 0.8 kbps (16kHz -> 50Hz, only 1 codebook of 65536 codes)
232 | python -m codec_bpe.audio_to_codes \
233 |     --audio_path path/to/audio \
234 |     --codec_model neuphonic/neucodec \
235 |     --batch_size 1 # NeuCodec only supports batch size 1 for now.
236 | ```
237 | 
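Each audio file yields one `.npy` array of codes per channel, and `codec_bpe.audio_to_codes` also writes a `codec_info.json` describing the codec used. Before training a tokenizer, it can be worth sanity-checking a few of the arrays. A minimal sketch, assuming the default EnCodec output layout used in step 2 below; the file name is hypothetical and the exact array shape depends on the codec:

```python
import numpy as np

# hypothetical path: one encoded mono channel of "some_audio.mp3"
codes = np.load("output/codes/encodec_24khz/30.0s_0.0s/mono/some_audio_c0.npy")

# depending on the codec, the saved array may carry leading singleton batch
# dimensions, e.g. (1, 1, num_codebooks, num_frames); squeeze down to 2D
while codes.ndim > 2:
    codes = codes[0]
print(codes.shape)  # (num_codebooks, num_frames), e.g. (4, 2250) for 30s at 75 Hz
```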
238 | 2. Suppose you want to use the first 4 codebooks of [EnCodec 24 kHz](https://huggingface.co/facebook/encodec_24khz). Run:
239 | ```bash
240 | python -m codec_bpe.train_tokenizer \
241 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
242 |     --chunk_size_secs 30 \
243 |     --vocab_size 30000 \
244 |     --pad_token "<pad>"
245 | ```
246 | Here:
247 | - `chunk_size_secs` specifies how many seconds' worth of timesteps get converted to unicode and returned to the underlying Tokenizers trainer at a time.
248 | - `vocab_size` specifies the number of tokens (including the base vocabulary of individual unicode characters) that you want your tokenizer to have. The base vocabulary size is `num_codebooks` x `codebook_size`. For example, the command above would yield a tokenizer with a base vocabulary of 4096 individual unicode character tokens, each representing a single code from a single codebook, and 25,904 merged "ngram" tokens.
249 | 
250 | By default, the following additional arguments are automatically initialized from the `codec_info.json` file output by `codec_bpe.audio_to_codes`:
251 | - `num_codebooks` specifies how many codebooks should be used (in a flattened pattern) when converting each timestep to unicode. For example, EnCodec 24kHz uses 2 codebooks at 1.5 kbps, 4 codebooks at 3 kbps, 8 codebooks at 6 kbps, etc. Note: when encoding the audio files, you should use at least as many codebooks as you plan to specify here.
252 | - `codebook_size` specifies the size of the codebook. EnCodec 24 kHz uses a codebook size of 1024.
253 | - `codec_framerate` specifies the framerate (number of timesteps per second) of the codec. EnCodec 24 kHz generates 75 timesteps per second.
254 | 
255 | You may also pass these arguments explicitly. For example:
256 | ```bash
257 | python -m codec_bpe.train_tokenizer \
258 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
259 |     --num_codebooks 4 \
260 |     --codebook_size 1024 \
261 |     --codec_framerate 75 \
262 |     --chunk_size_secs 30 \
263 |     --vocab_size 30000 \
264 |     --pad_token "<pad>"
265 | ```
266 | This is useful if you are using audio codes that you generated with a tool other than the `codec_bpe.audio_to_codes` script, or if you wish to use fewer codebooks
267 | for training the tokenizer than you used when encoding the audio files.
268 | 
269 | See [train_tokenizer.py](codec_bpe/train_tokenizer.py) for a complete list of supported arguments.
270 | 
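Once training completes, a quick smoke test is to load the trained tokenizer and check how well it compresses a unicode-converted code sequence. A minimal sketch, assuming the tokenizer was saved to a hypothetical `output/my_tokenizer` directory with `save_pretrained`:

```python
import torch
from transformers import AutoTokenizer
from codec_bpe import codes_to_chars

tokenizer = AutoTokenizer.from_pretrained("output/my_tokenizer")  # assumed save location

# stand-in codes with the right shape; real encoded audio will merge far better
# than random codes, since the BPE merges were learned on real data
num_codebooks, codebook_size = 4, 1024
codes = torch.randint(0, codebook_size, (num_codebooks, 750))  # ~10s at 75 Hz

unicode_str = codes_to_chars(codes, codebook_size=codebook_size)
token_ids = tokenizer(unicode_str).input_ids
print(f"{len(unicode_str)} chars -> {len(token_ids)} tokens "
      f"({len(unicode_str) / len(token_ids):.2f}x compression)")
```
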
271 | #### Controlling the granularity of Codec BPE tokens
272 | The `max_token_codebook_ngrams` argument can be used to control how many codes can be merged into a single Codec BPE token. This is useful to avoid repetitive patterns in the audio manifesting as redundant tokens in the vocabulary. For example, if long segments of silence exist in the training audio then you may end up with hundreds of tokens that just represent different lengths of silence.
273 | 
274 | To avoid this, you can set `max_token_codebook_ngrams` to the maximum number of codebook ngrams (whole acoustic units) you want to allow a single token to represent. For example, if you set `max_token_codebook_ngrams = 2` while `num_codebooks` is set to 4, then a single Codec BPE token may only hold up to 8 codes:
275 | ```bash
276 | python -m codec_bpe.train_tokenizer \
277 |     --codes_path output/codes/encodec_24khz/30.0s_0.0s/mono \
278 |     --chunk_size_secs 30 \
279 |     --vocab_size 30000 \
280 |     --pad_token "<pad>" \
281 |     --max_token_codebook_ngrams 2
282 | ```
283 | 
284 | **It is highly recommended to set this argument to a value <= 2 (or <= 4 if num_codebooks is 1) to ensure that your `vocab_size` budget gets distributed across diverse acoustic patterns in your training data.**
285 | 
286 | Setting `max_token_codebook_ngrams = 0` will skip tokenizer training and simply output a base vocabulary of `num_codebooks x codebook_size` tokens, each representing a single code from a single codebook. This is useful if you want to directly model individual codes from the flattened codebooks instead of combining them into n-grams.
287 | 
288 | #### Using a codec with a very large codebook size
289 | If you are using a codec with a very large codebook size (e.g. XCodec2, which has a codebook size of 65536), you may need to adjust the `unicode_offset` argument for `codec_bpe.train_tokenizer` to avoid the non-printable surrogate range 0xD800-0xDFFF:
290 | ```bash
291 | python -m codec_bpe.train_tokenizer \
292 |     --codes_path output/codes/xcodec2/30.0s_0.0s/mono \
293 |     --chunk_size_secs 30 \
294 |     --vocab_size 80000 \
295 |     --pad_token "<pad>" \
296 |     --max_token_codebook_ngrams 4 \
297 |     --unicode_offset 0xE000
298 | ```
299 | 
300 | ### Extend an existing Transformers PreTrainedTokenizer
301 | You may want to train a new Codec BPE tokenizer and then export its trained vocabulary to an existing Transformers tokenizer, for example extending the Llama, Mistral, or Qwen tokenizers for multimodal text-audio language modeling.
302 | 
303 | Suppose you have trained your Codec BPE tokenizer and saved it to `output/encodec_bpe_4cb_30k` and you want to extend the Mistral-7B-v0.1 tokenizer with its vocabulary. Run:
304 | ```bash
305 | python -m codec_bpe.extend_tokenizer \
306 |     --existing_tokenizer mistralai/Mistral-7B-v0.1 \
307 |     --codec_bpe_tokenizer output/encodec_bpe_4cb_30k \
308 |     --additional_special_tokens "" # optional
309 | ```
310 | This will simply add every token in `output/encodec_bpe_4cb_30k/tokenizer.json` to the `mistralai/Mistral-7B-v0.1` tokenizer as a special token and save a copy of the latter. Any additional tokens specified with `--additional_special_tokens` will be appended to the existing tokenizer's additional special token list.
311 | 
312 | #### Avoiding vocabulary conflicts
313 | If the added Codec BPE unicode tokens would conflict with existing tokens in the vocabulary, you can override the default unicode offset using the `unicode_offset` argument for `codec_bpe.train_tokenizer`. By default, unicode characters from the [CJK Unified Ideographs](https://symbl.cc/en/unicode-table/#cjk-unified-ideographs) block are used, following the Acoustic BPE paper. You can set `unicode_offset` to a different value (e.g. 0xE000) to start from a different unicode block that won't conflict with your existing vocabulary.
314 | 
--------------------------------------------------------------------------------