├── setup.py
├── .vscode
│   └── settings.json
├── bark
│   ├── assets
│   │   └── prompts
│   │       ├── lee.npz
│   │       ├── lee_2.npz
│   │       ├── output.npz
│   │       ├── announcer.npz
│   │       ├── speaker_0.npz
│   │       ├── speaker_1.npz
│   │       ├── speaker_2.npz
│   │       ├── speaker_3.npz
│   │       ├── speaker_4.npz
│   │       ├── speaker_5.npz
│   │       ├── speaker_6.npz
│   │       ├── speaker_7.npz
│   │       ├── speaker_8.npz
│   │       ├── speaker_9.npz
│   │       ├── de_speaker_0.npz
│   │       ├── de_speaker_1.npz
│   │       ├── de_speaker_2.npz
│   │       ├── de_speaker_3.npz
│   │       ├── de_speaker_4.npz
│   │       ├── de_speaker_5.npz
│   │       ├── de_speaker_6.npz
│   │       ├── de_speaker_7.npz
│   │       ├── de_speaker_8.npz
│   │       ├── de_speaker_9.npz
│   │       ├── en_speaker_0.npz
│   │       ├── en_speaker_1.npz
│   │       ├── en_speaker_2.npz
│   │       ├── en_speaker_3.npz
│   │       ├── en_speaker_4.npz
│   │       ├── en_speaker_5.npz
│   │       ├── en_speaker_6.npz
│   │       ├── en_speaker_7.npz
│   │       ├── en_speaker_8.npz
│   │       ├── en_speaker_9.npz
│   │       ├── es_speaker_0.npz
│   │       ├── es_speaker_1.npz
│   │       ├── es_speaker_2.npz
│   │       ├── es_speaker_3.npz
│   │       ├── es_speaker_4.npz
│   │       ├── es_speaker_5.npz
│   │       ├── es_speaker_6.npz
│   │       ├── es_speaker_7.npz
│   │       ├── es_speaker_8.npz
│   │       ├── es_speaker_9.npz
│   │       ├── fr_speaker_0.npz
│   │       ├── fr_speaker_1.npz
│   │       ├── fr_speaker_2.npz
│   │       ├── fr_speaker_3.npz
│   │       ├── fr_speaker_4.npz
│   │       ├── fr_speaker_5.npz
│   │       ├── fr_speaker_6.npz
│   │       ├── fr_speaker_7.npz
│   │       ├── fr_speaker_8.npz
│   │       ├── fr_speaker_9.npz
│   │       ├── hi_speaker_0.npz
│   │       ├── hi_speaker_1.npz
│   │       ├── hi_speaker_2.npz
│   │       ├── hi_speaker_3.npz
│   │       ├── hi_speaker_4.npz
│   │       ├── hi_speaker_5.npz
│   │       ├── hi_speaker_6.npz
│   │       ├── hi_speaker_7.npz
│   │       ├── hi_speaker_8.npz
│   │       ├── hi_speaker_9.npz
│   │       ├── it_speaker_0.npz
│   │       ├── it_speaker_1.npz
│   │       ├── it_speaker_2.npz
│   │       ├── it_speaker_3.npz
│   │       ├── it_speaker_4.npz
│   │       ├── it_speaker_5.npz
│   │       ├── it_speaker_6.npz
│   │       ├── it_speaker_7.npz
│   │       ├── it_speaker_8.npz
│   │       ├── it_speaker_9.npz
│   │       ├── ja_speaker_0.npz
│   │       ├── ja_speaker_1.npz
│   │       ├── ja_speaker_2.npz
│   │       ├── ja_speaker_3.npz
│   │       ├── ja_speaker_4.npz
│   │       ├── ja_speaker_5.npz
│   │       ├── ja_speaker_6.npz
│   │       ├── ja_speaker_7.npz
│   │       ├── ja_speaker_8.npz
│   │       ├── ja_speaker_9.npz
│   │       ├── ko_speaker_0.npz
│   │       ├── ko_speaker_1.npz
│   │       ├── ko_speaker_2.npz
│   │       ├── ko_speaker_3.npz
│   │       ├── ko_speaker_4.npz
│   │       ├── ko_speaker_5.npz
│   │       ├── ko_speaker_6.npz
│   │       ├── ko_speaker_7.npz
│   │       ├── ko_speaker_8.npz
│   │       ├── ko_speaker_9.npz
│   │       ├── pl_speaker_0.npz
│   │       ├── pl_speaker_1.npz
│   │       ├── pl_speaker_2.npz
│   │       ├── pl_speaker_3.npz
│   │       ├── pl_speaker_4.npz
│   │       ├── pl_speaker_5.npz
│   │       ├── pl_speaker_6.npz
│   │       ├── pl_speaker_7.npz
│   │       ├── pl_speaker_8.npz
│   │       ├── pl_speaker_9.npz
│   │       ├── pt_speaker_0.npz
│   │       ├── pt_speaker_1.npz
│   │       ├── pt_speaker_2.npz
│   │       ├── pt_speaker_3.npz
│   │       ├── pt_speaker_4.npz
│   │       ├── pt_speaker_5.npz
│   │       ├── pt_speaker_6.npz
│   │       ├── pt_speaker_7.npz
│   │       ├── pt_speaker_8.npz
│   │       ├── pt_speaker_9.npz
│   │       ├── ru_speaker_0.npz
│   │       ├── ru_speaker_1.npz
│   │       ├── ru_speaker_2.npz
│   │       ├── ru_speaker_3.npz
│   │       ├── ru_speaker_4.npz
│   │       ├── ru_speaker_5.npz
│   │       ├── ru_speaker_6.npz
│   │       ├── ru_speaker_7.npz
│   │       ├── ru_speaker_8.npz
│   │       ├── ru_speaker_9.npz
│   │       ├── tr_speaker_0.npz
│   │       ├── tr_speaker_1.npz
│   │       ├── tr_speaker_2.npz
│   │       ├── tr_speaker_3.npz
│   │       ├── tr_speaker_4.npz
│   │       ├── tr_speaker_5.npz
│   │       ├── tr_speaker_6.npz
│   │       ├── tr_speaker_7.npz
│   │       ├── tr_speaker_8.npz
│   │       ├── tr_speaker_9.npz
│   │       ├── zh_speaker_0.npz
│   │       ├── zh_speaker_1.npz
│   │       ├── zh_speaker_2.npz
│   │       ├── zh_speaker_3.npz
│   │       ├── zh_speaker_4.npz
│   │       ├── zh_speaker_5.npz
│   │       ├── zh_speaker_6.npz
│   │       ├── zh_speaker_7.npz
│   │       ├── zh_speaker_8.npz
│   │       ├── zh_speaker_9.npz
│   │       └── readme.md
│   ├── __init__.py
│   ├── api.py
│   ├── model_fine.py
│   ├── model.py
│   └── generation.py
├── .gitignore
├── notebooks
│   ├── runs
│   │   └── linear_projection
│   │       ├── events.out.tfevents.1683572933.melchior.7747.0
│   │       ├── events.out.tfevents.1683574068.melchior.7747.1
│   │       ├── events.out.tfevents.1683574191.melchior.7747.2
│   │       ├── events.out.tfevents.1683575623.melchior.7747.3
│   │       ├── events.out.tfevents.1683575632.melchior.7747.4
│   │       ├── events.out.tfevents.1683576047.melchior.7747.5
│   │       ├── events.out.tfevents.1683576055.melchior.7747.6
│   │       ├── events.out.tfevents.1683576273.melchior.7747.7
│   │       └── events.out.tfevents.1683577423.melchior.7747.8
│   ├── create_dataset.ipynb
│   └── fake_classifier.ipynb
├── pyproject.toml
├── model-card.md
├── generate.ipynb
├── generate_chunked.ipynb
├── README.md
└── LICENSE
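
The voice prompts under `bark/assets/prompts/` are NumPy `.npz` archives, which are ordinary ZIP files holding one `.npy` member per array. A minimal stdlib sketch of that container layout (the member names below follow the semantic/coarse/fine scheme Bark-style prompts are commonly described as using; they are an assumption for illustration, not read from these files):

```python
import io
import zipfile

# An .npz file is a plain ZIP archive with one .npy member per array.
# Build a tiny stand-in in memory; member names mirror the assumed
# semantic/coarse/fine prompt layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for member in ("semantic_prompt.npy", "coarse_prompt.npy", "fine_prompt.npy"):
        zf.writestr(member, b"\x93NUMPY")  # placeholder bytes, not a valid array

buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
print(names)  # → ['coarse_prompt.npy', 'fine_prompt.npy', 'semantic_prompt.npy']
```

In practice one would open a real prompt with `numpy.load(path)` and inspect its `.files` attribute; the sketch above only shows the container format.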
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | setup()
4 |
--------------------------------------------------------------------------------
/.vscode/settings.json:
--------------------------------------------------------------------------------
1 | {
2 | "python.formatting.provider": "black"
3 | }
--------------------------------------------------------------------------------
/bark/assets/prompts/lee.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/lee.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/lee_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/lee_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/output.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/output.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/announcer.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/announcer.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/de_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/de_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/en_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/en_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/es_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/es_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/fr_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/fr_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/hi_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/hi_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/it_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/it_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ja_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ja_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ko_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ko_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pl_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pl_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/pt_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/pt_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/ru_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/ru_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/tr_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/tr_speaker_9.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_0.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_0.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_1.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_1.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_2.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_2.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_3.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_3.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_4.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_4.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_5.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_5.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_6.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_6.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_7.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_7.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_8.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_8.npz
--------------------------------------------------------------------------------
/bark/assets/prompts/zh_speaker_9.npz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/bark/assets/prompts/zh_speaker_9.npz
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | **/venv/*
3 | suno_bark.egg-info
4 | build
5 | references
6 |
7 | *.wav
8 | _temp/
9 | datasets/
10 | models/
11 |
12 | **/runs/
--------------------------------------------------------------------------------
/bark/__init__.py:
--------------------------------------------------------------------------------
1 | from .api import generate_audio, text_to_semantic, semantic_to_waveform, save_as_prompt
2 | from .generation import SAMPLE_RATE, preload_models
3 |
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683572933.melchior.7747.0:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683572933.melchior.7747.0
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683574068.melchior.7747.1:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683574068.melchior.7747.1
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683574191.melchior.7747.2:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683574191.melchior.7747.2
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683575623.melchior.7747.3:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683575623.melchior.7747.3
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683575632.melchior.7747.4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683575632.melchior.7747.4
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683576047.melchior.7747.5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683576047.melchior.7747.5
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683576055.melchior.7747.6:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683576055.melchior.7747.6
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683576273.melchior.7747.7:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683576273.melchior.7747.7
--------------------------------------------------------------------------------
/notebooks/runs/linear_projection/events.out.tfevents.1683577423.melchior.7747.8:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/EndlessReform/bark-with-voice-clone/HEAD/notebooks/runs/linear_projection/events.out.tfevents.1683577423.melchior.7747.8
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["setuptools"]
3 | build-backend = "setuptools.build_meta"
4 |
5 | [project]
6 | name = "suno-bark"
7 | version = "0.0.1a"
8 | description = "Bark text to audio model"
9 | readme = "README.md"
10 | requires-python = ">=3.8"
11 | authors = [
12 | {name = "Suno Inc", email = "hello@suno.ai"},
13 | ]
14 | # Apache 2.0
15 | license = {file = "LICENSE"}
16 |
17 | dependencies = [
18 | "boto3",
19 | "encodec",
20 | "funcy",
21 | "numpy",
22 | "scipy",
23 | "tokenizers",
24 | "torch==2.0.0+cu118",
25 | "tqdm",
26 | "transformers",
27 | ]
28 |
29 | [project.urls]
30 | source = "https://github.com/suno-ai/bark"
31 |
32 | [project.optional-dependencies]
33 | dev = [
34 | "bandit",
35 | "black",
36 | "codecov",
37 | "flake8",
38 | "huggingface-hub",
39 | "hypothesis>=6.14,<7",
40 | "isort>=5.0.0,<6",
41 | "jupyter",
42 | "mypy",
43 | "nbconvert",
44 | "nbformat",
45 | "pydocstyle",
46 | "pylint",
47 | "pytest",
48 | "pytest-cov",
49 | ]
50 |
51 | [tool.setuptools]
52 | packages = ["bark"]
53 |
54 | [tool.setuptools.package-data]
55 | bark = ["assets/prompts/*.npz"]
56 |
57 | [tool.black]
58 | line-length = 100
59 |
--------------------------------------------------------------------------------
/bark/assets/prompts/readme.md:
--------------------------------------------------------------------------------
1 | # Example Prompts Data
2 |
3 | The provided data is in the .npz format, NumPy's zipped archive format for storing multiple arrays in a single file. Each file contains three arrays: semantic_prompt, coarse_prompt, and fine_prompt.
4 |
5 | ```semantic_prompt```
6 |
7 | The semantic_prompt array contains a sequence of token IDs generated by the BERT tokenizer from Hugging Face. These tokens encode the text input and serve as the input for generating the audio output. The shape of this array is (n,), where n is the number of tokens in the input text.
8 |
9 | ```coarse_prompt```
10 |
11 | The coarse_prompt array is an intermediate output of the text-to-speech pipeline and contains token IDs from the first two codebooks of the EnCodec codec from Facebook. This step converts the semantic tokens into a representation better suited to the subsequent step. The shape of this array is (2, m), where m is the number of tokens after conversion by the EnCodec codec.
12 |
13 | ```fine_prompt```
14 |
15 | The fine_prompt array is a further processed output of the pipeline and contains token IDs from all 8 EnCodec codebooks. These codebooks represent the final stage of tokenization, and the resulting tokens are used to generate the audio output. The shape of this array is (8, p), where p is the number of tokens after further processing by the EnCodec codec.
16 |
17 | Overall, these arrays represent different stages of a text-to-speech pipeline that converts text input into synthesized audio output. The semantic_prompt array represents the input text, while coarse_prompt and fine_prompt represent intermediate and final stages of tokenization, respectively.
18 |
19 |
20 |
21 |
--------------------------------------------------------------------------------
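The three-array layout described in the readme above can be demonstrated end to end with plain NumPy. This is an illustrative sketch: the arrays below are dummies with the documented shapes ((n,), (2, m), and (8, p)), not real Bark tokens, and the round-trip uses an in-memory buffer instead of one of the shipped prompt files.

```python
import io
import numpy as np

# Toy stand-ins with the shapes documented for the prompt files.
semantic_prompt = np.zeros(16, dtype=np.int64)     # shape (n,)
coarse_prompt = np.zeros((2, 24), dtype=np.int64)  # shape (2, m)
fine_prompt = np.zeros((8, 24), dtype=np.int64)    # shape (8, p)

# Write the three arrays to an .npz archive (in memory here;
# np.savez on a file path produces the same format as the shipped prompts).
buf = io.BytesIO()
np.savez(buf, semantic_prompt=semantic_prompt,
         coarse_prompt=coarse_prompt, fine_prompt=fine_prompt)
buf.seek(0)

# Reload and inspect: np.load on an .npz returns a dict-like NpzFile.
loaded = np.load(buf)
shapes = {name: loaded[name].shape for name in loaded.files}
print(shapes)
```

Loading one of the real files (e.g. `bark/assets/prompts/en_speaker_0.npz`) with `np.load` exposes the same three keys.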
/model-card.md:
--------------------------------------------------------------------------------
1 | # Model Card: Bark
2 |
3 | This is the official codebase for running the text to audio model from Suno.ai.
4 |
5 | The following is additional information about the models released here.
6 |
7 | ## Model Details
8 |
9 | Bark is a series of three transformer models that turn text into audio.
10 | ### Text to semantic tokens
11 | - Input: text, tokenized with [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
12 | - Output: semantic tokens that encode the audio to be generated
13 |
14 | ### Semantic to coarse tokens
15 | - Input: semantic tokens
16 | - Output: tokens from the first two codebooks of the [EnCodec Codec](https://github.com/facebookresearch/encodec) from Facebook
17 |
18 | ### Coarse to fine tokens
19 | - Input: the first two codebooks from EnCodec
20 | - Output: 8 codebooks from EnCodec
21 |
22 | ### Architecture
23 | | Model | Parameters | Attention | Output Vocab size |
24 | |:-------------------------:|:----------:|------------|:-----------------:|
25 | | Text to semantic tokens | 80 M | Causal | 10,000 |
26 | | Semantic to coarse tokens | 80 M | Causal | 2x 1,024 |
27 | | Coarse to fine tokens | 80 M | Non-causal | 6x 1,024 |
28 |
29 |
30 | ### Release date
31 | April 2023
32 |
33 | ## Broader Implications
34 | We anticipate that this model's text to audio capabilities can be used to improve accessibility tools in a variety of languages.
35 | Straightforward improvements will allow models to run faster than realtime, rendering them useful for applications such as virtual assistants.
36 |
37 | While we hope that this release will enable users to express their creativity and build applications that are a force
38 | for good, we acknowledge that any text to audio model has the potential for dual use. While it is not straightforward
39 | to voice clone known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
40 | we also release a simple classifier to detect Bark-generated audio with high accuracy (see the notebooks section of the main repository).
41 |
--------------------------------------------------------------------------------
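The three-stage pipeline in the model card above can be sketched with stub functions whose token shapes and vocabulary sizes match the architecture table. This is purely illustrative: the stubs emit random tokens, not real model outputs, and the function names are invented for this sketch.

```python
import numpy as np

SEMANTIC_VOCAB = 10_000  # output vocab of the text-to-semantic model
CODEBOOK_SIZE = 1_024    # vocab of each EnCodec codebook

def text_to_semantic_stub(text: str, rng) -> np.ndarray:
    # Causal model: one stream of semantic tokens, vocab 10,000.
    return rng.integers(0, SEMANTIC_VOCAB, size=len(text))

def semantic_to_coarse_stub(semantic: np.ndarray, rng) -> np.ndarray:
    # Causal model: the first 2 EnCodec codebooks, each vocab 1,024.
    return rng.integers(0, CODEBOOK_SIZE, size=(2, semantic.size))

def coarse_to_fine_stub(coarse: np.ndarray, rng) -> np.ndarray:
    # Non-causal model: fills in the remaining codebooks (8 total),
    # keeping the 2 coarse codebooks it was given.
    fine = rng.integers(0, CODEBOOK_SIZE, size=(8, coarse.shape[1]))
    fine[:2] = coarse
    return fine

rng = np.random.default_rng(0)
sem = text_to_semantic_stub("hello", rng)
coarse = semantic_to_coarse_stub(sem, rng)
fine = coarse_to_fine_stub(coarse, rng)
print(sem.shape, coarse.shape, fine.shape)
```

In the real repo these stages correspond to `generate_text_semantic`, `generate_coarse`, and `generate_fine` in `bark/generation.py`, with `codec_decode` turning the fine tokens into a waveform.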
/generate.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from bark.api import generate_audio\n",
10 | "from transformers import BertTokenizer\n",
11 | "from bark.generation import SAMPLE_RATE, preload_models, codec_decode, generate_coarse, generate_fine, generate_text_semantic\n",
12 | "\n",
13 | "# Enter your prompt and speaker here\n",
14 | "text_prompt = \"Hello, my name is Serpy. And, uh — and I like pizza. [laughs]\"\n",
15 | "voice_name = \"speaker_0\" # use your custom voice name here if you have one\n",
16 | "\n",
17 | "# load the tokenizer\n",
18 | "tokenizer = BertTokenizer.from_pretrained(\"bert-base-multilingual-cased\")"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": null,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "# download and load all models\n",
28 | "preload_models(\n",
29 | " text_use_gpu=True,\n",
30 | " text_use_small=False,\n",
31 | " coarse_use_gpu=True,\n",
32 | " coarse_use_small=False,\n",
33 | " fine_use_gpu=True,\n",
34 | " fine_use_small=False,\n",
35 | " codec_use_gpu=True,\n",
36 | " force_reload=False,\n",
37 | " path=\"models\"\n",
38 | ")"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "# simple generation\n",
48 | "audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "# generation with more control\n",
58 | "x_semantic = generate_text_semantic(\n",
59 | " text_prompt,\n",
60 | " history_prompt=voice_name,\n",
61 | " temp=0.7,\n",
62 | " top_k=50,\n",
63 | " top_p=0.95,\n",
64 | ")\n",
65 | "\n",
66 | "x_coarse_gen = generate_coarse(\n",
67 | " x_semantic,\n",
68 | " history_prompt=voice_name,\n",
69 | " temp=0.7,\n",
70 | " top_k=50,\n",
71 | " top_p=0.95,\n",
72 | ")\n",
73 | "x_fine_gen = generate_fine(\n",
74 | " x_coarse_gen,\n",
75 | " history_prompt=voice_name,\n",
76 | " temp=0.5,\n",
77 | ")\n",
78 | "audio_array = codec_decode(x_fine_gen)"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": [
87 | "from IPython.display import Audio\n",
88 | "# play audio\n",
89 | "Audio(audio_array, rate=SAMPLE_RATE)"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "from scipy.io.wavfile import write as write_wav\n",
99 | "# save audio\n",
100 | "filepath = \"/output/audio.wav\" # change this to your desired output path\n",
101 | "write_wav(filepath, SAMPLE_RATE, audio_array)"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": []
110 | }
111 | ],
112 | "metadata": {
113 | "kernelspec": {
114 | "display_name": "Python 3",
115 | "language": "python",
116 | "name": "python3"
117 | },
118 | "language_info": {
119 | "codemirror_mode": {
120 | "name": "ipython",
121 | "version": 3
122 | },
123 | "file_extension": ".py",
124 | "mimetype": "text/x-python",
125 | "name": "python",
126 | "nbconvert_exporter": "python",
127 | "pygments_lexer": "ipython3",
128 | "version": "3.10.8"
129 | },
130 | "orig_nbformat": 4
131 | },
132 | "nbformat": 4,
133 | "nbformat_minor": 2
134 | }
135 |
--------------------------------------------------------------------------------
/bark/api.py:
--------------------------------------------------------------------------------
1 | from typing import Optional
2 |
3 | import numpy as np
4 |
5 | from .generation import codec_decode, generate_coarse, generate_fine, generate_text_semantic
6 |
7 |
8 | def text_to_semantic(
9 | text: str,
10 | history_prompt: Optional[str] = None,
11 | temp: float = 0.7,
12 | silent: bool = False,
13 | ):
14 | """Generate semantic array from text.
15 |
16 | Args:
17 | text: text to be turned into audio
18 | history_prompt: history choice for audio cloning
19 | temp: generation temperature (1.0 more diverse, 0.0 more conservative)
20 | silent: disable progress bar
21 |
22 | Returns:
23 | numpy semantic array to be fed into `semantic_to_waveform`
24 | """
25 | x_semantic = generate_text_semantic(
26 | text,
27 | history_prompt=history_prompt,
28 | temp=temp,
29 | silent=silent,
30 | use_kv_caching=True
31 | )
32 | return x_semantic
33 |
34 |
35 | def semantic_to_waveform(
36 | semantic_tokens: np.ndarray,
37 | history_prompt: Optional[str] = None,
38 | temp: float = 0.7,
39 | silent: bool = False,
40 | output_full: bool = False,
41 | ):
42 | """Generate audio array from semantic input.
43 |
44 | Args:
45 | semantic_tokens: semantic token output from `text_to_semantic`
46 | history_prompt: history choice for audio cloning
47 | temp: generation temperature (1.0 more diverse, 0.0 more conservative)
48 | silent: disable progress bar
49 | output_full: return full generation to be used as a history prompt
50 |
51 | Returns:
52 |         numpy audio array at a 24 kHz sample rate
53 | """
54 | coarse_tokens = generate_coarse(
55 | semantic_tokens,
56 | history_prompt=history_prompt,
57 | temp=temp,
58 | silent=silent,
59 | use_kv_caching=True
60 | )
61 | fine_tokens = generate_fine(
62 | coarse_tokens,
63 | history_prompt=history_prompt,
64 | temp=0.5,
65 | )
66 | audio_arr = codec_decode(fine_tokens)
67 | if output_full:
68 | full_generation = {
69 | "semantic_prompt": semantic_tokens,
70 | "coarse_prompt": coarse_tokens,
71 | "fine_prompt": fine_tokens,
72 | }
73 | return full_generation, audio_arr
74 | return audio_arr
75 |
76 |
77 | def save_as_prompt(filepath, full_generation):
78 | assert(filepath.endswith(".npz"))
79 | assert(isinstance(full_generation, dict))
80 | assert("semantic_prompt" in full_generation)
81 | assert("coarse_prompt" in full_generation)
82 | assert("fine_prompt" in full_generation)
83 | np.savez(filepath, **full_generation)
84 |
85 |
86 | def generate_audio(
87 | text: str,
88 | history_prompt: Optional[str] = None,
89 | text_temp: float = 0.7,
90 | waveform_temp: float = 0.7,
91 | silent: bool = False,
92 | output_full: bool = False,
93 | ):
94 | """Generate audio array from input text.
95 |
96 | Args:
97 | text: text to be turned into audio
98 | history_prompt: history choice for audio cloning
99 | text_temp: generation temperature (1.0 more diverse, 0.0 more conservative)
100 | waveform_temp: generation temperature (1.0 more diverse, 0.0 more conservative)
101 | silent: disable progress bar
102 | output_full: return full generation to be used as a history prompt
103 |
104 | Returns:
105 | numpy audio array at sample frequency 24khz
106 |         numpy audio array at a 24 kHz sample rate
107 | semantic_tokens = text_to_semantic(
108 | text,
109 | history_prompt=history_prompt,
110 | temp=text_temp,
111 | silent=silent,
112 | )
113 | out = semantic_to_waveform(
114 | semantic_tokens,
115 | history_prompt=history_prompt,
116 | temp=waveform_temp,
117 | silent=silent,
118 | output_full=output_full,
119 | )
120 | if output_full:
121 | full_generation, audio_arr = out
122 | return full_generation, audio_arr
123 | else:
124 | audio_arr = out
125 | return audio_arr
126 |
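The prompt file written by `save_as_prompt` is just a keyword-expanded `np.savez` of the three token arrays. A minimal sketch of that round trip, using hypothetical stand-in arrays in place of a real `generate_audio(..., output_full=True)` result:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in prompt arrays; a real dict comes from
# generate_audio(..., output_full=True) or semantic_to_waveform(..., output_full=True).
full_generation = {
    "semantic_prompt": np.arange(10, dtype=np.int64),
    "coarse_prompt": np.zeros((2, 20), dtype=np.int64),
    "fine_prompt": np.zeros((8, 20), dtype=np.int64),
}

filepath = os.path.join(tempfile.mkdtemp(), "speaker.npz")
np.savez(filepath, **full_generation)  # what save_as_prompt does after its checks

# Loading gives back the same three arrays under the same keys, which is the
# layout the bundled assets/prompts/*.npz history prompts use.
loaded = np.load(filepath)
assert sorted(loaded.files) == ["coarse_prompt", "fine_prompt", "semantic_prompt"]
assert np.array_equal(loaded["semantic_prompt"], full_generation["semantic_prompt"])
```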
--------------------------------------------------------------------------------
/bark/model_fine.py:
--------------------------------------------------------------------------------
1 | """
2 | Much of this code is adapted from Andrej Karpathy's NanoGPT
3 | (https://github.com/karpathy/nanoGPT)
4 | """
5 | from dataclasses import dataclass
6 | import math
7 |
8 | import torch
9 | import torch.nn as nn
10 | from torch.nn import functional as F
11 |
12 | from .model import GPT, GPTConfig, MLP
13 |
14 |
15 | class NonCausalSelfAttention(nn.Module):
16 | def __init__(self, config):
17 | super().__init__()
18 | assert config.n_embd % config.n_head == 0
19 | # key, query, value projections for all heads, but in a batch
20 | self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
21 | # output projection
22 | self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
23 | # regularization
24 | self.attn_dropout = nn.Dropout(config.dropout)
25 | self.resid_dropout = nn.Dropout(config.dropout)
26 | self.n_head = config.n_head
27 | self.n_embd = config.n_embd
28 | self.dropout = config.dropout
29 | # flash attention make GPU go brrrrr but support is only in PyTorch nightly and still a bit scary
30 | self.flash = (
31 | hasattr(torch.nn.functional, "scaled_dot_product_attention") and self.dropout == 0.0
32 | )
33 |
34 | def forward(self, x):
35 | B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
36 |
37 | # calculate query, key, values for all heads in batch and move head forward to be the batch dim
38 | q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
39 | k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
40 | q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
41 | v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
42 |
43 |         # non-causal self-attention: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
44 | if self.flash:
45 | # efficient attention using Flash Attention CUDA kernels
46 | y = torch.nn.functional.scaled_dot_product_attention(
47 | q, k, v, attn_mask=None, dropout_p=self.dropout, is_causal=False
48 | )
49 | else:
50 | # manual implementation of attention
51 | att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
52 | att = F.softmax(att, dim=-1)
53 | att = self.attn_dropout(att)
54 | y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
55 | y = (
56 | y.transpose(1, 2).contiguous().view(B, T, C)
57 | ) # re-assemble all head outputs side by side
58 |
59 | # output projection
60 | y = self.resid_dropout(self.c_proj(y))
61 | return y
62 |
63 |
64 | class FineBlock(nn.Module):
65 | def __init__(self, config):
66 | super().__init__()
67 | self.ln_1 = nn.LayerNorm(config.n_embd)
68 | self.attn = NonCausalSelfAttention(config)
69 | self.ln_2 = nn.LayerNorm(config.n_embd)
70 | self.mlp = MLP(config)
71 |
72 | def forward(self, x):
73 | x = x + self.attn(self.ln_1(x))
74 | x = x + self.mlp(self.ln_2(x))
75 | return x
76 |
77 |
78 | class FineGPT(GPT):
79 | def __init__(self, config):
80 | super().__init__(config)
81 | del self.lm_head
82 | self.config = config
83 | self.n_codes_total = config.n_codes_total
84 | self.transformer = nn.ModuleDict(
85 | dict(
86 | wtes=nn.ModuleList(
87 | [
88 | nn.Embedding(config.input_vocab_size, config.n_embd)
89 | for _ in range(config.n_codes_total)
90 | ]
91 | ),
92 | wpe=nn.Embedding(config.block_size, config.n_embd),
93 | drop=nn.Dropout(config.dropout),
94 | h=nn.ModuleList([FineBlock(config) for _ in range(config.n_layer)]),
95 | ln_f=nn.LayerNorm(config.n_embd),
96 | )
97 | )
98 | self.lm_heads = nn.ModuleList(
99 | [
100 | nn.Linear(config.n_embd, config.output_vocab_size, bias=False)
101 | for _ in range(config.n_codes_given, self.n_codes_total)
102 | ]
103 | )
104 | for i in range(self.n_codes_total - config.n_codes_given):
105 | self.transformer.wtes[i + 1].weight = self.lm_heads[i].weight
106 |
107 | def forward(self, pred_idx, idx):
108 | device = idx.device
109 | b, t, codes = idx.size()
110 | assert (
111 | t <= self.config.block_size
112 | ), f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
113 | assert pred_idx > 0, "cannot predict 0th codebook"
114 | assert codes == self.n_codes_total, (b, t, codes)
115 | pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)
116 |
117 | # forward the GPT model itself
118 | tok_embs = [
119 | wte(idx[:, :, i]).unsqueeze(-1) for i, wte in enumerate(self.transformer.wtes)
120 |         ]  # per-codebook token embeddings, each of shape (b, t, n_embd, 1)
121 | tok_emb = torch.cat(tok_embs, dim=-1)
122 | pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
123 | x = tok_emb[:, :, :, : pred_idx + 1].sum(dim=-1)
124 | x = self.transformer.drop(x + pos_emb)
125 | for block in self.transformer.h:
126 | x = block(x)
127 | x = self.transformer.ln_f(x)
128 | logits = self.lm_heads[pred_idx - self.config.n_codes_given](x)
129 | return logits
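The embedding sum in `forward` (each codebook's tokens looked up in its own table, then only codebooks `0..pred_idx` summed) can be sketched in isolation with numpy; the sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codes_total, vocab, n_embd, t = 8, 16, 4, 3

# One embedding table per codebook, mirroring self.transformer.wtes.
wtes = rng.normal(size=(n_codes_total, vocab, n_embd))
idx = rng.integers(0, vocab, size=(t, n_codes_total))  # (t, codes), batch dim omitted

pred_idx = 2  # predict codebook 2, conditioning on codebooks 0..2
# Look up each codebook's tokens in its own table and stack: (t, n_embd, codes)
per_code = np.stack([wtes[c][idx[:, c]] for c in range(n_codes_total)], axis=-1)
# Sum only the first pred_idx + 1 codebooks, as in tok_emb[..., : pred_idx + 1].sum(-1)
x = per_code[:, :, : pred_idx + 1].sum(axis=-1)  # (t, n_embd), the transformer input
assert x.shape == (t, n_embd)
```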
130 |
131 | def get_num_params(self, non_embedding=True):
132 | """
133 | Return the number of parameters in the model.
134 | For non-embedding count (default), the position embeddings get subtracted.
135 | The token embeddings would too, except due to the parameter sharing these
136 | params are actually used as weights in the final layer, so we include them.
137 | """
138 | n_params = sum(p.numel() for p in self.parameters())
139 | if non_embedding:
140 | for wte in self.transformer.wtes:
141 | n_params -= wte.weight.numel()
142 | n_params -= self.transformer.wpe.weight.numel()
143 | return n_params
144 |
145 |
146 | @dataclass
147 | class FineGPTConfig(GPTConfig):
148 | n_codes_total: int = 8
149 | n_codes_given: int = 1
150 |
--------------------------------------------------------------------------------
/notebooks/create_dataset.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "/home/ritsuko/projects/ai/audio/bark/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
13 | " from .autonotebook import tqdm as notebook_tqdm\n"
14 | ]
15 | }
16 | ],
17 | "source": [
18 | "#%pip install resampy\n",
19 | "import numpy as np\n",
20 | "import os\n",
21 | "from pprint import pprint\n",
22 | "from bark.api import text_to_semantic, semantic_to_waveform, generate_audio\n",
23 | "from bark.generation import SAMPLE_RATE, generate_text_semantic, SEMANTIC_RATE_HZ\n",
24 | "from IPython.display import Audio\n",
25 | "from scipy.io.wavfile import write as write_wav\n",
26 | "from datetime import datetime\n",
27 | "import torch\n",
28 | "import torchaudio\n",
29 | "import soundfile\n",
30 | "import resampy\n",
31 | "import sys"
32 | ]
33 | },
34 | {
35 | "attachments": {},
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## Generate synthetic dataset\n",
40 | "\n",
41 |     "This notebook creates a synthetic dataset of audio/semantic-token pairs based on voice line prompts from Mozilla CommonVoice. The purpose of this dataset is to reconstruct the Bark semantic token codebook, which will enable us to convert ground-truth audio into a semantic prompt for use in fine-tuning and voice cloning. The notebook provides step-by-step instructions for creating the synthetic dataset and saving it in Fairseq dataset format. Let's get started!\n"
42 | ]
43 | },
44 | {
45 | "attachments": {},
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 |     "For prototyping, we generate voice lines based on metadata from an old version of the [Mozilla CommonVoice dataset](https://www.kaggle.com/datasets/nickj26/common-voice-corpus-1?resource=download&select=validated.tsv). This is far from ideal; down the road, we will need a much larger dataset with more diverse voice lines, including multilingual and non-speech content."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 3,
55 | "metadata": {},
56 | "outputs": [
57 | {
58 | "data": {
59 | "text/plain": [
60 | "Index(['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age',\n",
61 | " 'gender', 'accent'],\n",
62 | " dtype='object')"
63 | ]
64 | },
65 | "execution_count": 3,
66 | "metadata": {},
67 | "output_type": "execute_result"
68 | }
69 | ],
70 | "source": [
71 | "import pandas as pd\n",
72 | "\n",
73 | "CV_METADATA_PATH = '../datasets/validated.tsv'\n",
74 | "df = pd.read_csv(CV_METADATA_PATH, sep=\"\\t\")\n",
75 | "df.columns"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 4,
81 | "metadata": {},
82 | "outputs": [
83 | {
84 | "data": {
85 | "text/plain": [
86 | "array(['To give chalk for cheese', 'Judge may not think so.',\n",
87 | " 'I have already described the appearance of that colossal bulk which was embedded in the ground.',\n",
88 | " ..., \"How's the forecast for VI\",\n",
89 | " 'Please look up the Jenny of the Prairie television show.',\n",
90 | " 'Find me the creative work The Pickwick Papers'], dtype=object)"
91 | ]
92 | },
93 | "execution_count": 4,
94 | "metadata": {},
95 | "output_type": "execute_result"
96 | }
97 | ],
98 | "source": [
99 | "# Preview\n",
100 | "lines = df[\"sentence\"].unique()\n",
101 | "lines"
102 | ]
103 | },
104 | {
105 | "attachments": {},
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "There are enough English lines for ~25 hours of audio with unique voice lines; _hopefully_ we'll need less than that."
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 5,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "# Force cu118 generation if available\n",
119 | "#%pip install torch torchaudio --force --extra-index-url https://download.pytorch.org/whl/cu118"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "minutes_to_generate = 3 * 60\n",
129 | "# Line index in commonvoice to start with. Useful when resuming\n",
130 | "start_line = 10307"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 10,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "%%capture log\n",
140 | "minutes_generated = 0\n",
141 | "\n",
142 | "label_file = open('../datasets/en/labels.txt', \"a\")\n",
143 | "manifest_file = open('../datasets/en/manifest.tsv', 'a')\n",
144 | "# Give TSV header at beginning.\n",
145 | "# No, this isn't robust. Too bad!\n",
146 | "if start_line == 0:\n",
147 | " manifest_file.write(str(os.path.abspath(\"../datasets/en\")) + \"\\n\")\n",
148 | "\n",
149 | "# Because HuBERT is trained on 16khz data\n",
150 | "OUTPUT_SAMPLE_RATE = 16_000\n",
151 | "resampler = torchaudio.transforms.Resample(orig_freq=SAMPLE_RATE, new_freq=OUTPUT_SAMPLE_RATE)\n",
152 | "\n",
153 | "for i, line in enumerate(lines[start_line:]):\n",
154 | " try:\n",
155 | " semantic_tokens = generate_text_semantic(text=line, temp=1)\n",
156 | " waveform_arr = semantic_to_waveform(semantic_tokens)\n",
157 | "\n",
158 | " # Persist sequence to new line\n",
159 | " label_file.write(' '.join(list(map(str, semantic_tokens.tolist()))) + \"\\n\")\n",
160 | " label_file.flush()\n",
161 | "\n",
162 | " # Downsample generated audio to 16khz and save \n",
163 | " waveform_tensor = torch.from_numpy(waveform_arr)\n",
164 | " resampled_tensor = resampler(waveform_tensor).unsqueeze(0)\n",
165 | " wav_fname = f\"en_{start_line + i}_{line}.wav\"\n",
166 | " wav_filepath = f\"../datasets/en/{wav_fname}\"\n",
167 | " torchaudio.save(wav_filepath, resampled_tensor, OUTPUT_SAMPLE_RATE)\n",
168 | "\n",
169 | " # Log info to manifest\n",
170 | " seconds_generated = len(semantic_tokens) / SEMANTIC_RATE_HZ\n",
171 | " manifest_file.write(f\"{wav_fname}\\t{resampled_tensor.shape[1]}\" + \"\\n\")\n",
172 | " manifest_file.flush()\n",
173 | "\n",
174 | " # Cutoff when sufficient data\n",
175 | " minutes_generated += seconds_generated / 60\n",
176 | " print(f\"Minutes of audio: {minutes_generated}\")\n",
177 | " if minutes_generated > minutes_to_generate:\n",
178 | " break\n",
179 | "    except Exception:\n",
180 | "        pass  # skip lines that fail to generate (e.g. bad filename characters)"
181 | ]
182 | },
183 | {
184 | "attachments": {},
185 | "cell_type": "markdown",
186 | "metadata": {},
187 | "source": [
188 | "## ONE-OFF: Convert existing model to new\n",
189 | "\n",
190 | "DELETE THIS after finishing and verifying correctness!"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 12,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 | "# Create labels\n",
200 | "import glob\n",
201 | "\n",
202 | "old_folder_path = '../datasets/en_old/'\n",
203 | "search_pattern = os.path.join(old_folder_path, \"*.wav\")\n",
204 | "\n",
205 | "label_file = open(f'{old_folder_path}/labels.txt', \"w\")\n",
206 | "manifest_file = open(f'{old_folder_path}/manifest.tsv', 'w')\n",
207 | "manifest_file.write(str(os.path.abspath(\"../datasets/en_old\")) + \"\\n\")\n",
208 | "\n",
209 | "OUTPUT_SAMPLE_RATE = 16_000\n",
210 | "resampler = torchaudio.transforms.Resample(orig_freq=SAMPLE_RATE, new_freq=OUTPUT_SAMPLE_RATE)\n",
211 | "\n",
212 | "for wav_filename in glob.glob(search_pattern):\n",
213 | " # Load file\n",
214 | " basename = os.path.basename(wav_filename)\n",
215 | " wav, sr = torchaudio.load(wav_filename)\n",
216 | "\n",
217 | " # Convert to 16khz and overwrite original\n",
218 | " if sr != 16_000:\n",
219 | " resampled_tensor = resampler(wav)\n",
220 | " torchaudio.save(wav_filename, resampled_tensor, OUTPUT_SAMPLE_RATE)\n",
221 | " manifest_file.write(f\"{basename}\\t{resampled_tensor.shape[1]}\\n\")\n",
222 | " else:\n",
223 | " manifest_file.write(f\"{basename}\\t{wav.shape[1]}\\n\")\n",
224 | "\n",
225 | " \n",
226 | " manifest_file.flush()\n",
227 | " semantic_history = np.load(\n",
228 | " os.path.join(old_folder_path, f\"{basename[2:-4]}.npz\")\n",
229 | " )[\"tokens\"]\n",
230 | "    wav_length_seconds = len(semantic_history) / SEMANTIC_RATE_HZ\n",
231 | "\n",
232 | " # Add manifest entry\n",
233 | "\n",
234 | " # Write tokens to label file\n",
235 | " label_file.write(f'{\" \".join(list(map(str, semantic_history.tolist())))}\\n')\n",
236 | " label_file.flush()\n",
237 | "\n",
238 | " # Try only one for now\n"
239 | ]
240 | }
241 | ],
242 | "metadata": {
243 | "kernelspec": {
244 | "display_name": "venv",
245 | "language": "python",
246 | "name": "python3"
247 | },
248 | "language_info": {
249 | "codemirror_mode": {
250 | "name": "ipython",
251 | "version": 3
252 | },
253 | "file_extension": ".py",
254 | "mimetype": "text/x-python",
255 | "name": "python",
256 | "nbconvert_exporter": "python",
257 | "pygments_lexer": "ipython3",
258 | "version": "3.10.9"
259 | },
260 | "orig_nbformat": 4,
261 | "vscode": {
262 | "interpreter": {
263 | "hash": "790f29072abc26870ccb3736e8ffe1b6fbe9bdb3e500c5faf362e772e52ef00f"
264 | }
265 | }
266 | },
267 | "nbformat": 4,
268 | "nbformat_minor": 2
269 | }
270 |
--------------------------------------------------------------------------------
/bark/model.py:
--------------------------------------------------------------------------------
1 | """
2 | Much of this code is adapted from Andrej Karpathy's NanoGPT
3 | (https://github.com/karpathy/nanoGPT)
4 | """
5 | import math
6 | import numpy as np
7 | from dataclasses import dataclass
8 |
9 | import torch
10 | import torch.nn as nn
11 | from torch.nn import functional as F
12 |
13 |
14 | class LayerNorm(nn.Module):
15 | """LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False"""
16 |
17 | def __init__(self, ndim, bias):
18 | super().__init__()
19 | self.weight = nn.Parameter(torch.ones(ndim))
20 | self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
21 |
22 | def forward(self, input):
23 | return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
24 |
25 |
26 | class CausalSelfAttention(nn.Module):
27 | def __init__(self, config):
28 | super().__init__()
29 | assert config.n_embd % config.n_head == 0
30 | # key, query, value projections for all heads, but in a batch
31 | self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
32 | # output projection
33 | self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
34 | # regularization
35 | self.attn_dropout = nn.Dropout(config.dropout)
36 | self.resid_dropout = nn.Dropout(config.dropout)
37 | self.n_head = config.n_head
38 | self.n_embd = config.n_embd
39 | self.dropout = config.dropout
40 | # flash attention make GPU go brrrrr but support is only in PyTorch nightly and still a bit scary
41 | self.flash = hasattr(torch.nn.functional, "scaled_dot_product_attention")
42 | if not self.flash:
43 | # print("WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0")
44 | # causal mask to ensure that attention is only applied to the left in the input sequence
45 | self.register_buffer(
46 | "bias",
47 | torch.tril(torch.ones(config.block_size, config.block_size)).view(
48 | 1, 1, config.block_size, config.block_size
49 | ),
50 | )
51 |
52 | def forward(self, x, past_kv=None, use_cache=False):
53 | B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
54 |
55 | # calculate query, key, values for all heads in batch and move head forward to be the batch dim
56 | q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
57 | k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
58 | q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
59 | v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
60 |
61 | if past_kv is not None:
62 | past_key = past_kv[0]
63 | past_value = past_kv[1]
64 | k = torch.cat((past_key, k), dim=-2)
65 | v = torch.cat((past_value, v), dim=-2)
66 |
67 | FULL_T = k.shape[-2]
68 |
69 | if use_cache is True:
70 | present = (k, v)
71 | else:
72 | present = None
73 |
74 | # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
75 | if self.flash:
76 | # efficient attention using Flash Attention CUDA kernels
77 | if past_kv is not None:
78 | # When `past_kv` is provided, we're doing incremental decoding and `q.shape[2] == 1`: q only contains
79 | # the query for the last token. scaled_dot_product_attention interprets this as the first token in the
80 | # sequence, so if is_causal=True it will mask out all attention from it. This is not what we want, so
81 | # to work around this we set is_causal=False.
82 | is_causal = False
83 | else:
84 | is_causal = True
85 |
86 | y = torch.nn.functional.scaled_dot_product_attention(
87 | q, k, v, dropout_p=self.dropout, is_causal=is_causal
88 | )
89 | else:
90 | # manual implementation of attention
91 | att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
92 | att = att.masked_fill(self.bias[:, :, FULL_T - T : FULL_T, :FULL_T] == 0, float("-inf"))
93 | att = F.softmax(att, dim=-1)
94 | att = self.attn_dropout(att)
95 | y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
96 | y = (
97 | y.transpose(1, 2).contiguous().view(B, T, C)
98 | ) # re-assemble all head outputs side by side
99 |
100 | # output projection
101 | y = self.resid_dropout(self.c_proj(y))
102 | return (y, present)
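The comment above about `is_causal=False` during incremental decoding can be checked independently of the model: with a growing K/V cache, each single-row query is allowed to attend to the entire cache, and the results match a full causal pass. A small numpy sketch of that equivalence (not Bark code, just the attention math, with made-up sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, causal_mask=None):
    # (T_q, hs) x (hs, T_k) -> (T_q, T_k) weights, then a weighted sum of v
    att = (q @ k.T) / np.sqrt(k.shape[-1])
    if causal_mask is not None:
        att = np.where(causal_mask, att, -np.inf)
    return softmax(att) @ v

rng = np.random.default_rng(0)
T, hs = 5, 8
q = rng.normal(size=(T, hs))
k = rng.normal(size=(T, hs))
v = rng.normal(size=(T, hs))

# Full causal pass: token t attends to keys 0..t.
mask = np.tril(np.ones((T, T), dtype=bool))
full = attend(q, k, v, mask)

# Incremental pass: one query row at a time against a growing K/V cache.
# The single row may attend to everything cached so far, so no mask is
# needed -- which is why the module sets is_causal=False when past_kv is given.
cached_k = np.empty((0, hs))
cached_v = np.empty((0, hs))
steps = []
for t in range(T):
    cached_k = np.concatenate([cached_k, k[t : t + 1]])
    cached_v = np.concatenate([cached_v, v[t : t + 1]])
    steps.append(attend(q[t : t + 1], cached_k, cached_v))
incremental = np.concatenate(steps)

assert np.allclose(full, incremental)
```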
103 |
104 |
105 | class MLP(nn.Module):
106 | def __init__(self, config):
107 | super().__init__()
108 | self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
109 | self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
110 | self.dropout = nn.Dropout(config.dropout)
111 | self.gelu = nn.GELU()
112 |
113 | def forward(self, x):
114 | x = self.c_fc(x)
115 | x = self.gelu(x)
116 | x = self.c_proj(x)
117 | x = self.dropout(x)
118 | return x
119 |
120 |
121 | class Block(nn.Module):
122 | def __init__(self, config, layer_idx):
123 | super().__init__()
124 | self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
125 | self.attn = CausalSelfAttention(config)
126 | self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
127 | self.mlp = MLP(config)
128 | self.layer_idx = layer_idx
129 |
130 | def forward(self, x, past_kv=None, use_cache=False):
131 | attn_output, prev_kvs = self.attn(self.ln_1(x), past_kv=past_kv, use_cache=use_cache)
132 | x = x + attn_output
133 | x = x + self.mlp(self.ln_2(x))
134 | return (x, prev_kvs)
135 |
136 |
137 | @dataclass
138 | class GPTConfig:
139 | block_size: int = 1024
140 | input_vocab_size: int = 10_048
141 | output_vocab_size: int = 10_048
142 | n_layer: int = 12
143 | n_head: int = 12
144 | n_embd: int = 768
145 | dropout: float = 0.0
146 | bias: bool = (
147 | True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
148 | )
149 |
150 |
151 | class GPT(nn.Module):
152 | def __init__(self, config):
153 | super().__init__()
154 | assert config.input_vocab_size is not None
155 | assert config.output_vocab_size is not None
156 | assert config.block_size is not None
157 | self.config = config
158 |
159 | self.transformer = nn.ModuleDict(
160 | dict(
161 | wte=nn.Embedding(config.input_vocab_size, config.n_embd),
162 | wpe=nn.Embedding(config.block_size, config.n_embd),
163 | drop=nn.Dropout(config.dropout),
164 | h=nn.ModuleList([Block(config, idx) for idx in range(config.n_layer)]),
165 | ln_f=LayerNorm(config.n_embd, bias=config.bias),
166 | )
167 | )
168 | self.lm_head = nn.Linear(config.n_embd, config.output_vocab_size, bias=False)
169 |
170 | def get_num_params(self, non_embedding=True):
171 | """
172 | Return the number of parameters in the model.
173 | For non-embedding count (default), the position embeddings get subtracted.
174 | The token embeddings would too, except due to the parameter sharing these
175 | params are actually used as weights in the final layer, so we include them.
176 | """
177 | n_params = sum(p.numel() for p in self.parameters())
178 | if non_embedding:
179 | n_params -= self.transformer.wte.weight.numel()
180 | n_params -= self.transformer.wpe.weight.numel()
181 | return n_params
182 |
183 | def forward(
184 | self,
185 | idx,
186 | input_embeds=None,
187 | merge_context=False,
188 | past_kv=None,
189 | position_ids=None,
190 | use_cache=False,
191 | ):
192 | # Mixed embed and token input not supported
193 | assert (idx is None) ^ (
194 | input_embeds is None
195 | ), "Either use embedded or tokenized input, not both"
196 |
197 |         device = idx.device if idx is not None else input_embeds.device
198 |         if idx is not None:
199 | b, t = idx.size()
200 | else:
201 | b, t, d = input_embeds.size()
202 |
203 | if input_embeds is not None:
204 | assert (
205 | d == self.config.n_embd
206 | ), f"Embeds are the wrong dimension: Expected {self.config.n_embd}, got {d}"
207 |
208 | if past_kv is not None:
209 | assert t == 1, f"KV caching but t is {t} not 1!"
210 | tok_emb = (
211 |                 self.transformer.wte(idx) if idx is not None else input_embeds
212 | ) # token embeddings of shape (b, t, n_embd)
213 | else:
214 | if merge_context:
215 | assert idx.shape[1] >= 256 + 256 + 1
216 | t = idx.shape[1] - 256
217 | else:
218 | assert (
219 | t <= self.config.block_size
220 | ), f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
221 |
222 | # forward the GPT model itself
223 |         if merge_context and idx is not None:
224 | tok_emb = torch.cat(
225 | [
226 | self.transformer.wte(idx[:, :256])
227 | + self.transformer.wte(idx[:, 256 : 256 + 256]),
228 | self.transformer.wte(idx[:, 256 + 256 :]),
229 | ],
230 | dim=1,
231 | )
232 |         elif idx is not None:
233 | tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
234 | else:
235 | tok_emb = input_embeds # Assume the caller did the context merging already
236 |
237 | if past_kv is None:
238 | past_length = 0
239 | past_kv = tuple([None] * len(self.transformer.h))
240 | else:
241 | past_length = past_kv[0][0].size(-2)
242 |
243 | if position_ids is None:
244 | position_ids = torch.arange(
245 | past_length, t + past_length, dtype=torch.long, device=device
246 | )
247 | position_ids = position_ids.unsqueeze(0) # shape (1, t)
248 | assert position_ids.shape == (1, t)
249 |
250 | pos_emb = self.transformer.wpe(position_ids) # position embeddings of shape (1, t, n_embd)
251 |
252 | x = self.transformer.drop(tok_emb + pos_emb)
253 |
254 | new_kv = () if use_cache else None
255 |
256 | for i, (block, past_layer_kv) in enumerate(zip(self.transformer.h, past_kv)):
257 | x, kv = block(x, past_kv=past_layer_kv, use_cache=use_cache)
258 |
259 | if use_cache:
260 | new_kv = new_kv + (kv,)
261 |
262 | x = self.transformer.ln_f(x)
263 |
264 | # inference-time mini-optimization: only forward the lm_head on the very last position
265 | logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
266 |
267 | return (logits, new_kv)
268 |
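The `merge_context` branch above adds the embeddings of the first two 256-token segments element-wise and then concatenates the embeddings of the remaining tokens, so the effective sequence length drops by 256. A numpy sketch of that shape arithmetic, assuming a toy embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_embd = 32, 4
wte = rng.normal(size=(vocab, n_embd))  # toy embedding table

idx = rng.integers(0, vocab, size=(1, 256 + 256 + 10))  # (b, 256 + 256 + t_rest)
# merge_context: embed segment A and segment B, sum them, keep the tail as-is
merged = np.concatenate(
    [
        wte[idx[:, :256]] + wte[idx[:, 256:512]],  # (1, 256, n_embd)
        wte[idx[:, 512:]],                         # (1, 10, n_embd)
    ],
    axis=1,
)
# Effective t = idx.shape[1] - 256, matching `t = idx.shape[1] - 256` above.
assert merged.shape == (1, 256 + 10, n_embd)
```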
--------------------------------------------------------------------------------
/generate_chunked.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from bark.generation import SAMPLE_RATE, preload_models, codec_decode, generate_coarse, generate_fine, generate_text_semantic"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "import re\n",
19 | "def split_and_recombine_text(text, desired_length=100, max_length=150):\n",
20 | " # from https://github.com/neonbjb/tortoise-tts\n",
21 | "    \"\"\"Split text into chunks of a desired length, trying to keep sentences intact.\"\"\"\n",
22 | " # normalize text, remove redundant whitespace and convert non-ascii quotes to ascii\n",
23 | " text = re.sub(r\"\\n\\n+\", \"\\n\", text)\n",
24 | " text = re.sub(r\"\\s+\", \" \", text)\n",
25 | " text = re.sub(r\"[“”]\", '\"', text)\n",
26 | "\n",
27 | " rv = []\n",
28 | " in_quote = False\n",
29 | " current = \"\"\n",
30 | " split_pos = []\n",
31 | " pos = -1\n",
32 | " end_pos = len(text) - 1\n",
33 | "\n",
34 | " def seek(delta):\n",
35 | " nonlocal pos, in_quote, current\n",
36 | " is_neg = delta < 0\n",
37 | " for _ in range(abs(delta)):\n",
38 | " if is_neg:\n",
39 | " pos -= 1\n",
40 | " current = current[:-1]\n",
41 | " else:\n",
42 | " pos += 1\n",
43 | " current += text[pos]\n",
44 | " if text[pos] == '\"':\n",
45 | " in_quote = not in_quote\n",
46 | " return text[pos]\n",
47 | "\n",
48 | " def peek(delta):\n",
49 | " p = pos + delta\n",
50 | " return text[p] if p < end_pos and p >= 0 else \"\"\n",
51 | "\n",
52 | " def commit():\n",
53 | " nonlocal rv, current, split_pos\n",
54 | " rv.append(current)\n",
55 | " current = \"\"\n",
56 | " split_pos = []\n",
57 | "\n",
58 | " while pos < end_pos:\n",
59 | " c = seek(1)\n",
60 | " # do we need to force a split?\n",
61 | " if len(current) >= max_length:\n",
62 | " if len(split_pos) > 0 and len(current) > (desired_length / 2):\n",
63 | " # we have at least one sentence and we are over half the desired length, seek back to the last split\n",
64 | " d = pos - split_pos[-1]\n",
65 | " seek(-d)\n",
66 | " else:\n",
67 | " # no full sentences, seek back until we are not in the middle of a word and split there\n",
68 | " while c not in \"!?.\\n \" and pos > 0 and len(current) > desired_length:\n",
69 | " c = seek(-1)\n",
70 | " commit()\n",
71 | " # check for sentence boundaries\n",
72 | " elif not in_quote and (c in \"!?\\n\" or (c == \".\" and peek(1) in \"\\n \")):\n",
73 | " # seek forward if we have consecutive boundary markers but still within the max length\n",
74 | " while (\n",
75 | " pos < len(text) - 1 and len(current) < max_length and peek(1) in \"!?.\"\n",
76 | " ):\n",
77 | " c = seek(1)\n",
78 | " split_pos.append(pos)\n",
79 | " if len(current) >= desired_length:\n",
80 | " commit()\n",
81 | "            # treat end of quote as a boundary if it's followed by a space or newline\n",
82 | " elif in_quote and peek(1) == '\"' and peek(2) in \"\\n \":\n",
83 | " seek(2)\n",
84 | " split_pos.append(pos)\n",
85 | " rv.append(current)\n",
86 | "\n",
87 | " # clean up, remove lines with only whitespace or punctuation\n",
88 | " rv = [s.strip() for s in rv]\n",
89 | " rv = [s for s in rv if len(s) > 0 and not re.match(r\"^[\\s\\.,;:!?]*$\", s)]\n",
90 | "\n",
91 | " return rv\n",
92 | "\n",
93 | "def generate_with_settings(text_prompt, semantic_temp=0.7, semantic_top_k=50, semantic_top_p=0.95, coarse_temp=0.7, coarse_top_k=50, coarse_top_p=0.95, fine_temp=0.5, voice_name=None, use_semantic_history_prompt=True, use_coarse_history_prompt=True, use_fine_history_prompt=True, output_full=False):\n",
94 | " # generation with more control\n",
95 | " x_semantic = generate_text_semantic(\n",
96 | " text_prompt,\n",
97 | " history_prompt=voice_name if use_semantic_history_prompt else None,\n",
98 | " temp=semantic_temp,\n",
99 | " top_k=semantic_top_k,\n",
100 | " top_p=semantic_top_p,\n",
101 | " )\n",
102 | "\n",
103 | " x_coarse_gen = generate_coarse(\n",
104 | " x_semantic,\n",
105 | " history_prompt=voice_name if use_coarse_history_prompt else None,\n",
106 | " temp=coarse_temp,\n",
107 | " top_k=coarse_top_k,\n",
108 | " top_p=coarse_top_p,\n",
109 | " )\n",
110 | " x_fine_gen = generate_fine(\n",
111 | " x_coarse_gen,\n",
112 | " history_prompt=voice_name if use_fine_history_prompt else None,\n",
113 | " temp=fine_temp,\n",
114 | " )\n",
115 | "\n",
116 | " if output_full:\n",
117 | " full_generation = {\n",
118 | " 'semantic_prompt': x_semantic,\n",
119 | " 'coarse_prompt': x_coarse_gen,\n",
120 | " 'fine_prompt': x_fine_gen,\n",
121 | " }\n",
122 | " return full_generation, codec_decode(x_fine_gen)\n",
123 | " return codec_decode(x_fine_gen)"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 |     "# - `[laughter]`\n",
133 | "# - `[laughs]`\n",
134 | "# - `[sighs]`\n",
135 | "# - `[music]`\n",
136 | "# - `[gasps]`\n",
137 | "# - `[clears throat]`\n",
138 | "# - `—` or `...` for hesitations\n",
139 | "# - `♪` for song lyrics"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": null,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "text = \"\"\"The Uncharted Land of Discovery: A Journey Through Time and Space\n",
149 | "[clears throat]\n",
150 | "Chapter 1: The Dawn of Curiosity\n",
151 | "[takes breath]\n",
152 | "Since the dawn of humankind, our species has been driven by a powerful force: curiosity. It is an innate, unquenchable desire to explore, understand, and unravel the mysteries of the world around us. This primal urge has led us on countless adventures, pushing us to the farthest reaches of our planet and beyond.\n",
153 | "\n",
154 | "Early humans, huddled around a flickering fire, gazed up at the night sky and wondered what those twinkling lights were. They had no idea that their curiosity would eventually propel us into the vast, uncharted realm of space. As time progressed, our ancestors began to explore their surroundings, venturing beyond their caves and settlements, driven by the need to discover what lay beyond the horizon.\n",
155 | "\n",
156 |     "Chapter 2: The Age of Exploration\n",
157 | "\n",
158 | "The Age of Exploration marked a turning point in human history, as brave souls took to the seas in search of new lands, wealth, and knowledge. Pioneers like Christopher Columbus, Vasco da Gama, and Ferdinand Magellan set sail on perilous voyages, pushing the boundaries of what was known and understood.\n",
159 | "[clears throat]\n",
160 | "These intrepid explorers discovered new continents, mapped out previously unknown territories, and encountered diverse cultures. They also established trade routes, allowing for the exchange of goods, ideas, and innovations between distant societies. The Age of Exploration was not without its dark moments, however, as conquest, colonization, and exploitation often went hand in hand with discovery.\n",
161 | "[clears throat]\n",
162 | "Chapter 3: The Scientific Revolution\n",
163 | "[laughs]\n",
164 | "The Scientific Revolution was a period of profound change, as humanity began to question long-held beliefs and seek empirical evidence. Pioneers like Galileo Galilei, Isaac Newton, and Johannes Kepler sought to understand the natural world through observation, experimentation, and reason.\n",
165 | "[sighs]\n",
166 | "Their discoveries laid the foundation for modern science, transforming the way we view the universe and our place within it. New technologies, such as the telescope and the microscope, allowed us to peer deeper into the cosmos and the microscopic world, further expanding our understanding of reality.\n",
167 | "[gasps]\n",
168 | "Chapter 4: The Information Age\n",
169 | "\n",
170 | "The Information Age, sometimes referred to as the Digital Age, has revolutionized the way we communicate, learn, and access knowledge. With the advent of the internet and personal computers, information that was once reserved for the privileged few is now available to the masses.\n",
171 | "...\n",
172 | "This democratization of knowledge has led to an explosion of innovation, as ideas and information are shared across borders and cultures at lightning speed. The Information Age has also brought new challenges, as the rapid pace of technological advancements threatens to outpace our ability to adapt and raises questions about the ethical implications of our increasingly interconnected world.\n",
173 | "[laughter]\n",
174 | "Chapter 5: The Final Frontier\n",
175 | "[clears throat]\n",
176 | "As our knowledge of the universe expands, so too does our desire to explore the cosmos. Space exploration has come a long way since the first successful satellite, Sputnik, was launched in 1957. We have landed humans on the moon, sent probes to the far reaches of our solar system, and even glimpsed distant galaxies through powerful telescopes.\n",
177 | "\n",
178 | "The future of space exploration is filled with possibilities, from establishing colonies on Mars to the search for extraterrestrial life. As we venture further into the unknown, we continue to be driven by the same curiosity that has propelled us throughout history, always seeking to uncover the secrets of the universe and our place within it.\n",
179 | "...\n",
180 | "In conclusion, the human journey is one of discovery, driven by our innate curiosity and desire to understand the world around us. From the dawn of our species to the present day, we have continued to explore, learn, and adapt, pushing the boundaries of what is known and possible. As we continue to unravel the mysteries of the cosmos, our spirit.\"\"\""
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "# download and load all models\n",
190 | "preload_models(\n",
191 | " text_use_gpu=True,\n",
192 | " text_use_small=False,\n",
193 | " coarse_use_gpu=True,\n",
194 | " coarse_use_small=False,\n",
195 | " fine_use_gpu=True,\n",
196 | " fine_use_small=False,\n",
197 | " codec_use_gpu=True,\n",
198 | " force_reload=False,\n",
199 | " path=\"models\"\n",
200 | ")"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": null,
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "# Chunk the text into smaller pieces then combine the generated audio\n",
210 | "from time import time\n",
211 | "from tqdm.auto import tqdm\n",
212 | "from IPython.display import Audio\n",
213 | "from scipy.io.wavfile import write as write_wav\n",
214 | "import os\n",
215 | "import numpy as np\n",
216 | "\n",
217 | "# generation settings\n",
218 | "voice_name = 'speaker_4'\n",
219 | "out_filepath = 'audio/audio.wav'\n",
220 | "\n",
221 | "semantic_temp = 0.7\n",
222 | "semantic_top_k = 50\n",
223 | "semantic_top_p = 0.95\n",
224 | "\n",
225 | "coarse_temp = 0.7\n",
226 | "coarse_top_k = 50\n",
227 | "coarse_top_p = 0.95\n",
228 | "\n",
229 | "fine_temp = 0.5\n",
230 | "\n",
231 | "use_semantic_history_prompt = True\n",
232 | "use_coarse_history_prompt = True\n",
233 | "use_fine_history_prompt = True\n",
234 | "\n",
235 | "use_last_generation_as_history = True\n",
236 | "\n",
237 | "texts = split_and_recombine_text(text)\n",
238 | "\n",
239 | "all_parts = []\n",
240 | "for i, text in tqdm(enumerate(texts), total=len(texts)):\n",
241 | " full_generation, audio_array = generate_with_settings(\n",
242 | " text,\n",
243 | " semantic_temp=semantic_temp,\n",
244 | " semantic_top_k=semantic_top_k,\n",
245 | " semantic_top_p=semantic_top_p,\n",
246 | " coarse_temp=coarse_temp,\n",
247 | " coarse_top_k=coarse_top_k,\n",
248 | " coarse_top_p=coarse_top_p,\n",
249 | " fine_temp=fine_temp,\n",
250 | " voice_name=voice_name,\n",
251 | " use_semantic_history_prompt=use_semantic_history_prompt,\n",
252 | " use_coarse_history_prompt=use_coarse_history_prompt,\n",
253 | " use_fine_history_prompt=use_fine_history_prompt,\n",
254 | " output_full=True\n",
255 | " )\n",
256 | " if use_last_generation_as_history:\n",
257 | " # save to npz\n",
258 | " os.makedirs('_temp', exist_ok=True)\n",
259 | " np.savez_compressed(\n",
260 | " '_temp/history.npz',\n",
261 | " semantic_prompt=full_generation['semantic_prompt'],\n",
262 | " coarse_prompt=full_generation['coarse_prompt'],\n",
263 | " fine_prompt=full_generation['fine_prompt'],\n",
264 | " )\n",
265 | " voice_name = '_temp/history.npz'\n",
266 | " all_parts.append(audio_array)\n",
267 | "\n",
268 | "audio_array = np.concatenate(all_parts, axis=-1)\n",
269 | "\n",
270 | "# save audio\n",
271 | "write_wav(out_filepath, SAMPLE_RATE, audio_array)\n",
272 | "\n",
273 | "# play audio\n",
274 | "Audio(audio_array, rate=SAMPLE_RATE)"
275 | ]
276 | }
277 | ],
278 | "metadata": {
279 | "kernelspec": {
280 | "display_name": "Python 3",
281 | "language": "python",
282 | "name": "python3"
283 | },
284 | "language_info": {
285 | "codemirror_mode": {
286 | "name": "ipython",
287 | "version": 3
288 | },
289 | "file_extension": ".py",
290 | "mimetype": "text/x-python",
291 | "name": "python",
292 | "nbconvert_exporter": "python",
293 | "pygments_lexer": "ipython3",
294 | "version": "3.10.8"
295 | },
296 | "orig_nbformat": 4
297 | },
298 | "nbformat": 4,
299 | "nbformat_minor": 2
300 | }
301 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Bark fine-tuning experiment
2 |
3 | > **Status report, 2023-05-16**: Still in the prototyping phase with the [Serp.ai](https://serp.ai/) team. Bark now takes embedded semantic history for generation, and we have prototyping code for creating a synthetic dataset of audio to token mappings. There's boilerplate for training a simple projection with MSE loss between spaces, to test E2E, but unsurprisingly it didn't work on the first try. Over the next few weeks I'll be working on the actual training: (a) hyperparameter tuning, (b) using a larger HuBERT model for embeddings, (c) making the training objective more sophisticated. Contributions and feedback welcome! Join us at the [SERP Discord](https://serp.ly/@serpai/discord).
4 |
5 | > **Warning**
6 | >
7 | > I'm a junior web dev with a grand total of four months of AI tutorials, so I could be totally "Bark"-ing up the wrong tree! Please don't hesitate to give suggestions, contribute, or correct me, that's what open source is for!
8 |
9 | This repo attempts to enable converting ground-truth audio to Bark semantic tokens (or their input embeddings). If successful, this will add the missing piece to Serp.ai's voice cloning fork, which solved coarse and fine token conversion, and enable full fine tuning - or at least get some of the way there. **My eventual goal is to merge this fork back into the main Serp.ai voice cloning fork**, if I ever get that far.
10 |
11 | For progress, please see CHANGELOG.md
12 |
13 | ## Why can't Bark be fine-tuned (yet)?
14 |
15 | Under the hood, Bark is essentially the AudioLM model (see [paper](https://arxiv.org/abs/2209.03143), [public GitHub replication](https://github.com/lucidrains/audiolm-pytorch)) + text conditioning. It's three GPTs stacked on top of each other. In AudioLM, just like GPT-3 generates text tokens from a prompt of text tokens, the first GPT takes a prompt of **semantic** tokens, which encode the _content_ of new audio and a bit of the speaker identity (that's `text_to_semantic` in Bark), and generates the "next tokens". Bark extends this with a learned embedding of the text you want to generate. In both Bark and AudioLM, the second and third GPTs (the coarse and fine models, or `semantic_to_waveform` in Bark) handle the **acoustic** tokens, which encode the finer details of the audio.
16 |
17 | 
18 |
19 | So how do you turn real audio into token prompts for these models? Mercifully, the acoustic tokens are a predefined open-source format: lower and upper layers of Facebook's [Encodec](https://github.com/facebookresearch/encodec) neural compressed encoding for audio. Serp.ai's voice cloning fork successfully converts coarse and fine token prompts this way. Unfortunately, the audio to semantic token conversion requires Bark's proprietary model, which is only used during training and not at inference. Suno has repeatedly refused to open-source this model despite many community requests including from Serp.ai, in order to ~~make money by using an unconfigurable Bark as a giant advertisement for their future proprietary platform, which guards cloning behind a paid API~~ "prevent online harms and misinformation". Instead, Suno gives out their own predefined prompts. This approach is quite similar to how the Tortoise TTS developer "de-weaponized" it for the first year of its existence: see Sherman Chann's blog post [Why can't Tortoise be fine-tuned?](https://152334h.github.io/blog/tortoise-fine-tuning/) for a writeup.
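Concretely, once ground-truth audio has been run through Encodec, building the acoustic prompts is mostly a matter of slicing codebook layers. Here's a toy sketch with fake codes, assuming the 2-coarse / 8-total codebook split used in Bark's voice prompt `.npz` files (the frame counts and codebook size are illustrative, not pulled from Bark's source):

```python
import numpy as np

# Pretend Encodec encoder output: 8 codebook layers x T frames of integer
# codes. In reality these come from running 24 kHz audio through Encodec,
# which produces roughly 75 code frames per second.
n_codebooks, n_frames = 8, 75 * 5  # ~5 seconds of audio
codes = np.random.randint(0, 1024, size=(n_codebooks, n_frames))

# Coarse prompt: only the lowest codebook layers (2 in Bark's prompt format).
coarse_prompt = codes[:2, :]
# Fine prompt: all layers, including the ones the fine model predicts.
fine_prompt = codes

print(coarse_prompt.shape, fine_prompt.shape)  # (2, 375) (8, 375)
```

The key point is that no learned model of ours is needed here: the acoustic side is just indexing into a public codec's output, which is exactly why the coarse/fine half of cloning was solvable.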
20 |
21 | Serp.ai's voice cloning fork deals with this limitation by generating semantic tokens prompted only by text, but supplying the fine and coarse prompts from the ground-truth audio. Serp's approach gets pretty far; fine and coarse are enough to get major details like speaker gender and tone of voice pretty close. However, sadly this isn't enough to nail speaker identity. Check out the `notebooks/ablation.ipynb` notebook for an informal demonstration of how much difference semantic and acoustic prompts make to the output.
22 |
23 | ## Reverse engineering the semantic token codebook
24 |
25 | Sherman Chann's blog post on Tortoise goes on to suggest "baseless speculation" on how to reverse-engineer the Tortoise codebook. By definition, the model's output is the audio for the semantic tokens it generated, and mercifully, the 50 Hz semantic tokens determine the length of the audio. So we can generate a large, diverse dataset of voice lines and save the semantic tokens for them, then train a small model to map generated audio to source tokens. Chann never ended up having to do this, since the Tortoise author foolishly left the original semantic token encoder in a not-actually-deleted HuggingFace branch. Sadly, the Bark community isn't so lucky; we'll have to do it the hard way.
26 |
27 | The `notebooks/create_dataset` notebook is a naive attempt to generate a dataset of synthetic audio to semantic tokens, in [Fairseq's](https://github.com/facebookresearch/fairseq) dataset format, so we can feed our generated audio easily into Fairseq's HuBERT implementation and get the sixth-layer embeddings. The key thing here is to generate as large and diverse a dataset as possible, but for prototyping purposes, I'm solely doing this for English using voice prompts from [Mozilla CommonVoice](https://commonvoice.mozilla.org/en/datasets) (NOT the actual audio). (As a side note, I would really appreciate someone getting the `validated.tsv` voice lines from other languages in CommonVoice, like Hindi; I don't want to download all that audio just to get the tsv and not use the audio at all).
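For reference, Fairseq's wav2vec/HuBERT tooling reads a plain `.tsv` manifest: the first line is the dataset root directory, and each following line is a relative audio path and its sample count, tab-separated. A minimal sketch of writing one (the clip names and sample counts below are made up):

```python
import os
import tempfile

# Hypothetical generated clips: (relative path, number of audio samples).
clips = [("clip_000.wav", 48000), ("clip_001.wav", 72000)]

root = tempfile.mkdtemp()
manifest_path = os.path.join(root, "train.tsv")
with open(manifest_path, "w") as f:
    f.write(root + "\n")  # first line: absolute root directory
    for rel_path, n_samples in clips:
        f.write(f"{rel_path}\t{n_samples}\n")

lines = open(manifest_path).read().splitlines()
print(lines[1])  # clip_000.wav<TAB>48000
```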
28 |
29 | The original AudioLM paper creates the audio to semantic token mapping as follows:
30 | - Take a BERT-like transformer encoder that maps audio to embeddings (for tasks like speaker recognition). AudioLM and Google use the closed-source w2v-BERT, but the open-source AudioLM repo uses [HuBERT](https://huggingface.co/docs/transformers/model_doc/hubert).
31 | - Run a bunch of source audio through HuBERT and take the embeddings from the sixth layer. HuBERT runs at 50 embeddings / second of audio.
32 | - Run k-means clustering on the embeddings to essentially produce k "groups" of kinds of input audio. For example, AudioLM uses ~500, and in a [GitHub statement](https://github.com/lucidrains/audiolm-pytorch/discussions/170), the Bark devs say they use a similar approach but with 10k groups. In what I am sure is a complete coincidence, Bark semantic tokens are 49.9 Hz, roughly the same as HuBERT's 50 Hz.
33 | - When adding new audio, assign each embedding to its nearest k-means centroid to find out "what group" the new audio is in.
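The recipe above can be sketched end-to-end with stand-in features. Random vectors here play the role of real HuBERT layer-6 embeddings (which would be 768-dimensional at 50 frames/sec for HuBERT base), and a tiny k stands in for Bark's ~10k:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for HuBERT layer-6 embeddings pooled over a training corpus.
# Real features: 768-dim, 50 per second of audio.
train_embeddings = rng.normal(size=(2000, 64))

# Cluster into k groups; each centroid index becomes one "semantic token".
k = 16
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_embeddings)

# Tokenizing new audio: nearest-centroid assignment, one token per frame.
new_embeddings = rng.normal(size=(100, 64))  # ~2 seconds at 50 Hz
tokens = kmeans.predict(new_embeddings)
```

Note that the centroids (and thus the token IDs) depend entirely on the training corpus and the random seed, which is exactly why rerunning this recipe ourselves can't reproduce Suno's codebook.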
34 |
35 | So can't we just do this semantic token codebook generation ourselves? No; as Chann points out, there's no guarantee that our own training process will generate the same groups. Instead, similar to [Mini-GPT-4](https://arxiv.org/abs/2304.10592), we're training a linear projection from a frozen HuBERT's embeddings to Bark's input embeddings for the semantic tokens, and enabling generation from embedded semantic history.
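The projection itself is tiny. As an illustration of the MSE objective on synthetic data, here it is solved in closed form with least squares; the real training loop uses gradient descent, actual HuBERT features, and Bark's embedding width (the dimensions below are toy values, not Bark's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hubert, d_bark, n_frames = 64, 96, 2000  # toy dims; real ones are larger

# Synthetic stand-ins: frozen HuBERT embeddings H, and target Bark input
# embeddings E generated from a hidden linear map plus a little noise.
H = rng.normal(size=(n_frames, d_hubert))
true_W = rng.normal(size=(d_hubert, d_bark))
E = H @ true_W + 0.01 * rng.normal(size=(n_frames, d_bark))

# The MSE-optimal linear projection, in closed form via least squares.
W, *_ = np.linalg.lstsq(H, E, rcond=None)

mse = ((H @ W - E) ** 2).mean()  # should sit near the injected noise floor
```

If a plain linear map turns out to be too weak to bridge the two embedding spaces, the obvious escalation is a small MLP, which is part of what "making the training objective more sophisticated" in the status report refers to.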
36 |
37 | Other stuff that probably needs to be done later:
38 | - Add batch inference mode for Bark, to speed up dataset generation and enable use cases like mass audiobook conversion
39 | - Write an eval harness, so we can gauge performance better than training objective loss or "playing it by ear"
40 |
41 | -------------------------------------------------------------------
42 | # Original README.md
43 |
44 |
45 | [](https://twitter.com/OnusFM)
46 | [](https://discord.gg/J2B2vsjKuE)
47 |
48 |
49 | [Examples](https://suno-ai.notion.site/Bark-Examples-5edae8b02a604b54a42244ba45ebc2e2) | [Model Card](./model-card.md) | [Playground Waitlist](https://3os84zs17th.typeform.com/suno-studio)
50 |
51 | Bark is a transformer-based text-to-audio model created by [Suno](https://suno.ai). Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
52 |
53 |
54 |
55 |