├── .env.example
├── .gitignore
├── LICENSE
├── README.md
├── docs
    ├── 1e5bad6485828197234ab8722f3f646.jpg
    ├── 20240428.jpg
    ├── 20240506.jpg
    ├── 362ac8df0334eb69d6e529f08ef401d.jpg
    ├── 90845666f8491d218695ebd3540a94e.jpg
    ├── d50300d5db9d8cc71861174fc5d33b1.jpg
    └── wechat.jpg
├── main.py
├── requirements.txt
├── run_main.bat
├── setup.py
├── voice_type
    ├── BV001_streaming.npy
    ├── BV002_streaming.npy
    ├── BV005_streaming.npy
    ├── BV007_streaming.npy
    ├── BV056_streaming.npy
    ├── BV102_streaming.npy
    ├── BV113_streaming.npy
    ├── BV119_streaming.npy
    ├── BV700_streaming.npy
    └── BV701_streaming.npy
├── youdub
    ├── __init__.py
    ├── asr_damo.py
    ├── asr_whisper.py
    ├── asr_whisperX.py
    ├── cn_tx.py
    ├── demucs_vr.py
    ├── translation.py
    ├── translation_json.py
    ├── translation_unsafe.py
    ├── tts_bytedance.py
    ├── tts_paddle.py
    ├── tts_xttsv2.py
    ├── utils.py
    └── video_postprocess.py
└── 开发.md


/.env.example:
--------------------------------------------------------------------------------
 1 | OPENAI_API_KEY = 'sk-xxx'
 2 | OPENAI_API_BASE =
 3 | # MODEL_NAME = 'gpt-4'
 4 | MODEL_NAME = 'gpt-3.5-turbo'
 5 | 
 6 | HF_TOKEN =
 7 | 
 8 | # 火山引擎
 9 | APPID = 
10 | ACCESS_TOKEN = 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | share/python-wheels/
 24 | *.egg-info/
 25 | .installed.cfg
 26 | *.egg
 27 | MANIFEST
 28 | 
 29 | # PyInstaller
 30 | #  Usually these files are written by a python script from a template
 31 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 32 | *.manifest
 33 | *.spec
 34 | 
 35 | # Installer logs
 36 | pip-log.txt
 37 | pip-delete-this-directory.txt
 38 | 
 39 | # Unit test / coverage reports
 40 | htmlcov/
 41 | .tox/
 42 | .nox/
 43 | .coverage
 44 | .coverage.*
 45 | .cache
 46 | nosetests.xml
 47 | coverage.xml
 48 | *.cover
 49 | *.py,cover
 50 | .hypothesis/
 51 | .pytest_cache/
 52 | cover/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | .pybuilder/
 76 | target/
 77 | 
 78 | # Jupyter Notebook
 79 | .ipynb_checkpoints
 80 | 
 81 | # IPython
 82 | profile_default/
 83 | ipython_config.py
 84 | 
 85 | # pyenv
 86 | #   For a library or package, you might want to ignore these files since the code is
 87 | #   intended to run in multiple environments; otherwise, check them in:
 88 | # .python-version
 89 | 
 90 | # pipenv
 91 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 92 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 93 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 94 | #   install all needed dependencies.
 95 | #Pipfile.lock
 96 | 
 97 | # poetry
 98 | #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 99 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
100 | #   commonly ignored for libraries.
101 | #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 | 
104 | # pdm
105 | #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | #   in version control.
109 | #   https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 | 
112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
113 | __pypackages__/
114 | 
115 | # Celery stuff
116 | celerybeat-schedule
117 | celerybeat.pid
118 | 
119 | # SageMath parsed files
120 | *.sage.py
121 | 
122 | # Environments
123 | .env
124 | .venv
125 | env/
126 | venv/
127 | ENV/
128 | env.bak/
129 | venv.bak/
130 | 
131 | # Spyder project settings
132 | .spyderproject
133 | .spyproject
134 | 
135 | # Rope project settings
136 | .ropeproject
137 | 
138 | # mkdocs documentation
139 | /site
140 | 
141 | # mypy
142 | .mypy_cache/
143 | .dmypy.json
144 | dmypy.json
145 | 
146 | # Pyre type checker
147 | .pyre/
148 | 
149 | # pytype static type analyzer
150 | .pytype/
151 | 
152 | # Cython debug symbols
153 | cython_debug/
154 | 
155 | # PyCharm
156 | #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
157 | #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
158 | #  and can be added to the global gitignore or merged into this file.  For a more nuclear
159 | #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/
161 | 
162 | input/
163 | output/
164 | test/
165 | finished/
166 | models/
167 | playground/
168 | finished/
169 | todo/
170 | # audio
171 | *.mp3
172 | *.wav
173 | # video
174 | *.mp4
175 | *.mkv
176 | # model
177 | *.pth
178 | Kurzgsaget/
179 | 
180 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # YouDub: 优质视频中文化工具
  3 | # 请移步 [YouDub-webui](https://github.com/liuzhao1225/YouDub-webui)
  4 | # 请移步 [YouDub-webui](https://github.com/liuzhao1225/YouDub-webui)
  5 | # 请移步 [YouDub-webui](https://github.com/liuzhao1225/YouDub-webui)
  6 | ## 目录
  7 | - [YouDub: 优质视频中文化工具](#youdub-优质视频中文化工具)
  8 |   - [目录](#目录)
  9 |   - [简介](#简介)
 10 |   - [主要特点](#主要特点)
 11 |   - [安装与使用指南](#安装与使用指南)
 12 |   - [使用步骤](#使用步骤)
 13 |   - [技术细节](#技术细节)
 14 |     - [AI 语音识别](#ai-语音识别)
 15 |     - [大型语言模型翻译](#大型语言模型翻译)
 16 |     - [AI 声音克隆](#ai-声音克隆)
 17 |     - [视频处理](#视频处理)
 18 |   - [贡献指南](#贡献指南)
 19 |   - [许可协议](#许可协议)
 20 |   - [支持与联系方式](#支持与联系方式)
 21 | 
 22 | ## 简介
 23 | `YouDub` 是一个创新的开源工具，专注于将 YouTube 等平台的优质视频翻译和配音为中文版本。此工具融合了先进的 AI 技术，包括语音识别、大型语言模型翻译以及 AI 声音克隆技术，为中文用户提供具有原始 YouTuber 音色的中文配音视频。更多示例和信息，欢迎访问我的[bilibili视频主页](https://space.bilibili.com/1263732318)。你也可以加入我们的微信群，扫描下方的[二维码](#支持与联系方式)即可。
 24 | 
 25 | ## 主要特点
 26 | - **AI 语音识别**：有效转换视频中的语音为文字。
 27 | - **大型语言模型翻译**：快速且精准地将文本翻译成中文。
 28 | - **AI 声音克隆**：生成与原视频配音相似的中文语音。
 29 | - **视频处理**：集成的功能实现音视频的同步处理。
 30 | 
 31 | ## 安装与使用指南
 32 | 1. **克隆仓库**：
 33 |    ```bash
 34 |    git clone https://github.com/liuzhao1225/YouDub.git
 35 |    ```
 36 | 2. **安装依赖**：
 37 |    进入 `YouDub` 目录并安装所需依赖：
 38 |    ```bash
 39 |    cd YouDub
 40 |    pip install -r requirements.txt
 41 |    ```
 42 |    如果出现 `paddlespeech` 相关的依赖错误，可以访问 [PaddleSpeech 官方网站](https://github.com/PaddlePaddle/PaddleSpeech#installation) 获取安装指南。
 43 | 
 44 |    默认安装为 CPU 版本的 PyTorch 如果你需要手动安装特定 CUDA 版本的 PyTorch，可以参考以下步骤：
 45 |    首先，根据你的环境和 PyTorch 的版本，从 [PyTorch 官方网站](https://pytorch.org/) 获取适用的安装命令。例如，如果你的 CUDA 版本是 11.8，你可以使用如下命令安装 PyTorch：
 46 |    ```bash
 47 |    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
 48 |    ```
 49 | 3. **环境设置**
 50 | 
 51 |       在运行程序之前，需要进行以下环境设置：
 52 | 
 53 |       **环境变量配置**：将 `.env.example` 文件改名为 `.env`，并填入相应的环境变量。以下是需要配置的环境变量：
 54 | 
 55 |       - `OPENAI_API_KEY`: OpenAI API 的密钥，一般格式为 `sk-xxx`。
 56 |       - `MODEL_NAME`: 使用的模型名称，如 'gpt-4' 或 'gpt-3.5-turbo'。对于翻译任务，使用 'gpt-3.5-turbo'足够了。'gpt-4'太贵了，一天干了我一百多刀。
 57 |       - `OPENAI_API_BASE`: 如果你使用的是 OpenAI 官方的 API，可以将此项留空。如果你有自己部署的支持 OpenAI API 的大模型，可以填入相应的 OpenAI API 访问的 base_url。
 58 |       - `HF_TOKEN`: 如果使用 speaker diarization 功能，需要提供你的 Hugging Face token，并同意 [pyannote's speaker diarization agreement](https://huggingface.co/pyannote/speaker-diarization-3.1)。`HF_TOKEN` 可以在 [Hugging Face 设置](https://huggingface.co/settings/tokens) 中获取。
 59 |       - `APPID` 和 `ACCESS_TOKEN`: 如果使用火山引擎的 TTS，需要提供火山引擎的 APPID 和 ACCESS_TOKEN，此项可能需要付费。
 60 | 
 61 |       **TTS 设置**：如果不希望使用付费的火山引擎 TTS，可以在 `main.py` 中将 `from youdub.tts_bytedance import TTS_Clone` 改为 `from youdub.tts_paddle import TTS_Clone`，但这可能会影响生成效果。
 62 | 
 63 | 4. **运行程序**：
 64 |    使用以下命令启动主程序：
 65 | 
 66 |    ```
 67 |    python main.py --input_folders /path/to/input1 /path/to/input2 --output_folders /path/to/output1 /path/to/output2 --diarize
 68 |    ```
 69 | 
 70 |    这个命令将会处理在`/path/to/input1`和`/path/to/input2`路径下的视频文件，并将处理后的视频文件存储在`/path/to/output1`和`/path/to/output2`路径下。`--diarize`选项可以开启视频文件的声纹分割功能。
 71 | 
 72 |    请注意，`--input_folders`和`--output_folders`参数都接受多个文件夹路径，每个路径之间用空格隔开。同时，输入文件夹和输出文件夹的数量必须相同，否则程序会报错。
 73 | 
 74 |    例如，如果你有两个输入文件夹和两个输出文件夹，你可以这样运行程序：
 75 | 
 76 |    ```
 77 |    python main.py --input_folders /path/to/input1 /path/to/input2 --output_folders /path/to/output1 /path/to/output2
 78 |    ```
 79 | 
 80 |    如果你只有一个输入文件夹和一个输出文件夹，你可以这样运行程序：
 81 | 
 82 |    ```
 83 |    python main.py --input_folders /path/to/input --output_folders /path/to/output
 84 |    ```
 85 | 
 86 |    如果你想在处理视频文件时开启声纹分割功能，你可以添加`--diarize`选项：
 87 | 
 88 |    ```
 89 |    python main.py --input_folders /path/to/input --output_folders /path/to/output --diarize
 90 |    ```
 91 | 
 92 |    如果在从 `huggingface` 下载模型时报错，你可以在 `.env` 文件中添加：
 93 | 
 94 |    ```
 95 |    HF_ENDPOINT=https://hf-mirror.com
 96 |    ```
 97 | 
 98 | ## 使用步骤
 99 | - 准备需要翻译的视频文件并放置于输入文件夹。
100 | - 指定输出文件夹以接收处理后的视频。
101 | - 系统将自动进行语音识别、翻译、声音克隆和视频处理。
102 |   
103 | ## 技术细节
104 | 
105 | ### AI 语音识别
106 | 目前，我们的 AI 语音识别功能是基于 [Whisper](https://github.com/openai/whisper) 实现的。Whisper 是 OpenAI 开发的一款强大的语音识别系统，能够精确地将语音转换为文本。考虑到未来的效率和性能提升，我们计划评估并可能迁移到 [WhisperX](https://github.com/m-bain/whisperX)，这是一个更高效的语音识别系统，旨在进一步提高处理速度和准确度。
107 | 
108 | ### 大型语言模型翻译
109 | 我们的翻译功能支持使用 OpenAI API 提供的各种模型，包括官方的 GPT 模型。此外，我们也在探索使用类似 [api-for-open-llm](https://github.com/xusenlinzy/api-for-open-llm) 这样的项目，以便更灵活地整合和利用不同的大型语言模型进行翻译工作。
110 | 
111 | ### AI 声音克隆
112 | 声音克隆方面，我们目前使用的是 [Paddle Speech](https://github.com/PaddlePaddle/PaddleSpeech)。虽然 Paddle Speech 提供了高质量的语音合成能力，但目前尚无法在同一句话中同时生成中文和英文。在此之前，我们也考虑过使用 [Coqui AI TTS](https://github.com/coqui-ai/TTS)，它能够进行高效的声音克隆，但同样面临一些限制。
113 | 
114 | ### 视频处理
115 | 我们的视频处理功能强调音视频的同步处理，例如确保音频与视频画面的完美对齐，以及生成准确的字幕，从而为用户提供一个无缝的观看体验。
116 | 
117 | ## 贡献指南
118 | 欢迎对 `YouDub` 进行贡献。您可以通过 GitHub Issue 或 Pull Request 提交改进建议或报告问题。
119 | 
120 | ## 许可协议
121 | `YouDub` 遵循 Apache License 2.0。使用本工具时，请确保遵守相关的法律和规定，包括版权法、数据保护法和隐私法。未经原始内容创作者和/或版权所有者许可，请勿使用此工具。
122 | 
123 | ## 支持与联系方式
124 | 如需帮助或有任何疑问，请通过 [GitHub Issues](https://github.com/liuzhao1225/YouDub/issues) 联系我们。你也可以加入我们的微信群，扫描下方的二维码即可：
125 | ![WeChat Group](docs/90845666f8491d218695ebd3540a94e.jpg)
126 | 
127 | 


--------------------------------------------------------------------------------
/docs/1e5bad6485828197234ab8722f3f646.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/1e5bad6485828197234ab8722f3f646.jpg


--------------------------------------------------------------------------------
/docs/20240428.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/20240428.jpg


--------------------------------------------------------------------------------
/docs/20240506.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/20240506.jpg


--------------------------------------------------------------------------------
/docs/362ac8df0334eb69d6e529f08ef401d.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/362ac8df0334eb69d6e529f08ef401d.jpg


--------------------------------------------------------------------------------
/docs/90845666f8491d218695ebd3540a94e.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/90845666f8491d218695ebd3540a94e.jpg


--------------------------------------------------------------------------------
/docs/d50300d5db9d8cc71861174fc5d33b1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/d50300d5db9d8cc71861174fc5d33b1.jpg


--------------------------------------------------------------------------------
/docs/wechat.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/docs/wechat.jpg


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import logging
  3 | import json
  4 | import re
  5 | import time
  6 | import numpy as np
  7 | from tqdm import tqdm
  8 | # from youdub.tts_bytedance import TTS_Clone as TTS_Clone_bytedance, audio_process_folder as audio_process_folder_bytedance
  9 | from youdub.tts_xttsv2 import TTS_Clone, audio_process_folder
 10 | from youdub.tts_bytedance import TTS_Clone as TTS_Clone_bytedance
 11 | from youdub.tts_bytedance import audio_process_folder as audio_process_folder_bytedance
 12 | from youdub.asr_whisperX import VideoProcessor
 13 | from youdub.video_postprocess import replace_audio_ffmpeg
 14 | from youdub.translation_unsafe import Translator
 15 | from youdub.utils import split_text
 16 | from multiprocessing import Process
 17 | import re
 18 | import argparse
 19 | 
 20 | allowed_chars = '[^a-zA-Z0-9_ .]'
 21 | 
 22 | 
 23 | def translate_from_folder(folder, translator: Translator, original_fname):
 24 |     with open(os.path.join(folder, 'en.json'), mode='r', encoding='utf-8') as f:
 25 |         transcript = json.load(f)
 26 |     _transcript = [sentence['text'] for sentence in transcript if sentence['text']]
 27 |     result = ['']
 28 |     while len(result) != len(_transcript):
 29 |         result, summary = translator.translate(_transcript, original_fname)
 30 |     for i, sentence in enumerate(result):
 31 |         transcript[i]['text'] = sentence
 32 |         
 33 |     transcript = split_text(transcript) # 使用whisperX后，会自动分句，所以不再需要手动分句。同时避免了将`“你好。”`分为`“你好。`和`”`的情况
 34 |     with open(os.path.join(folder, 'zh.json'), 'w', encoding='utf-8') as f:
 35 |         json.dump(transcript, f, ensure_ascii=False, indent=4)
 36 |     with open(os.path.join(folder, 'summary.txt'), 'w', encoding='utf-8') as f:
 37 |         f.write(summary)
 38 |         
 39 | # def main(input_folder, output_folder, diarize=False):
 40 | def main():
 41 |     parser = argparse.ArgumentParser(description='Process some videos.')
 42 |     parser.add_argument('--input_folders', type=str, nargs='+', required=True,
 43 |                         help='The list of input folders containing the videos')
 44 |     parser.add_argument('--output_folders', type=str, nargs='+', required=True, help='The list of output folders where the processed videos will be stored')
 45 |     parser.add_argument('--vocal_only_folders', type=str, nargs='+', default=[],
 46 |                         help='The list of input folders containing the videos that only need vocal for the final result.')
 47 |     
 48 |     parser.add_argument('--diarize', action='store_true',
 49 |                         help='Enable diarization')
 50 | 
 51 |     
 52 |     args = parser.parse_args()
 53 | 
 54 |     if len(args.input_folders) != len(args.output_folders):
 55 |         raise ValueError(
 56 |             "The number of input folders must match the number of output folders.")
 57 | 
 58 |     print('='*50)
 59 |     print('Initializing...')
 60 |     if args.diarize:
 61 |         print('Diarization enabled.')
 62 |     print('='*50)
 63 |     diarize = args.diarize
 64 |     processor = VideoProcessor(diarize=diarize)
 65 |     translator = Translator()
 66 |     tts = TTS_Clone()
 67 |     tts_bytedance = TTS_Clone_bytedance()
 68 | 
 69 |     for input_folder, output_folder in zip(args.input_folders, args.output_folders):
 70 |         if input_folder in args.vocal_only_folders:
 71 |             vocal_only = True
 72 |             print(f'Vocal only mode enabled for {input_folder}.')
 73 |         else:
 74 |             vocal_only = False
 75 |             
 76 |         if not os.path.exists(os.path.join(input_folder, '0_finished')):
 77 |             os.makedirs(os.path.join(input_folder, '0_finished'))
 78 |         if not os.path.exists(output_folder):
 79 |             os.makedirs(output_folder)
 80 |         if not os.path.exists(os.path.join(output_folder, '0_to_upload')):
 81 |             os.makedirs(os.path.join(output_folder, '0_to_upload'))
 82 |         if not os.path.exists(os.path.join(output_folder, '0_finished')):
 83 |             os.makedirs(os.path.join(output_folder, '0_finished'))
 84 |         print('='*50)
 85 |         print(
 86 |             f'Video processing started for {input_folder} to {output_folder}.')
 87 |         print('='*50)
 88 | 
 89 |         logging.info('Processing folder...')
 90 |         files = os.listdir(input_folder)
 91 |         t = tqdm(files, desc="Processing files")
 92 |         video_lists = []
 93 |         for file in t:
 94 |             print('='*50)
 95 |             t.set_description(f"Processing {file}")
 96 |             print('='*50)
 97 |             if file.endswith('.mp4') or file.endswith('.mkv') or file.endswith('.avi') or file.endswith('.flv'):
 98 |                 original_fname = file[:-4]
 99 |                 new_filename = re.sub(r'[^a-zA-Z0-9_. ]', '', file)
100 |                 new_filename = re.sub(r'\s+', ' ', new_filename)
101 |                 new_filename = new_filename.strip()
102 |                 os.rename(os.path.join(input_folder, file),
103 |                           os.path.join(input_folder, new_filename))
104 |                 file = new_filename
105 |                 video_lists.append(file)
106 |                 input_path = os.path.join(input_folder, file)
107 |                 output_path = os.path.join(output_folder, file[:-4]).strip()
108 |                 if not os.path.exists(output_path):
109 |                     os.makedirs(output_path)
110 |                 speaker_to_voice_type = processor.process_video(
111 |                     input_path, output_path)
112 |             else:
113 |                 continue
114 |             if not os.path.exists(os.path.join(output_path, 'zh.json')):
115 |                 translate_from_folder(output_path, translator, original_fname)
116 |             if len(speaker_to_voice_type) == 1:
117 |                 print('Only one speaker detected. Using TTS.')
118 |                 audio_process_folder_bytedance(
119 |                     output_path, tts_bytedance, speaker_to_voice_type, vocal_only=vocal_only)
120 |             else:
121 |                 print('Multiple speakers detected. Using XTTSv2.')
122 |                 audio_process_folder(
123 |                     output_path, tts)
124 |                 
125 |             replace_audio_ffmpeg(os.path.join(input_folder, file), os.path.join(
126 |                 output_path, 'zh.wav'),  os.path.join(output_path, 'transcript.json'), os.path.join(output_path, file))
127 |             print('='*50)
128 | 
129 |         print(
130 |             f'Video processing finished for {input_folder} to {output_folder}. {len(video_lists)} videos processed.')
131 | 
132 |         print(video_lists)
133 | if __name__ == '__main__':
134 |     # diarize = False
135 |     
136 |     # series = 'TED_Ed'
137 |     # # series = 'z_Others'
138 |     # # series = r'test'
139 |     # # series = 'Kurzgsaget'
140 |     # input_folder = os.path.join(r'input', series)
141 |     # output_folder = os.path.join(r'output', series)
142 |     # main(input_folder, output_folder, diarize=diarize)
143 |     main()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | audiostretchy==1.3.5
 2 | librosa
 3 | loguru==0.7.2
 4 | modelscope==1.10.0
 5 | moviepy==1.0.3
 6 | numpy==1.23.1
 7 | openai==0.28.0
 8 | paddlespeech==1.4.1
 9 | pydub==0.25.1
10 | python-dotenv==1.0.0
11 | Requests==2.31.0
12 | scipy==1.11.4
13 | tqdm==4.64.1
14 | openai-whisper
15 | pyannote.audio
16 | TTS
17 | git+https://github.com/facebookresearch/demucs#egg=demucs # demucs
18 | git+https://github.com/m-bain/whisperx.git # whisperx
19 | 


--------------------------------------------------------------------------------
/run_main.bat:
--------------------------------------------------------------------------------
 1 | @echo off
 2 | setlocal enabledelayedexpansion
 3 | 
 4 | set "_folders="
 5 | 
 6 | FOR /D %%G IN ("F:\YouDub\input\*") DO (
 7 |     SET "_folders=!_folders! %%~nG"
 8 | )
 9 | 
10 | @REM set "_folders=SamuelAlbanie1"
11 | 
12 | set "input_folders="
13 | set "output_folders="
14 | 
15 | for %%i in (%_folders%) do (
16 |     set "input_folders=!input_folders! input/%%i"
17 |     set "output_folders=!output_folders! output/%%i"
18 | )
19 | 
20 | echo !input_folders!
21 | echo !output_folders!
22 | 
23 | python main.py --input_folders !input_folders! --output_folders !output_folders! --diarize
24 | 
25 | endlocal
26 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/setup.py


--------------------------------------------------------------------------------
/voice_type/BV001_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV001_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV002_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV002_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV005_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV005_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV007_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV007_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV056_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV056_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV102_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV102_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV113_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV113_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV119_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV119_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV700_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV700_streaming.npy


--------------------------------------------------------------------------------
/voice_type/BV701_streaming.npy:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/voice_type/BV701_streaming.npy


--------------------------------------------------------------------------------
/youdub/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liuzhao1225/YouDub/2428ecaa8089ad433f1e292b0f6dfeded3596f86/youdub/__init__.py


--------------------------------------------------------------------------------
/youdub/asr_damo.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | 
 3 | import os
 4 | import logging
 5 | from pydub import AudioSegment
 6 | from moviepy.editor import VideoFileClip
 7 | from modelscope.pipelines import pipeline
 8 | from modelscope.utils.constant import Tasks
 9 | 
10 | os.environ["MODELSCOPE_CACHE"] = r"./models"
11 | 
12 | # 设置日志级别和格式
13 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
14 | 
15 | class VideoProcessor:
16 |     def __init__(self, model='damo/speech_paraformer_asr-en-16k-vocab4199-pytorch', model_revision="v1.0.1"):
17 |         self.asr_pipeline = pipeline(
18 |             task=Tasks.auto_speech_recognition,
19 |             model=model,
20 |             model_revision=model_revision)
21 |         
22 |         # self.punc_pipeline = pipeline(
23 |         #     task=Tasks.punctuation,
24 |         #     model=r'damo/punc_ct-transformer_cn-en-common-vocab471067-large',
25 |         #     model_revision="v1.0.0")
26 | 
27 |     def extract_audio_from_video(self, video_path, audio_path):
28 |         logging.info(f'Extracting audio from video {video_path}...')
29 |         video = VideoFileClip(video_path)
30 |         video.audio.write_audiofile(audio_path)
31 |         logging.info(f'Audio extracted and saved to {audio_path}.')
32 | 
33 |     def convert_audio_to_wav(self, audio_path, wav_path):
34 |         logging.info(f'Converting audio {audio_path} to wav...')
35 |         audio = AudioSegment.from_file(audio_path)
36 |         audio.export(wav_path, format="wav")
37 |         logging.info(f'Audio converted and saved to {wav_path}.')
38 | 
39 |     def transcribe_audio(self, wav_path):
40 |         logging.info(f'Transcribing audio {wav_path}...')
41 |         rec_result = self.asr_pipeline(audio_in=wav_path)
42 |         # rec_result = self.punc_pipeline(text_in=rec_result['text'])
43 |         # rec_result = '\n'.join([f'[{sentence["start"]} - {sentence["end"]}]: {sentence["text"]}' for sentence in rec_result['sentences']])
44 |         rec_result = rec_result['text']
45 |         logging.info('Transcription completed.')
46 |         return rec_result
47 | 
48 |     def save_transcription_to_txt(self, transcription, txt_path):
49 |         logging.info(f'Saving transcription to {txt_path}...')
50 |         with open(txt_path, 'w', encoding='utf-8') as f:
51 |             f.write(transcription)
52 |         logging.info('Transcription saved.')
53 | 
54 |     def process_video(self, video_path, audio_path, wav_path, txt_path):
55 |         logging.info('Processing video...')
56 |         self.extract_audio_from_video(video_path, audio_path)
57 |         # self.convert_audio_to_wav(audio_path, wav_path)
58 |         transcription = self.transcribe_audio(audio_path)
59 |         self.save_transcription_to_txt(transcription, txt_path)
60 |         logging.info('Video processing completed.')
61 | 
62 | # 使用示例
63 | if __name__ == '__main__':
64 |     processor = VideoProcessor()
65 |     processor.process_video('input/Kurzgesagt Channel Trailer.mp4', 'output/audio.mp3', 'output/audio.wav', 'output/transcription.txt')
66 | 


--------------------------------------------------------------------------------
/youdub/asr_whisper.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | 
  3 | import string
  4 | import os
  5 | import logging
  6 | import whisper
  7 | import json
  8 | from moviepy.editor import VideoFileClip
  9 | import sys
 10 | sys.path.append(os.getcwd())
 11 | 
 12 | # from vocal_remover.inference import 
 13 | from .demucs_vr import Demucs
 14 | 
 15 | 
 16 | # 设置日志级别和格式
 17 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 18 |     
 19 | def merge_segments(transcription, ending=string.punctuation):
 20 |     merged_transcription = []
 21 |     buffer_segment = None
 22 | 
 23 |     for segment in transcription:
 24 |         if buffer_segment is None:
 25 |             buffer_segment = segment
 26 |         else:
 27 |             # Check if the last character of the 'text' field is a punctuation mark
 28 |             if buffer_segment['text'][-1] in ending:
 29 |                 # If it is, add the buffered segment to the merged transcription
 30 |                 merged_transcription.append(buffer_segment)
 31 |                 buffer_segment = segment
 32 |             else:
 33 |                 # If it's not, merge this segment with the buffered segment
 34 |                 buffer_segment['text'] += ' ' + segment['text']
 35 |                 buffer_segment['end'] = segment['end']
 36 | 
 37 |     # Don't forget to add the last buffered segment
 38 |     if buffer_segment is not None:
 39 |         merged_transcription.append(buffer_segment)
 40 | 
 41 |     return merged_transcription
 42 | 
 43 | class VideoProcessor:
 44 |     def __init__(self, model='large', download_root='models/ASR/whisper'):
 45 |         logging.info(f'Loading model {model} from {download_root}...')
 46 |         self.model = whisper.load_model(model, download_root=download_root)
 47 |         self.vocal_remover = Demucs(model='htdemucs_ft')
 48 |         logging.info('Model loaded.')
 49 | 
 50 |     def transcribe_audio(self, wav_path):
 51 |         logging.debug(f'Transcribing audio {wav_path}...')
 52 |         rec_result = self.model.transcribe(
 53 |             wav_path, verbose=True, condition_on_previous_text=False, max_initial_timestamp=None)
 54 |         return rec_result
 55 |     
 56 |     def extract_audio_from_video(self, video_path, audio_path):
 57 |         logging.info(f'Extracting audio from video {video_path}...')
 58 |         video = VideoFileClip(video_path)
 59 |         video.audio.write_audiofile(audio_path)
 60 |         output_dir = os.path.dirname(audio_path)
 61 |         if not os.path.exists(os.path.join(output_dir, 'en_Vocals.wav')) or not os.path.exists(os.path.join(output_dir, 'en_Instruments.wav')):
 62 |             self.vocal_remover.inference(audio_path, os.path.dirname(audio_path))
 63 |         logging.info(f'Audio extracted and saved to {audio_path}.')
 64 |     
 65 |     
 66 | 
 67 |     def save_transcription_to_json(self, transcription, json_path):
 68 |         logging.debug(f'Saving transcription to {json_path}...')
 69 |         transcription_with_timestemp = [{'start': round(segment['start'], 3), 'end': round(segment['end'], 3), 'text': segment['text'].strip()} for segment in transcription['segments'] if segment['text'] != '']
 70 |         
 71 |         transcription_with_timestemp = merge_segments(transcription_with_timestemp)
 72 |         with open(json_path.replace('en.json', 'subtitle.json'), 'w', encoding='utf-8') as f:
 73 |             # f.write(transcription_with_timestemp)
 74 |             json.dump(
 75 |                 transcription_with_timestemp, f, ensure_ascii=False, indent=4)
 76 |         
 77 |         transcription_with_timestemp = merge_segments(
 78 |             transcription_with_timestemp, ending='.?!。？！')
 79 |         with open(json_path, 'w', encoding='utf-8') as f:
 80 |             # f.write(transcription_with_timestemp)
 81 |             json.dump(
 82 |                 transcription_with_timestemp, f, ensure_ascii=False, indent=8)
 83 |             
 84 |         logging.debug('Transcription saved.')
 85 | 
 86 |     def process_video(self, video_path, output_folder):
 87 |         logging.debug('Processing video...')
 88 |         if not os.path.exists(output_folder):
 89 |             os.makedirs(output_folder)
 90 |         self.extract_audio_from_video(video_path, os.path.join(output_folder, 'en.wav'))
 91 |         if not os.path.exists(os.path.join(output_folder, 'en.json')):
 92 |             transcription = self.transcribe_audio(
 93 |                 os.path.join(output_folder, 'en_Vocals.wav'))
 94 |             self.save_transcription_to_json(
 95 |                 transcription, os.path.join(output_folder, 'en.json'))
 96 |         logging.debug('Video processing completed.')
 97 |         
 98 | 
 99 | 
100 | # 使用示例
101 | if __name__ == '__main__':
102 |     processor = VideoProcessor()
103 |     folder = 'What if you experienced every human life in history_'
104 |     result = processor.transcribe_audio(
105 |         f'output/{folder}/en_Vocals.wav')
106 |     with open(f'output/{folder}/en_without_condition.json', 'w', encoding='utf-8') as f:
107 |         json.dump(result, f, ensure_ascii=False, indent=4)
108 |     # processor.replace_audio(r'input\Kurzgesagt Channel Trailer.mp4', r'output\Kurzgesagt Channel Trailer\zh.wav',
109 |     # r'output\Kurzgesagt Channel Trailer\zh.json',
110 |     # r'output\Kurzgesagt Channel Trailer\Kurzgesagt Channel Trailer.mp4')
111 | 
112 |     # with open(r'output\Ancient Life as Old as the Universe\en.json', 'r', encoding='utf-8') as f:
113 |     #     transcript = json.load(f)
114 |         
115 |     # merged_transcript = merge_segments(transcript)
116 |     # print(merged_transcript[:5])
117 |     # with open(r'output\Ancient Life as Old as the Universe\zh.json', 'w', encoding='utf-8') as f:
118 |     #     json.dump(merged_transcript, f, ensure_ascii=False, indent=4)
119 | 


--------------------------------------------------------------------------------
/youdub/asr_whisperX.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | from pyannote.audio import Inference, Model
  3 | from moviepy.editor import VideoFileClip
  4 | from scipy.spatial.distance import cosine
  5 | 
  6 | import numpy as np
  7 | import soundfile as sf
  8 | import sys, os
  9 | sys.path.append(os.getcwd())
 10 | import whisperx
 11 | import whisper
 12 | from youdub.demucs_vr import Demucs
 13 | import string
 14 | import os
 15 | import logging
 16 | import json
 17 | 
 18 | from dotenv import load_dotenv
 19 | load_dotenv()
 20 | 
 21 | # 设置日志级别和格式
 22 | logging.basicConfig(level=logging.INFO,
 23 |                     format='%(asctime)s - %(levelname)s - %(message)s')
 24 | 
 25 | 
 26 | def merge_segments(transcription, ending=string.punctuation):
 27 |     merged_transcription = []
 28 |     buffer_segment = None
 29 | 
 30 |     for segment in transcription:
 31 |         if buffer_segment is None:
 32 |             buffer_segment = segment
 33 |         else:
 34 |             # Check if the last character of the 'text' field is a punctuation mark
 35 |             if buffer_segment['text'][-1] in ending:
 36 |                 # If it is, add the buffered segment to the merged transcription
 37 |                 merged_transcription.append(buffer_segment)
 38 |                 buffer_segment = segment
 39 |             else:
 40 |                 # If it's not, merge this segment with the buffered segment
 41 |                 buffer_segment['text'] += ' ' + segment['text']
 42 |                 buffer_segment['end'] = segment['end']
 43 | 
 44 |     # Don't forget to add the last buffered segment
 45 |     if buffer_segment is not None:
 46 |         merged_transcription.append(buffer_segment)
 47 | 
 48 |     return merged_transcription
 49 | 
 50 | 
 51 | class VideoProcessor:
 52 |     def __init__(self, model='large', download_root='models/ASR/whisper', device='cuda', batch_size=32, diarize=False):
 53 |         logging.info(f'Loading model {model} from {download_root}...')
 54 |         self.device = device
 55 |         self.batch_size = batch_size
 56 |         self.model = model
 57 |         # self.model = whisperx.load_model(model, download_root=download_root, device=device)
 58 |         if model == 'large-v3':
 59 |             self.whisper_model = whisper.load_model(model, download_root=download_root, device=device) # whisperx doesn't support large-v3 yet, so use whisper instead
 60 |         else:
 61 |             self.whisper_model = whisperx.load_model(model, download_root=download_root, device=device)
 62 |         self.diarize = diarize
 63 |         if self.diarize:
 64 |             self.diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.getenv('HF_TOKEN'), device=device)
 65 |             self.embedding_model = Model.from_pretrained("pyannote/embedding", use_auth_token=os.getenv('HF_TOKEN'))
 66 |             self.embedding_inference = Inference(
 67 |                 self.embedding_model, window="whole")
 68 |             self.voice_type_embedding = dict()
 69 |             voice_type_folder = r'voice_type'
 70 |             for file in os.listdir(voice_type_folder):
 71 |                 if file.endswith('.npy'):
 72 |                     voice_type = file.replace('.npy', '')
 73 |                     embedding = np.load(os.path.join(voice_type_folder, file))
 74 |                     self.voice_type_embedding[voice_type] = embedding
 75 |             logging.info(f'Loaded {len(self.voice_type_embedding)} voice types.')
 76 | 
 77 |         self.language_code = 'en'
 78 |         self.align_model, self.meta_data = whisperx.load_align_model(language_code=self.language_code, device=device)
 79 |         self.vocal_remover = Demucs(model='htdemucs_ft')
 80 |         logging.info('Model loaded.')
 81 | 
 82 |     def transcribe_audio(self, wav_path):
 83 |         logging.debug(f'Transcribing audio {wav_path}...')
 84 |         if self.model == 'large-v3':
 85 |             rec_result = self.whisper_model.transcribe(
 86 |                 wav_path, verbose=True, condition_on_previous_text=True, max_initial_timestamp=None)
 87 |         else:
 88 |             rec_result = self.whisper_model.transcribe(
 89 |                 wav_path, batch_size=self.batch_size, print_progress=True, combined_progress=True)
 90 |             
 91 |         if rec_result['language'] == 'nn':
 92 |             return None
 93 |         if rec_result['language'] != self.language_code:
 94 |             self.language_code = rec_result['language']
 95 |             print(self.language_code)
 96 |             self.align_model, self.meta_data = whisperx.load_align_model(language_code=self.language_code, device=self.device)
 97 |             
 98 |         rec_result = whisperx.align(rec_result['segments'], self.align_model, self.meta_data, wav_path, self.device, return_char_alignments=False, print_progress=True)
 99 |         return rec_result
100 |     
101 |     def diarize_transcribed_audio(self, wav_path, transcribe_result):
102 |         logging.info(f'Diarizing audio {wav_path}...')
103 |         diarize_segments = self.diarize_model(wav_path)
104 |         result = whisperx.assign_word_speakers(
105 |             diarize_segments, transcribe_result)
106 |         return result
107 |     
108 |     def get_speaker_embedding(self, json_path):
109 |         with open(json_path, 'r', encoding='utf-8') as f:
110 |             result = json.load(f)
111 |         wav_folder = os.path.dirname(json_path)
112 |         wav_path = os.path.join(wav_folder, 'en_Vocals.wav')
113 |         audio_data, samplerate = sf.read(wav_path)
114 |         speaker_dict = dict()
115 |         length = len(audio_data)
116 |         delay = 0.1
117 |         for segment in result:
118 |             start = max(0, int((segment['start'] - delay) * samplerate))
119 |             end = min(int((segment['end']+delay) * samplerate), length)
120 |             speaker_segment_audio = audio_data[start:end]
121 |             speaker_dict[segment['speaker']] = np.concatenate((speaker_dict.get(
122 |                 segment['speaker'], np.zeros((0,2))),speaker_segment_audio))
123 |         speaker_folder = os.path.join(wav_folder, 'SPEAKER')
124 |         if not os.path.exists(speaker_folder):
125 |             os.makedirs(speaker_folder)
126 |         for speaker, audio in speaker_dict.items():
127 |             speaker_file_path = os.path.join(
128 |                 speaker_folder, f"{speaker}.wav")
129 |             sf.write(speaker_file_path, audio, samplerate)
130 |             
131 |         for file in os.listdir(speaker_folder):
132 |             if file.startswith('SPEAKER') and file.endswith('.wav'):
133 |                 wav_path = os.path.join(speaker_folder, file)
134 |                 embedding = self.embedding_inference(wav_path)
135 |                 np.save(wav_path.replace('.wav', '.npy'), embedding)
136 |                 
137 |     def find_closest_unique_voice_type(self, speaker_embedding):
138 |         speaker_to_voice_type = {}
139 |         available_speakers = set(speaker_embedding.keys())
140 |         available_voice_types = set(self.voice_type_embedding.keys())
141 | 
142 |         while available_speakers and available_voice_types:
143 |             min_distance = float('inf')
144 |             closest_speaker = None
145 |             closest_voice_type = None
146 | 
147 |             for speaker in available_speakers:
148 |                 sp_embedding = speaker_embedding[speaker]
149 |                 for voice_type in available_voice_types:
150 |                     vt_embedding = self.voice_type_embedding[voice_type]
151 |                     distance = cosine(sp_embedding, vt_embedding)
152 | 
153 |                     if distance < min_distance:
154 |                         min_distance = distance
155 |                         closest_speaker = speaker
156 |                         closest_voice_type = voice_type
157 | 
158 |             if closest_speaker and closest_voice_type:
159 |                 speaker_to_voice_type[closest_speaker] = closest_voice_type
160 |                 available_speakers.remove(closest_speaker)
161 |                 available_voice_types.remove(closest_voice_type)
162 | 
163 |         return speaker_to_voice_type
164 | 
165 |     def get_speaker_to_voice_type_dict(self, json_path):
166 |         self.get_speaker_embedding(json_path)
167 |         wav_folder = os.path.dirname(json_path)
168 |         speaker_folder = os.path.join(wav_folder, 'SPEAKER')
169 |         speaker_embedding = dict()
170 |         for file in os.listdir(speaker_folder):
171 |             if file.startswith('SPEAKER') and file.endswith('.npy'):
172 |                 speaker_name = file.replace('.npy', '')
173 |                 embedding = np.load(os.path.join(speaker_folder, file))
174 |                 speaker_embedding[speaker_name] = embedding
175 | 
176 |         return self.find_closest_unique_voice_type(speaker_embedding)
177 |     
178 |     def extract_audio_from_video(self, video_path, audio_path):
179 |         logging.info(f'Extracting audio from video {video_path}...')
180 |         video = VideoFileClip(video_path)
181 |         video.audio.write_audiofile(audio_path)
182 |         output_dir = os.path.dirname(audio_path)
183 |         if not os.path.exists(os.path.join(output_dir, 'en_Vocals.wav')) or not os.path.exists(os.path.join(output_dir, 'en_Instruments.wav')):
184 |             self.vocal_remover.inference(
185 |                 audio_path, os.path.dirname(audio_path))
186 |         logging.info(f'Audio extracted and saved to {audio_path}.')
187 | 
188 |     def save_transcription_to_json(self, transcription, json_path):
189 |         logging.debug(f'Saving transcription to {json_path}...')
190 |         if transcription is None:
191 |             transcription_with_timestemp = []
192 |         else:
193 |             transcription_with_timestemp = [{'start': round(segment['start'], 3), 'end': round(
194 |             segment['end'], 3), 'text': segment['text'].strip(), 'speaker': segment.get('speaker', 'SPEAKER_00')} for segment in transcription['segments'] if segment['text'] != '']
195 | 
196 |         transcription_with_timestemp = merge_segments(
197 |             transcription_with_timestemp)
198 |         with open(json_path.replace('en.json', 'subtitle.json'), 'w', encoding='utf-8') as f:
199 |             # f.write(transcription_with_timestemp)
200 |             json.dump(
201 |                 transcription_with_timestemp, f, ensure_ascii=False, indent=4)
202 | 
203 |         transcription_with_timestemp = merge_segments(
204 |             transcription_with_timestemp, ending='.?!。？！')
205 |         with open(json_path, 'w', encoding='utf-8') as f:
206 |             # f.write(transcription_with_timestemp)
207 |             json.dump(
208 |                 transcription_with_timestemp, f, ensure_ascii=False, indent=8)
209 | 
210 |         logging.debug('Transcription saved.')
211 | 
212 |     def process_video(self, video_path, output_folder):
213 |         logging.debug('Processing video...')
214 |         if not os.path.exists(output_folder):
215 |             os.makedirs(output_folder)
216 |         if not os.path.exists(os.path.join(output_folder, 'en_Vocals.wav')):
217 |             self.extract_audio_from_video(video_path, os.path.join(output_folder, 'en.wav'))
218 |         if not os.path.exists(os.path.join(output_folder, 'en.json')):
219 |             transcription = self.transcribe_audio(
220 |                 os.path.join(output_folder, 'en_Vocals.wav'))
221 |             if self.diarize:
222 |                 transcription = self.diarize_transcribed_audio(
223 |                     os.path.join(output_folder, 'en.wav'), transcription)
224 |             self.save_transcription_to_json(
225 |                 transcription, os.path.join(output_folder, 'en.json'))
226 |         if not os.path.exists(os.path.join(output_folder, 'speaker_to_voice_type.json')):
227 |             if self.diarize:
228 |                 speaker_to_voice_type = self.get_speaker_to_voice_type_dict(
229 |                     os.path.join(output_folder, 'en.json'))
230 |                 with open(os.path.join(output_folder, 'speaker_to_voice_type.json'), 'w', encoding='utf-8') as f:
231 |                     json.dump(speaker_to_voice_type, f, ensure_ascii=False, indent=4)
232 |             else:
233 |                 speaker_to_voice_type = {'SPEAKER_00': 'BV701_streaming'}
234 |         else:
235 |             with open(os.path.join(output_folder, 'speaker_to_voice_type.json'), 'r', encoding='utf-8') as f:
236 |                 speaker_to_voice_type = json.load(f)
237 |         logging.debug('Video processing completed.')
238 |         return speaker_to_voice_type
239 | 
240 | 
241 | # 使用示例
242 | if __name__ == '__main__':
243 |     processor = VideoProcessor(diarize=True)
244 |     folder = r'output\z_Others\Handson with Gemini_ Interacting with multimodal AI'
245 |     # result = processor.transcribe_audio(
246 |     #     f'{folder}/en_Vocals.wav')
247 |     # with open(f'{folder}/en.json', 'w', encoding='utf-8') as f:
248 |     #     json.dump(result, f, ensure_ascii=False, indent=4)
249 |     # result = processor.diarize_transcribed_audio(
250 |     #     f'{folder}/en.wav', result)
251 |     # with open(f'{folder}/en_diarize.json', 'w', encoding='utf-8') as f:
252 |     #     json.dump(result, f, ensure_ascii=False, indent=4)
253 |     processor.process_video(r'input\z_Others\Handson with Gemini_ Interacting with multimodal AI\Handson with Gemini_ Interacting with multimodal AI.mp4', folder)


--------------------------------------------------------------------------------
/youdub/cn_tx.py:
--------------------------------------------------------------------------------
   1 | #!/usr/bin/env python3
   2 | # coding=utf-8
   3 | 
   4 | # Authors:
   5 | #   2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
   6 | #   2019.9 - 2022 Jiayu DU
   7 | #
   8 | # requirements:
   9 | #   - python 3.X
  10 | # notes: python 2.X WILL fail or produce misleading results
  11 | 
  12 | import sys
  13 | import os
  14 | import argparse
  15 | import string
  16 | import re
  17 | import csv
  18 | 
  19 | # ================================================================================ #
  20 | #                                    basic constant
  21 | # ================================================================================ #
  22 | CHINESE_DIGIS = u'零一二三四五六七八九'
  23 | BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
  24 | BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
  25 | SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
  26 | SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
  27 | LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
  28 | LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
  29 | SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
  30 | SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
  31 | 
  32 | ZERO_ALT = u'〇'
  33 | ONE_ALT = u'幺'
  34 | TWO_ALTS = [u'两', u'兩']
  35 | 
  36 | POSITIVE = [u'正', u'正']
  37 | NEGATIVE = [u'负', u'負']
  38 | POINT = [u'点', u'點']
  39 | # PLUS = [u'加', u'加']
  40 | # SIL = [u'杠', u'槓']
  41 | 
  42 | FILLER_CHARS = ['呃', '啊']
  43 | 
  44 | ER_WHITELIST = '(儿女|儿子|儿孙|女儿|儿媳|妻儿|' \
  45 |     '胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|' \
  46 |     '儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|' \
  47 |     '佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)'
  48 | ER_WHITELIST_PATTERN = re.compile(ER_WHITELIST)
  49 | 
  50 | # 中文数字系统类型
  51 | NUMBERING_TYPES = ['low', 'mid', 'high']
  52 | 
  53 | CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
  54 |                  '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
  55 | CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
  56 | COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
  57 |                   '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
  58 |                   '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
  59 |                   '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
  60 |                   '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
  61 |                   '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
  62 | 
  63 | 
  64 | # Punctuation information are based on Zhon project (https://github.com/tsroten/zhon.git)
  65 | CN_PUNCS_STOP = '！？｡。'
  66 | CN_PUNCS_NONSTOP = '＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏·〈〉-'
  67 | CN_PUNCS = CN_PUNCS_STOP + CN_PUNCS_NONSTOP
  68 | 
  69 | PUNCS = CN_PUNCS + string.punctuation
  70 | PUNCS_TRANSFORM = str.maketrans(
  71 |     PUNCS, ' ' * len(PUNCS), '')  # replace puncs with space
  72 | 
  73 | 
  74 | # https://zh.wikipedia.org/wiki/全行和半行
  75 | QJ2BJ = {
  76 |     '　': ' ',
  77 |     '！': '!',
  78 |     '＂': '"',
  79 |     '＃': '#',
  80 |     '＄': '$',
  81 |     '％': '%',
  82 |     '＆': '&',
  83 |     '＇': "'",
  84 |     '（': '(',
  85 |     '）': ')',
  86 |     '＊': '*',
  87 |     '＋': '+',
  88 |     '，': ',',
  89 |     '－': '-',
  90 |     '．': '.',
  91 |     '／': '/',
  92 |     '０': '0',
  93 |     '１': '1',
  94 |     '２': '2',
  95 |     '３': '3',
  96 |     '４': '4',
  97 |     '５': '5',
  98 |     '６': '6',
  99 |     '７': '7',
 100 |     '８': '8',
 101 |     '９': '9',
 102 |     '：': ':',
 103 |     '；': ';',
 104 |     '＜': '<',
 105 |     '＝': '=',
 106 |     '＞': '>',
 107 |     '？': '?',
 108 |     '＠': '@',
 109 |     'Ａ': 'A',
 110 |     'Ｂ': 'B',
 111 |     'Ｃ': 'C',
 112 |     'Ｄ': 'D',
 113 |     'Ｅ': 'E',
 114 |     'Ｆ': 'F',
 115 |     'Ｇ': 'G',
 116 |     'Ｈ': 'H',
 117 |     'Ｉ': 'I',
 118 |     'Ｊ': 'J',
 119 |     'Ｋ': 'K',
 120 |     'Ｌ': 'L',
 121 |     'Ｍ': 'M',
 122 |     'Ｎ': 'N',
 123 |     'Ｏ': 'O',
 124 |     'Ｐ': 'P',
 125 |     'Ｑ': 'Q',
 126 |     'Ｒ': 'R',
 127 |     'Ｓ': 'S',
 128 |     'Ｔ': 'T',
 129 |     'Ｕ': 'U',
 130 |     'Ｖ': 'V',
 131 |     'Ｗ': 'W',
 132 |     'Ｘ': 'X',
 133 |     'Ｙ': 'Y',
 134 |     'Ｚ': 'Z',
 135 |     '［': '[',
 136 |     '＼': '\\',
 137 |     '］': ']',
 138 |     '＾': '^',
 139 |     '＿': '_',
 140 |     '｀': '`',
 141 |     'ａ': 'a',
 142 |     'ｂ': 'b',
 143 |     'ｃ': 'c',
 144 |     'ｄ': 'd',
 145 |     'ｅ': 'e',
 146 |     'ｆ': 'f',
 147 |     'ｇ': 'g',
 148 |     'ｈ': 'h',
 149 |     'ｉ': 'i',
 150 |     'ｊ': 'j',
 151 |     'ｋ': 'k',
 152 |     'ｌ': 'l',
 153 |     'ｍ': 'm',
 154 |     'ｎ': 'n',
 155 |     'ｏ': 'o',
 156 |     'ｐ': 'p',
 157 |     'ｑ': 'q',
 158 |     'ｒ': 'r',
 159 |     'ｓ': 's',
 160 |     'ｔ': 't',
 161 |     'ｕ': 'u',
 162 |     'ｖ': 'v',
 163 |     'ｗ': 'w',
 164 |     'ｘ': 'x',
 165 |     'ｙ': 'y',
 166 |     'ｚ': 'z',
 167 |     '｛': '{',
 168 |     '｜': '|',
 169 |     '｝': '}',
 170 |     '～': '~',
 171 | }
 172 | QJ2BJ_TRANSFORM = str.maketrans(
 173 |     ''.join(QJ2BJ.keys()), ''.join(QJ2BJ.values()), '')
 174 | 
 175 | 
 176 | # 2013 China National Standard: https://zh.wikipedia.org/wiki/通用规范汉字表, raw resources:
 177 | #   https://github.com/mozillazg/pinyin-data/blob/master/kMandarin_8105.txt with 8105 chinese chars in total
 178 | CN_CHARS_COMMON = (
 179 |     '一丁七万丈三上下不与丏丐丑专且丕世丘丙业丛东丝丞丢两严丧个丫中丰串临丸丹为主丽举'
 180 |     '乂乃久么义之乌乍乎乏乐乒乓乔乖乘乙乜九乞也习乡书乩买乱乳乸乾了予争事二亍于亏云互'
 181 |     '亓五井亘亚些亟亡亢交亥亦产亨亩享京亭亮亲亳亵亶亸亹人亿什仁仂仃仄仅仆仇仉今介仍从'
 182 |     '仑仓仔仕他仗付仙仝仞仟仡代令以仨仪仫们仰仲仳仵件价任份仿企伈伉伊伋伍伎伏伐休众优'
 183 |     '伙会伛伞伟传伢伣伤伥伦伧伪伫伭伯估伲伴伶伸伺似伽伾佁佃但位低住佐佑体何佖佗佘余佚'
 184 |     '佛作佝佞佟你佣佤佥佩佬佯佰佳佴佶佸佺佻佼佽佾使侁侂侃侄侈侉例侍侏侑侔侗侘供依侠侣'
 185 |     '侥侦侧侨侩侪侬侮侯侴侵侹便促俄俅俊俍俎俏俐俑俗俘俙俚俜保俞俟信俣俦俨俩俪俫俭修俯'
 186 |     '俱俳俵俶俸俺俾倌倍倏倒倓倔倕倘候倚倜倞借倡倥倦倧倨倩倪倬倭倮倴债倻值倾偁偃假偈偌'
 187 |     '偎偏偓偕做停偡健偬偭偰偲偶偷偻偾偿傀傃傅傈傉傍傒傕傣傥傧储傩催傲傺傻僇僎像僔僖僚'
 188 |     '僦僧僬僭僮僰僳僵僻儆儇儋儒儡儦儳儴儿兀允元兄充兆先光克免兑兔兕兖党兜兢入全八公六'
 189 |     '兮兰共关兴兵其具典兹养兼兽冀冁内冈冉册再冏冒冔冕冗写军农冠冢冤冥冬冮冯冰冱冲决况'
 190 |     '冶冷冻冼冽净凄准凇凉凋凌减凑凓凘凛凝几凡凤凫凭凯凰凳凶凸凹出击凼函凿刀刁刃分切刈'
 191 |     '刊刍刎刑划刖列刘则刚创初删判刨利别刬刭刮到刳制刷券刹刺刻刽刿剀剁剂剃剅削剋剌前剐'
 192 |     '剑剔剕剖剜剞剟剡剥剧剩剪副割剽剿劁劂劄劈劐劓力劝办功加务劢劣动助努劫劬劭励劲劳劼'
 193 |     '劾势勃勇勉勋勍勐勒勔勖勘勚募勠勤勰勺勾勿匀包匆匈匍匏匐匕化北匙匜匝匠匡匣匦匪匮匹'
 194 |     '区医匼匾匿十千卅升午卉半华协卑卒卓单卖南博卜卞卟占卡卢卣卤卦卧卫卬卮卯印危即却卵'
 195 |     '卷卸卺卿厂厄厅历厉压厌厍厕厖厘厚厝原厢厣厥厦厨厩厮去厾县叁参叆叇又叉及友双反发叔'
 196 |     '叕取受变叙叚叛叟叠口古句另叨叩只叫召叭叮可台叱史右叵叶号司叹叻叼叽吁吃各吆合吉吊'
 197 |     '同名后吏吐向吒吓吕吖吗君吝吞吟吠吡吣否吧吨吩含听吭吮启吱吲吴吵吸吹吻吼吽吾呀呃呆'
 198 |     '呇呈告呋呐呒呓呔呕呖呗员呙呛呜呢呣呤呦周呱呲味呵呶呷呸呻呼命咀咂咄咆咇咉咋和咍咎'
 199 |     '咏咐咒咔咕咖咙咚咛咝咡咣咤咥咦咧咨咩咪咫咬咯咱咳咴咸咺咻咽咿哀品哂哃哄哆哇哈哉哌'
 200 |     '响哎哏哐哑哒哓哔哕哗哙哚哝哞哟哢哥哦哧哨哩哪哭哮哱哲哳哺哼哽哿唁唆唇唉唏唐唑唔唛'
 201 |     '唝唠唢唣唤唧唪唬售唯唰唱唳唵唷唼唾唿啁啃啄商啉啊啐啕啖啜啡啤啥啦啧啪啫啬啭啮啰啴'
 202 |     '啵啶啷啸啻啼啾喀喁喂喃善喆喇喈喉喊喋喏喑喔喘喙喜喝喟喤喧喱喳喵喷喹喻喽喾嗄嗅嗉嗌'
 203 |     '嗍嗐嗑嗒嗓嗔嗖嗜嗝嗞嗟嗡嗣嗤嗥嗦嗨嗪嗫嗬嗯嗲嗳嗵嗷嗽嗾嘀嘁嘈嘉嘌嘎嘏嘘嘚嘛嘞嘟嘡'
 204 |     '嘣嘤嘧嘬嘭嘱嘲嘴嘶嘹嘻嘿噀噂噇噌噍噎噔噗噘噙噜噢噤器噩噪噫噬噱噶噻噼嚄嚅嚆嚎嚏嚓'
 205 |     '嚚嚣嚭嚯嚷嚼囊囔囚四回囟因囡团囤囫园困囱围囵囷囹固国图囿圃圄圆圈圉圊圌圐圙圜土圢'
 206 |     '圣在圩圪圫圬圭圮圯地圲圳圹场圻圾址坂均坉坊坋坌坍坎坏坐坑坒块坚坛坜坝坞坟坠坡坤坥'
 207 |     '坦坨坩坪坫坬坭坯坰坳坷坻坼坽垂垃垄垆垈型垌垍垎垏垒垓垕垙垚垛垞垟垠垡垢垣垤垦垧垩'
 208 |     '垫垭垮垯垱垲垴垵垸垺垾垿埂埃埆埇埋埌城埏埒埔埕埗埘埙埚埝域埠埤埪埫埭埯埴埵埸培基'
 209 |     '埼埽堂堃堆堇堉堋堌堍堎堐堑堕堙堞堠堡堤堧堨堪堰堲堵堼堽堾塄塅塆塌塍塑塔塘塝塞塥填'
 210 |     '塬塱塾墀墁境墅墈墉墐墒墓墕墘墙墚增墟墡墣墦墨墩墼壁壅壑壕壤士壬壮声壳壶壸壹处备复'
 211 |     '夏夐夔夕外夙多夜够夤夥大天太夫夬夭央夯失头夷夸夹夺夼奁奂奄奇奈奉奋奎奏契奓奔奕奖'
 212 |     '套奘奚奠奡奢奥奭女奴奶奸她好妁如妃妄妆妇妈妊妍妒妓妖妗妘妙妞妣妤妥妧妨妩妪妫妭妮'
 213 |     '妯妲妹妻妾姆姈姊始姐姑姒姓委姗姘姚姜姝姞姣姤姥姨姬姮姱姶姹姻姽姿娀威娃娄娅娆娇娈'
 214 |     '娉娌娑娓娘娜娟娠娣娥娩娱娲娴娵娶娼婀婆婉婊婌婍婕婘婚婞婠婢婤婧婪婫婳婴婵婶婷婺婻'
 215 |     '婼婿媂媄媆媒媓媖媚媛媞媪媭媱媲媳媵媸媾嫁嫂嫄嫉嫌嫒嫔嫕嫖嫘嫚嫜嫠嫡嫣嫦嫩嫪嫫嫭嫱'
 216 |     '嫽嬉嬖嬗嬛嬥嬬嬴嬷嬿孀孅子孑孓孔孕孖字存孙孚孛孜孝孟孢季孤孥学孩孪孬孰孱孳孵孺孽'
 217 |     '宁它宄宅宇守安宋完宏宓宕宗官宙定宛宜宝实宠审客宣室宥宦宧宪宫宬宰害宴宵家宸容宽宾'
 218 |     '宿寁寂寄寅密寇富寐寒寓寝寞察寡寤寥寨寮寰寸对寺寻导寿封射将尉尊小少尔尕尖尘尚尜尝'
 219 |     '尢尤尥尧尨尪尬就尴尸尹尺尻尼尽尾尿局屁层屃居屈屉届屋屎屏屐屑展屙属屠屡屣履屦屯山'
 220 |     '屹屺屼屾屿岁岂岈岊岌岍岐岑岔岖岗岘岙岚岛岜岞岠岢岣岨岩岫岬岭岱岳岵岷岸岽岿峁峂峃'
 221 |     '峄峋峒峗峘峙峛峡峣峤峥峦峧峨峪峭峰峱峻峿崀崁崂崃崄崆崇崌崎崒崔崖崚崛崞崟崡崤崦崧'
 222 |     '崩崭崮崴崶崽崾崿嵁嵅嵇嵊嵋嵌嵎嵖嵘嵚嵛嵝嵩嵫嵬嵯嵲嵴嶂嶅嶍嶒嶓嶙嶝嶟嶦嶲嶷巅巇巉'
 223 |     '巍川州巡巢工左巧巨巩巫差巯己已巳巴巷巽巾币市布帅帆师希帏帐帑帔帕帖帘帙帚帛帜帝帡'
 224 |     '带帧帨席帮帱帷常帻帼帽幂幄幅幌幔幕幖幛幞幡幢幪干平年并幸幺幻幼幽广庄庆庇床庋序庐'
 225 |     '庑库应底庖店庙庚府庞废庠庤庥度座庭庱庳庵庶康庸庹庼庾廆廉廊廋廑廒廓廖廙廛廨廪延廷'
 226 |     '建廿开弁异弃弄弆弇弈弊弋式弑弓引弗弘弛弟张弢弥弦弧弨弩弭弯弱弶弸弹强弼彀归当录彖'
 227 |     '彗彘彝彟形彤彦彧彩彪彬彭彰影彳彷役彻彼往征徂径待徇很徉徊律徐徒徕得徘徙徛徜御徨循'
 228 |     '徭微徵德徼徽心必忆忉忌忍忏忐忑忒忖志忘忙忝忞忠忡忤忧忪快忭忮忱忳念忸忺忻忽忾忿怀'
 229 |     '态怂怃怄怅怆怊怍怎怏怒怔怕怖怙怛怜思怠怡急怦性怨怩怪怫怯怵总怼怿恁恂恃恋恍恐恒恓'
 230 |     '恔恕恙恚恝恢恣恤恧恨恩恪恫恬恭息恰恳恶恸恹恺恻恼恽恿悃悄悆悈悉悌悍悒悔悖悚悛悝悟'
 231 |     '悠悢患悦您悫悬悭悯悰悱悲悴悸悻悼情惆惇惊惋惎惑惔惕惘惙惚惛惜惝惟惠惦惧惨惩惫惬惭'
 232 |     '惮惯惰想惴惶惹惺愀愁愃愆愈愉愍愎意愐愔愕愚感愠愣愤愦愧愫愭愿慆慈慊慌慎慑慕慝慢慥'
 233 |     '慧慨慬慭慰慵慷憋憎憔憕憙憧憨憩憬憭憷憺憾懂懈懊懋懑懒懔懦懵懿戆戈戊戋戌戍戎戏成我'
 234 |     '戒戕或戗战戚戛戟戡戢戣戤戥截戬戭戮戳戴户戽戾房所扁扂扃扅扆扇扈扉扊手才扎扑扒打扔'
 235 |     '托扛扞扣扦执扩扪扫扬扭扮扯扰扳扶批扺扼扽找承技抃抄抉把抑抒抓抔投抖抗折抚抛抟抠抡'
 236 |     '抢护报抨披抬抱抵抹抻押抽抿拂拃拄担拆拇拈拉拊拌拍拎拐拒拓拔拖拗拘拙招拜拟拢拣拤拥'
 237 |     '拦拧拨择括拭拮拯拱拳拴拶拷拼拽拾拿持挂指挈按挎挑挓挖挚挛挝挞挟挠挡挣挤挥挦挨挪挫'
 238 |     '振挲挹挺挽捂捃捅捆捉捋捌捍捎捏捐捕捞损捡换捣捧捩捭据捯捶捷捺捻捽掀掂掇授掉掊掌掎'
 239 |     '掏掐排掖掘掞掠探掣接控推掩措掬掭掮掰掳掴掷掸掺掼掾揄揆揉揍描提插揕揖揠握揣揩揪揭'
 240 |     '揳援揶揸揽揿搀搁搂搅搋搌搏搐搒搓搔搛搜搞搠搡搦搪搬搭搴携搽摁摄摅摆摇摈摊摏摒摔摘'
 241 |     '摛摞摧摩摭摴摸摹摽撂撄撅撇撑撒撕撖撙撞撤撩撬播撮撰撵撷撸撺撼擀擂擅操擎擐擒擘擞擢'
 242 |     '擤擦擿攀攉攒攘攥攫攮支收攸改攻攽放政故效敉敌敏救敔敕敖教敛敝敞敢散敦敩敫敬数敲整'
 243 |     '敷文斋斌斐斑斓斗料斛斜斝斟斠斡斤斥斧斩斫断斯新斶方於施旁旃旄旅旆旋旌旎族旐旒旖旗'
 244 |     '旞无既日旦旧旨早旬旭旮旯旰旱旴旵时旷旸旺旻旿昀昂昃昄昆昇昈昉昊昌明昏昒易昔昕昙昝'
 245 |     '星映昡昣昤春昧昨昪昫昭是昱昳昴昵昶昺昼昽显晁晃晅晊晋晌晏晐晒晓晔晕晖晗晙晚晞晟晡'
 246 |     '晢晤晦晨晪晫普景晰晱晴晶晷智晾暂暄暅暇暌暑暕暖暗暝暧暨暮暲暴暵暶暹暾暿曈曌曙曛曜'
 247 |     '曝曦曩曰曲曳更曷曹曼曾替最月有朋服朏朐朓朔朕朗望朝期朦木未末本札术朱朳朴朵朸机朽'
 248 |     '杀杂权杄杆杈杉杌李杏材村杓杕杖杙杜杞束杠条来杧杨杩杪杭杯杰杲杳杵杷杻杼松板极构枅'
 249 |     '枇枉枋枍析枕林枘枚果枝枞枢枣枥枧枨枪枫枭枯枰枲枳枵架枷枸枹柁柃柄柈柊柏某柑柒染柔'
 250 |     '柖柘柙柚柜柝柞柠柢查柩柬柯柰柱柳柴柷柽柿栀栅标栈栉栊栋栌栎栏栐树栒栓栖栗栝栟校栩'
 251 |     '株栲栳栴样核根栻格栽栾桀桁桂桃桄桅框案桉桊桌桎桐桑桓桔桕桠桡桢档桤桥桦桧桨桩桫桯'
 252 |     '桲桴桶桷桹梁梃梅梆梌梏梓梗梠梢梣梦梧梨梭梯械梳梴梵梼梽梾梿检棁棂棉棋棍棐棒棓棕棘'
 253 |     '棚棠棣棤棨棪棫棬森棰棱棵棹棺棻棼棽椀椁椅椆椋植椎椐椑椒椓椟椠椤椪椭椰椴椸椹椽椿楂'
 254 |     '楒楔楗楙楚楝楞楠楣楦楩楪楫楮楯楷楸楹楼概榃榄榅榆榇榈榉榍榑榔榕榖榛榜榧榨榫榭榰榱'
 255 |     '榴榷榻槁槃槊槌槎槐槔槚槛槜槟槠槭槱槲槽槿樊樗樘樟模樨横樯樱樵樽樾橄橇橐橑橘橙橛橞'
 256 |     '橡橥橦橱橹橼檀檄檎檐檑檗檞檠檩檫檬櫆欂欠次欢欣欤欧欲欸欹欺欻款歃歅歆歇歉歌歙止正'
 257 |     '此步武歧歪歹死歼殁殂殃殄殆殇殉殊残殍殒殓殖殚殛殡殣殪殳殴段殷殿毁毂毅毋毌母每毐毒'
 258 |     '毓比毕毖毗毙毛毡毪毫毯毳毵毹毽氅氆氇氍氏氐民氓气氕氖氘氙氚氛氟氡氢氤氦氧氨氩氪氮'
 259 |     '氯氰氲水永氾氿汀汁求汆汇汈汉汊汋汐汔汕汗汛汜汝汞江池污汤汧汨汩汪汫汭汰汲汴汶汹汽'
 260 |     '汾沁沂沃沄沅沆沇沈沉沌沏沐沓沔沘沙沚沛沟没沣沤沥沦沧沨沩沪沫沭沮沱河沸油沺治沼沽'
 261 |     '沾沿泂泃泄泅泇泉泊泌泐泓泔法泖泗泙泚泛泜泞泠泡波泣泥注泪泫泮泯泰泱泳泵泷泸泺泻泼'
 262 |     '泽泾洁洄洇洈洋洌洎洑洒洓洗洘洙洚洛洞洢洣津洧洨洪洫洭洮洱洲洳洴洵洸洹洺活洼洽派洿'
 263 |     '流浃浅浆浇浈浉浊测浍济浏浐浑浒浓浔浕浙浚浛浜浞浟浠浡浣浥浦浩浪浬浭浮浯浰浲浴海浸'
 264 |     '浼涂涄涅消涉涌涍涎涐涑涓涔涕涘涛涝涞涟涠涡涢涣涤润涧涨涩涪涫涮涯液涴涵涸涿淀淄淅'
 265 |     '淆淇淋淌淏淑淖淘淙淜淝淞淟淠淡淤淦淫淬淮淯深淳淴混淹添淼清渊渌渍渎渐渑渔渗渚渝渟'
 266 |     '渠渡渣渤渥温渫渭港渰渲渴游渺渼湃湄湉湍湎湑湓湔湖湘湛湜湝湟湣湫湮湲湴湾湿溁溃溅溆'
 267 |     '溇溉溍溏源溘溚溜溞溟溠溢溥溦溧溪溯溱溲溴溵溶溷溹溺溻溽滁滂滃滆滇滉滋滍滏滑滓滔滕'
 268 |     '滗滘滚滞滟滠满滢滤滥滦滧滨滩滪滫滴滹漂漆漈漉漋漏漓演漕漖漠漤漦漩漪漫漭漯漱漳漴漶'
 269 |     '漷漹漻漼漾潆潇潋潍潏潖潘潜潞潟潢潦潩潭潮潲潴潵潸潺潼潽潾澂澄澈澉澌澍澎澛澜澡澥澧'
 270 |     '澪澭澳澴澶澹澼澽激濂濉濋濑濒濞濠濡濩濮濯瀌瀍瀑瀔瀚瀛瀣瀱瀵瀹瀼灈灌灏灞火灭灯灰灵'
 271 |     '灶灸灼灾灿炀炅炆炉炊炌炎炒炔炕炖炘炙炜炝炟炣炫炬炭炮炯炱炳炷炸点炻炼炽烀烁烂烃烈'
 272 |     '烊烔烘烙烛烜烝烟烠烤烦烧烨烩烫烬热烯烶烷烹烺烻烽焆焉焊焌焐焓焕焖焗焘焙焚焜焞焦焯'
 273 |     '焰焱然煁煃煅煊煋煌煎煓煜煞煟煤煦照煨煮煲煳煴煸煺煽熄熇熊熏熔熘熙熛熜熟熠熥熨熬熵'
 274 |     '熹熻燃燊燋燎燏燔燕燚燠燥燧燮燹爆爇爔爚爝爟爨爪爬爰爱爵父爷爸爹爻爽爿牁牂片版牌牍'
 275 |     '牒牖牙牚牛牝牟牡牢牤牥牦牧物牮牯牲牵特牺牻牾牿犀犁犄犇犊犋犍犏犒犟犨犬犯犰犴状犷'
 276 |     '犸犹狁狂狃狄狈狉狍狎狐狒狗狙狝狞狠狡狨狩独狭狮狯狰狱狲狳狴狷狸狺狻狼猁猃猄猇猊猎'
 277 |     '猕猖猗猛猜猝猞猡猢猥猩猪猫猬献猯猰猱猴猷猹猺猾猿獍獐獒獗獠獬獭獯獴獾玃玄率玉王玎'
 278 |     '玑玒玓玕玖玘玙玚玛玞玟玠玡玢玤玥玦玩玫玭玮环现玱玲玳玶玷玹玺玻玼玿珀珂珅珇珈珉珊'
 279 |     '珋珌珍珏珐珑珒珕珖珙珛珝珞珠珢珣珥珦珧珩珪珫班珰珲珵珷珸珹珺珽琀球琄琅理琇琈琉琊'
 280 |     '琎琏琐琔琚琛琟琡琢琤琥琦琨琪琫琬琭琮琯琰琲琳琴琵琶琼瑀瑁瑂瑃瑄瑅瑆瑑瑓瑔瑕瑖瑗瑙'
 281 |     '瑚瑛瑜瑝瑞瑟瑢瑧瑨瑬瑭瑰瑱瑳瑶瑷瑾璀璁璃璆璇璈璋璎璐璒璘璜璞璟璠璥璧璨璩璪璬璮璱'
 282 |     '璲璺瓀瓒瓖瓘瓜瓞瓠瓢瓣瓤瓦瓮瓯瓴瓶瓷瓻瓿甄甍甏甑甓甗甘甚甜生甡甥甦用甩甪甫甬甭甯'
 283 |     '田由甲申电男甸町画甾畀畅畈畋界畎畏畔畖留畚畛畜畤略畦番畬畯畲畴畸畹畿疁疃疆疍疏疐'
 284 |     '疑疔疖疗疙疚疝疟疠疡疢疣疤疥疫疬疭疮疯疰疱疲疳疴疵疸疹疼疽疾痂痃痄病症痈痉痊痍痒'
 285 |     '痓痔痕痘痛痞痢痣痤痦痧痨痪痫痰痱痴痹痼痿瘀瘁瘃瘅瘆瘊瘌瘐瘕瘗瘘瘙瘛瘟瘠瘢瘤瘥瘦瘩'
 286 |     '瘪瘫瘭瘰瘳瘴瘵瘸瘼瘾瘿癀癃癌癍癔癖癗癜癞癣癫癯癸登白百癿皂的皆皇皈皋皎皑皓皕皖皙'
 287 |     '皛皞皤皦皭皮皱皲皴皿盂盅盆盈盉益盍盎盏盐监盒盔盖盗盘盛盟盥盦目盯盱盲直盷相盹盼盾'
 288 |     '省眄眇眈眉眊看眍眙眚真眠眢眦眨眩眬眭眯眵眶眷眸眺眼着睁睃睄睇睎睐睑睚睛睡睢督睥睦'
 289 |     '睨睫睬睹睽睾睿瞀瞄瞅瞋瞌瞍瞎瞑瞒瞟瞠瞢瞥瞧瞩瞪瞫瞬瞭瞰瞳瞵瞻瞽瞿矍矗矛矜矞矢矣知'
 290 |     '矧矩矫矬短矮矰石矶矸矻矼矾矿砀码砂砄砆砉砌砍砑砒研砖砗砘砚砜砝砟砠砣砥砧砫砬砭砮'
 291 |     '砰破砵砷砸砹砺砻砼砾础硁硅硇硊硌硍硎硐硒硔硕硖硗硙硚硝硪硫硬硭确硼硿碃碇碈碉碌碍'
 292 |     '碎碏碑碓碗碘碚碛碜碟碡碣碥碧碨碰碱碲碳碴碶碹碾磁磅磉磊磋磏磐磔磕磙磜磡磨磬磲磴磷'
 293 |     '磹磻礁礅礌礓礞礴礵示礼社祀祁祃祆祇祈祉祊祋祎祏祐祓祕祖祗祚祛祜祝神祟祠祢祥祧票祭'
 294 |     '祯祲祷祸祺祼祾禀禁禄禅禊禋福禒禔禘禚禛禤禧禳禹禺离禽禾秀私秃秆秉秋种科秒秕秘租秣'
 295 |     '秤秦秧秩秫秬秭积称秸移秽秾稀稂稃稆程稌稍税稑稔稗稙稚稞稠稣稳稷稹稻稼稽稿穄穆穑穗'
 296 |     '穙穜穟穰穴究穷穸穹空穿窀突窃窄窅窈窊窍窎窑窒窕窖窗窘窜窝窟窠窣窥窦窨窬窭窳窸窿立'
 297 |     '竑竖竘站竞竟章竣童竦竫竭端竹竺竽竿笃笄笆笈笊笋笏笑笔笕笙笛笞笠笤笥符笨笪笫第笮笯'
 298 |     '笱笳笸笺笼笾筀筅筇等筋筌筏筐筑筒答策筘筚筛筜筝筠筢筤筥筦筮筱筲筵筶筷筹筻筼签简箅'
 299 |     '箍箐箓箔箕箖算箜管箢箦箧箨箩箪箫箬箭箱箴箸篁篆篇篌篑篓篙篚篝篡篥篦篪篮篯篱篷篼篾'
 300 |     '簃簇簉簋簌簏簕簖簝簟簠簧簪簰簸簿籀籁籍籥米籴类籼籽粉粑粒粕粗粘粜粝粞粟粢粤粥粪粮'
 301 |     '粱粲粳粹粼粽精粿糁糅糇糈糊糌糍糒糕糖糗糙糜糟糠糨糯糵系紊素索紧紫累絜絮絷綦綮縠縢'
 302 |     '縻繁繄繇纂纛纠纡红纣纤纥约级纨纩纪纫纬纭纮纯纰纱纲纳纴纵纶纷纸纹纺纻纼纽纾线绀绁'
 303 |     '绂练组绅细织终绉绊绋绌绍绎经绐绑绒结绔绕绖绗绘给绚绛络绝绞统绠绡绢绣绤绥绦继绨绩'
 304 |     '绪绫续绮绯绰绱绲绳维绵绶绷绸绹绺绻综绽绾绿缀缁缂缃缄缅缆缇缈缉缊缌缎缐缑缒缓缔缕'
 305 |     '编缗缘缙缚缛缜缝缞缟缠缡缢缣缤缥缦缧缨缩缪缫缬缭缮缯缰缱缲缳缴缵缶缸缺罂罄罅罍罐'
 306 |     '网罔罕罗罘罚罟罡罢罨罩罪置罱署罴罶罹罽罾羁羊羌美羑羓羔羕羖羚羝羞羟羡群羧羯羰羱羲'
 307 |     '羸羹羼羽羿翀翁翂翃翅翈翊翌翎翔翕翘翙翚翛翟翠翡翥翦翩翮翯翰翱翳翷翻翼翾耀老考耄者'
 308 |     '耆耇耋而耍耏耐耑耒耔耕耖耗耘耙耜耠耢耤耥耦耧耨耩耪耰耱耳耵耶耷耸耻耽耿聂聃聆聊聋'
 309 |     '职聍聒联聘聚聩聪聱聿肃肄肆肇肉肋肌肓肖肘肚肛肝肟肠股肢肤肥肩肪肫肭肮肯肱育肴肷肸'
 310 |     '肺肼肽肾肿胀胁胂胃胄胆胈背胍胎胖胗胙胚胛胜胝胞胠胡胣胤胥胧胨胩胪胫胬胭胯胰胱胲胳'
 311 |     '胴胶胸胺胼能脂脆脉脊脍脎脏脐脑脒脓脔脖脘脚脞脟脩脬脯脱脲脶脸脾脿腆腈腊腋腌腐腑腒'
 312 |     '腓腔腕腘腙腚腠腥腧腨腩腭腮腯腰腱腴腹腺腻腼腽腾腿膀膂膈膊膏膑膘膙膛膜膝膦膨膳膺膻'
 313 |     '臀臂臃臆臊臌臑臜臣臧自臬臭至致臻臼臾舀舁舂舄舅舆舌舍舐舒舔舛舜舞舟舠舢舣舥航舫般'
 314 |     '舭舯舰舱舲舳舴舵舶舷舸船舻舾艄艅艇艉艋艎艏艘艚艟艨艮良艰色艳艴艺艽艾艿节芃芄芈芊'
 315 |     '芋芍芎芏芑芒芗芘芙芜芝芟芠芡芣芤芥芦芨芩芪芫芬芭芮芯芰花芳芴芷芸芹芼芽芾苁苄苇苈'
 316 |     '苉苊苋苌苍苎苏苑苒苓苔苕苗苘苛苜苞苟苠苡苣苤若苦苧苫苯英苴苷苹苻苾茀茁茂范茄茅茆'
 317 |     '茈茉茋茌茎茏茑茓茔茕茗茚茛茜茝茧茨茫茬茭茯茱茳茴茵茶茸茹茺茼茽荀荁荃荄荆荇草荏荐'
 318 |     '荑荒荓荔荖荙荚荛荜荞荟荠荡荣荤荥荦荧荨荩荪荫荬荭荮药荷荸荻荼荽莅莆莉莎莒莓莘莙莛'
 319 |     '莜莝莞莠莨莩莪莫莰莱莲莳莴莶获莸莹莺莼莽莿菀菁菂菅菇菉菊菌菍菏菔菖菘菜菝菟菠菡菥'
 320 |     '菩菪菰菱菲菹菼菽萁萃萄萆萋萌萍萎萏萑萘萚萜萝萣萤营萦萧萨萩萱萳萸萹萼落葆葎葑葖著'
 321 |     '葙葚葛葜葡董葩葫葬葭葰葱葳葴葵葶葸葺蒂蒄蒇蒈蒉蒋蒌蒎蒐蒗蒙蒜蒟蒡蒨蒯蒱蒲蒴蒸蒹蒺'
 322 |     '蒻蒽蒿蓁蓂蓄蓇蓉蓊蓍蓏蓐蓑蓓蓖蓝蓟蓠蓢蓣蓥蓦蓬蓰蓼蓿蔀蔃蔈蔊蔌蔑蔓蔗蔚蔟蔡蔫蔬蔷'
 323 |     '蔸蔹蔺蔻蔼蔽蕃蕈蕉蕊蕖蕗蕙蕞蕤蕨蕰蕲蕴蕹蕺蕻蕾薁薄薅薇薏薛薜薢薤薨薪薮薯薰薳薷薸'
 324 |     '薹薿藁藉藏藐藓藕藜藟藠藤藦藨藩藻藿蘅蘑蘖蘘蘧蘩蘸蘼虎虏虐虑虒虓虔虚虞虢虤虫虬虮虱'
 325 |     '虷虸虹虺虻虼虽虾虿蚀蚁蚂蚄蚆蚊蚋蚌蚍蚓蚕蚜蚝蚣蚤蚧蚨蚩蚪蚬蚯蚰蚱蚲蚴蚶蚺蛀蛃蛄蛆'
 326 |     '蛇蛉蛊蛋蛎蛏蛐蛑蛔蛘蛙蛛蛞蛟蛤蛩蛭蛮蛰蛱蛲蛳蛴蛸蛹蛾蜀蜂蜃蜇蜈蜉蜊蜍蜎蜐蜒蜓蜕蜗'
 327 |     '蜘蜚蜜蜞蜡蜢蜣蜥蜩蜮蜱蜴蜷蜻蜾蜿蝇蝈蝉蝌蝎蝓蝗蝘蝙蝠蝣蝤蝥蝮蝰蝲蝴蝶蝻蝼蝽蝾螂螃'
 328 |     '螅螈螋融螗螟螠螣螨螫螬螭螯螱螳螵螺螽蟀蟆蟊蟋蟏蟑蟒蟛蟠蟥蟪蟫蟮蟹蟾蠃蠊蠋蠓蠕蠖蠡'
 329 |     '蠢蠲蠹蠼血衃衄衅行衍衎衒衔街衙衠衡衢衣补表衩衫衬衮衰衲衷衽衾衿袁袂袄袅袆袈袋袍袒'
 330 |     '袖袗袜袢袤袪被袭袯袱袷袼裁裂装裆裈裉裎裒裔裕裘裙裛裟裢裣裤裥裨裰裱裳裴裸裹裼裾褂'
 331 |     '褊褐褒褓褕褙褚褛褟褡褥褪褫褯褰褴褶襁襄襕襚襜襞襟襦襫襻西要覃覆见观觃规觅视觇览觉'
 332 |     '觊觋觌觎觏觐觑角觖觚觜觞觟解觥触觫觭觯觱觳觿言訄訇訚訾詈詟詹誉誊誓謇警譬计订讣认'
 333 |     '讥讦讧讨让讪讫训议讯记讱讲讳讴讵讶讷许讹论讻讼讽设访诀证诂诃评诅识诇诈诉诊诋诌词'
 334 |     '诎诏诐译诒诓诔试诖诗诘诙诚诛诜话诞诟诠诡询诣诤该详诧诨诩诫诬语诮误诰诱诲诳说诵请'
 335 |     '诸诹诺读诼诽课诿谀谁谂调谄谅谆谇谈谊谋谌谍谎谏谐谑谒谓谔谕谖谗谙谚谛谜谝谞谟谠谡'
 336 |     '谢谣谤谥谦谧谨谩谪谫谬谭谮谯谰谱谲谳谴谵谶谷谼谿豁豆豇豉豌豕豚象豢豨豪豫豮豳豸豹'
 337 |     '豺貂貅貆貉貊貌貔貘贝贞负贡财责贤败账货质贩贪贫贬购贮贯贰贱贲贳贴贵贶贷贸费贺贻贼'
 338 |     '贽贾贿赀赁赂赃资赅赆赇赈赉赊赋赌赍赎赏赐赑赒赓赔赕赖赗赘赙赚赛赜赝赞赟赠赡赢赣赤'
 339 |     '赦赧赪赫赭走赳赴赵赶起趁趄超越趋趑趔趟趣趯趱足趴趵趸趺趼趾趿跂跃跄跆跋跌跎跏跐跑'
 340 |     '跖跗跚跛距跞跟跣跤跨跪跬路跱跳践跶跷跸跹跺跻跽踅踉踊踌踏踒踔踝踞踟踢踣踦踩踪踬踮'
 341 |     '踯踱踵踶踹踺踽蹀蹁蹂蹄蹅蹇蹈蹉蹊蹋蹐蹑蹒蹙蹚蹜蹢蹦蹩蹬蹭蹯蹰蹲蹴蹶蹼蹽蹾蹿躁躅躇'
 342 |     '躏躐躔躜躞身躬躯躲躺车轧轨轩轪轫转轭轮软轰轱轲轳轴轵轶轷轸轹轺轻轼载轾轿辀辁辂较'
 343 |     '辄辅辆辇辈辉辊辋辌辍辎辏辐辑辒输辔辕辖辗辘辙辚辛辜辞辟辣辨辩辫辰辱边辽达辿迁迂迄'
 344 |     '迅过迈迎运近迓返迕还这进远违连迟迢迤迥迦迨迩迪迫迭迮述迳迷迸迹迺追退送适逃逄逅逆'
 345 |     '选逊逋逍透逐逑递途逖逗通逛逝逞速造逡逢逦逭逮逯逴逵逶逸逻逼逾遁遂遄遆遇遍遏遐遑遒'
 346 |     '道遗遘遛遢遣遥遨遭遮遴遵遹遽避邀邂邃邈邋邑邓邕邗邘邙邛邝邠邡邢那邦邨邪邬邮邯邰邱'
 347 |     '邲邳邴邵邶邸邹邺邻邽邾邿郁郃郄郅郇郈郊郎郏郐郑郓郗郚郛郜郝郡郢郤郦郧部郪郫郭郯郴'
 348 |     '郸都郾郿鄀鄂鄃鄄鄅鄌鄑鄗鄘鄙鄚鄜鄞鄠鄢鄣鄫鄯鄱鄹酂酃酅酆酉酊酋酌配酎酏酐酒酗酚酝'
 349 |     '酞酡酢酣酤酥酦酩酪酬酮酯酰酱酲酴酵酶酷酸酹酺酽酾酿醅醇醉醋醌醍醐醑醒醚醛醢醨醪醭'
 350 |     '醮醯醴醵醺醾采釉释里重野量釐金釜鉴銎銮鋆鋈錾鍪鎏鏊鏖鐾鑫钆钇针钉钊钋钌钍钎钏钐钒'
 351 |     '钓钔钕钖钗钘钙钚钛钜钝钞钟钠钡钢钣钤钥钦钧钨钩钪钫钬钭钮钯钰钱钲钳钴钵钷钹钺钻钼'
 352 |     '钽钾钿铀铁铂铃铄铅铆铈铉铊铋铌铍铎铏铐铑铒铕铖铗铘铙铚铛铜铝铞铟铠铡铢铣铤铥铧铨'
 353 |     '铩铪铫铬铭铮铯铰铱铲铳铴铵银铷铸铹铺铻铼铽链铿销锁锂锃锄锅锆锇锈锉锊锋锌锍锎锏锐'
 354 |     '锑锒锓锔锕锖锗锘错锚锛锜锝锞锟锡锢锣锤锥锦锧锨锩锪锫锬锭键锯锰锱锲锳锴锵锶锷锸锹'
 355 |     '锺锻锼锽锾锿镀镁镂镃镄镅镆镇镈镉镊镋镌镍镎镏镐镑镒镓镔镕镖镗镘镚镛镜镝镞镠镡镢镣'
 356 |     '镤镥镦镧镨镩镪镫镬镭镮镯镰镱镲镳镴镵镶长门闩闪闫闭问闯闰闱闲闳间闵闶闷闸闹闺闻闼'
 357 |     '闽闾闿阀阁阂阃阄阅阆阇阈阉阊阋阌阍阎阏阐阑阒阔阕阖阗阘阙阚阜队阡阪阮阱防阳阴阵阶'
 358 |     '阻阼阽阿陀陂附际陆陇陈陉陋陌降陎限陑陔陕陛陞陟陡院除陧陨险陪陬陲陴陵陶陷隃隅隆隈'
 359 |     '隋隍随隐隔隗隘隙障隧隩隰隳隶隹隺隼隽难雀雁雄雅集雇雉雊雌雍雎雏雒雕雠雨雩雪雯雱雳'
 360 |     '零雷雹雾需霁霄霅霆震霈霉霍霎霏霓霖霜霞霨霪霭霰露霸霹霾青靓靖静靛非靠靡面靥革靬靰'
 361 |     '靳靴靶靸靺靼靽靿鞁鞅鞋鞍鞑鞒鞔鞘鞠鞡鞣鞧鞨鞫鞬鞭鞮鞯鞲鞳鞴韂韦韧韨韩韪韫韬韭音韵'
 362 |     '韶页顶顷顸项顺须顼顽顾顿颀颁颂颃预颅领颇颈颉颊颋颌颍颎颏颐频颓颔颖颗题颙颚颛颜额'
 363 |     '颞颟颠颡颢颤颥颦颧风飏飐飑飒飓飔飕飗飘飙飞食飧飨餍餐餮饔饕饥饧饨饩饪饫饬饭饮饯饰'
 364 |     '饱饲饳饴饵饶饷饸饹饺饻饼饽饿馁馃馄馅馆馇馈馉馊馋馌馍馏馐馑馒馓馔馕首馗馘香馝馞馥'
 365 |     '馧馨马驭驮驯驰驱驲驳驴驵驶驷驸驹驺驻驼驽驾驿骀骁骂骃骄骅骆骇骈骉骊骋验骍骎骏骐骑'
 366 |     '骒骓骕骖骗骘骙骚骛骜骝骞骟骠骡骢骣骤骥骦骧骨骰骱骶骷骸骺骼髀髁髂髃髅髋髌髎髑髓高'
 367 |     '髡髢髦髫髭髯髹髻髽鬃鬈鬏鬒鬓鬘鬟鬣鬯鬲鬶鬷鬻鬼魁魂魃魄魅魆魇魈魉魋魍魏魑魔鱼鱽鱾'
 368 |     '鱿鲀鲁鲂鲃鲅鲆鲇鲈鲉鲊鲋鲌鲍鲎鲏鲐鲑鲒鲔鲕鲖鲗鲘鲙鲚鲛鲜鲝鲞鲟鲠鲡鲢鲣鲤鲥鲦鲧鲨'
 369 |     '鲩鲪鲫鲬鲭鲮鲯鲰鲱鲲鲳鲴鲵鲷鲸鲹鲺鲻鲼鲽鲾鲿鳀鳁鳂鳃鳄鳅鳇鳈鳉鳊鳌鳍鳎鳏鳐鳑鳒鳓'
 370 |     '鳔鳕鳖鳗鳘鳙鳚鳛鳜鳝鳞鳟鳠鳡鳢鳣鳤鸟鸠鸡鸢鸣鸤鸥鸦鸧鸨鸩鸪鸫鸬鸭鸮鸯鸰鸱鸲鸳鸵鸶'
 371 |     '鸷鸸鸹鸺鸻鸼鸽鸾鸿鹀鹁鹂鹃鹄鹅鹆鹇鹈鹉鹊鹋鹌鹍鹎鹏鹐鹑鹒鹔鹕鹖鹗鹘鹙鹚鹛鹜鹝鹞鹟'
 372 |     '鹠鹡鹢鹣鹤鹦鹧鹨鹩鹪鹫鹬鹭鹮鹯鹰鹱鹲鹳鹴鹾鹿麀麂麇麈麋麑麒麓麖麝麟麦麸麹麻麽麾黄'
 373 |     '黇黉黍黎黏黑黔默黛黜黝黟黠黡黢黥黧黩黪黯黹黻黼黾鼋鼍鼎鼐鼒鼓鼗鼙鼠鼢鼩鼫鼬鼯鼱鼷'
 374 |     '鼹鼻鼽鼾齁齇齉齐齑齿龀龁龂龃龄龅龆龇龈龉龊龋龌龙龚龛龟龠龢鿍鿎鿏㑇㑊㕮㘎㙍㙘㙦㛃'
 375 |     '㛚㛹㟃㠇㠓㤘㥄㧐㧑㧟㫰㬊㬎㬚㭎㭕㮾㰀㳇㳘㳚㴔㵐㶲㸆㸌㺄㻬㽏㿠䁖䂮䃅䃎䅟䌹䎃䎖䏝䏡'
 376 |     '䏲䐃䓖䓛䓨䓫䓬䗖䗛䗪䗴䜣䝙䢺䢼䣘䥽䦃䲟䲠䲢䴓䴔䴕䴖䴗䴘䴙䶮𠅤𠙶𠳐𡎚𡐓𣗋𣲗𣲘𣸣𤧛𤩽'
 377 |     '𤫉𥔲𥕢𥖨𥻗𦈡𦒍𦙶𦝼𦭜𦰡𧿹𨐈𨙸𨚕𨟠𨭉𨱇𨱏𨱑𨱔𨺙𩽾𩾃𩾌𪟝𪣻𪤗𪨰𪨶𪩘𪾢𫄧𫄨𫄷𫄸𫇭𫌀𫍣𫍯'
 378 |     '𫍲𫍽𫐄𫐐𫐓𫑡𫓧𫓯𫓶𫓹𫔍𫔎𫔶𫖮𫖯𫖳𫗧𫗴𫘜𫘝𫘦𫘧𫘨𫘪𫘬𫚕𫚖𫚭𫛭𫞩𫟅𫟦𫟹𫟼𫠆𫠊𫠜𫢸𫫇𫭟'
 379 |     '𫭢𫭼𫮃𫰛𫵷𫶇𫷷𫸩𬀩𬀪𬂩𬃊𬇕𬇙𬇹𬉼𬊈𬊤𬌗𬍛𬍡𬍤𬒈𬒔𬒗𬕂𬘓𬘘𬘡𬘩𬘫𬘬𬘭𬘯𬙂𬙊𬙋𬜬𬜯𬞟'
 380 |     '𬟁𬟽𬣙𬣞𬣡𬣳𬤇𬤊𬤝𬨂𬨎𬩽𬪩𬬩𬬭𬬮𬬱𬬸𬬹𬬻𬬿𬭁𬭊𬭎𬭚𬭛𬭤𬭩𬭬𬭯𬭳𬭶𬭸𬭼𬮱𬮿𬯀𬯎𬱖𬱟'
 381 |     '𬳵𬳶𬳽𬳿𬴂𬴃𬴊𬶋𬶍𬶏𬶐𬶟𬶠𬶨𬶭𬶮𬷕𬸘𬸚𬸣𬸦𬸪𬹼𬺈𬺓'
 382 | )
 383 | CN_CHARS_EXT = '吶诶屌囧飚屄'
 384 | 
 385 | CN_CHARS = CN_CHARS_COMMON + CN_CHARS_EXT
 386 | IN_CH_CHARS = {c: True for c in CN_CHARS}
 387 | 
 388 | EN_CHARS = string.ascii_letters + string.digits
 389 | IN_EN_CHARS = {c: True for c in EN_CHARS}
 390 | 
 391 | VALID_CHARS = CN_CHARS + EN_CHARS + ' ' + PUNCS
 392 | IN_VALID_CHARS = {c: True for c in VALID_CHARS}
 393 | 
 394 | # ================================================================================ #
 395 | #                                    basic class
 396 | # ================================================================================ #
 397 | 
 398 | 
 399 | class ChineseChar(object):
 400 |     """
 401 |     中文字符
 402 |     每个字符对应简体和繁体,
 403 |     e.g. 简体 = '负', 繁体 = '負'
 404 |     转换时可转换为简体或繁体
 405 |     """
 406 | 
 407 |     def __init__(self, simplified, traditional):
 408 |         self.simplified = simplified
 409 |         self.traditional = traditional
 410 |         # self.__repr__ = self.__str__
 411 | 
 412 |     def __str__(self):
 413 |         return self.simplified or self.traditional or None
 414 | 
 415 |     def __repr__(self):
 416 |         return self.__str__()
 417 | 
 418 | 
 419 | class ChineseNumberUnit(ChineseChar):
 420 |     """
 421 |     中文数字/数位字符
 422 |     每个字符除繁简体外还有一个额外的大写字符
 423 |     e.g. '陆' 和 '陸'
 424 |     """
 425 | 
 426 |     def __init__(self, power, simplified, traditional, big_s, big_t):
 427 |         super(ChineseNumberUnit, self).__init__(simplified, traditional)
 428 |         self.power = power
 429 |         self.big_s = big_s
 430 |         self.big_t = big_t
 431 | 
 432 |     def __str__(self):
 433 |         return '10^{}'.format(self.power)
 434 | 
 435 |     @classmethod
 436 |     def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
 437 | 
 438 |         if small_unit:
 439 |             return ChineseNumberUnit(power=index + 1,
 440 |                                      simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
 441 |         elif numbering_type == NUMBERING_TYPES[0]:
 442 |             return ChineseNumberUnit(power=index + 8,
 443 |                                      simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
 444 |         elif numbering_type == NUMBERING_TYPES[1]:
 445 |             return ChineseNumberUnit(power=(index + 2) * 4,
 446 |                                      simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
 447 |         elif numbering_type == NUMBERING_TYPES[2]:
 448 |             return ChineseNumberUnit(power=pow(2, index + 3),
 449 |                                      simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
 450 |         else:
 451 |             raise ValueError(
 452 |                 'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
 453 | 
 454 | 
 455 | class ChineseNumberDigit(ChineseChar):
 456 |     """
 457 |     中文数字字符
 458 |     """
 459 | 
 460 |     def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
 461 |         super(ChineseNumberDigit, self).__init__(simplified, traditional)
 462 |         self.value = value
 463 |         self.big_s = big_s
 464 |         self.big_t = big_t
 465 |         self.alt_s = alt_s
 466 |         self.alt_t = alt_t
 467 | 
 468 |     def __str__(self):
 469 |         return str(self.value)
 470 | 
 471 |     @classmethod
 472 |     def create(cls, i, v):
 473 |         return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
 474 | 
 475 | 
 476 | class ChineseMath(ChineseChar):
 477 |     """
 478 |     中文数位字符
 479 |     """
 480 | 
 481 |     def __init__(self, simplified, traditional, symbol, expression=None):
 482 |         super(ChineseMath, self).__init__(simplified, traditional)
 483 |         self.symbol = symbol
 484 |         self.expression = expression
 485 |         self.big_s = simplified
 486 |         self.big_t = traditional
 487 | 
 488 | 
 489 | CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
 490 | 
 491 | 
 492 | class NumberSystem(object):
 493 |     """
 494 |     中文数字系统
 495 |     """
 496 |     pass
 497 | 
 498 | 
 499 | class MathSymbol(object):
 500 |     """
 501 |     用于中文数字系统的数学符号 (繁/简体), e.g.
 502 |     positive = ['正', '正']
 503 |     negative = ['负', '負']
 504 |     point = ['点', '點']
 505 |     """
 506 | 
 507 |     def __init__(self, positive, negative, point):
 508 |         self.positive = positive
 509 |         self.negative = negative
 510 |         self.point = point
 511 | 
 512 |     def __iter__(self):
 513 |         for v in self.__dict__.values():
 514 |             yield v
 515 | 
 516 | 
 517 | # class OtherSymbol(object):
 518 | #     """
 519 | #     其他符号
 520 | #     """
 521 | #
 522 | #     def __init__(self, sil):
 523 | #         self.sil = sil
 524 | #
 525 | #     def __iter__(self):
 526 | #         for v in self.__dict__.values():
 527 | #             yield v
 528 | 
 529 | 
 530 | # ================================================================================ #
 531 | #                                    basic utils
 532 | # ================================================================================ #
 533 | def create_system(numbering_type=NUMBERING_TYPES[1]):
 534 |     """
 535 |     根据数字系统类型返回创建相应的数字系统，默认为 mid
 536 |     NUMBERING_TYPES = ['low', 'mid', 'high']: 中文数字系统类型
 537 |         low:  '兆' = '亿' * '十' = $10^{9}$,  '京' = '兆' * '十', etc.
 538 |         mid:  '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc.
 539 |         high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc.
 540 |     返回对应的数字系统
 541 |     """
 542 | 
 543 |     # chinese number units of '亿' and larger
 544 |     all_larger_units = zip(
 545 |         LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
 546 |     larger_units = [CNU.create(i, v, numbering_type, False)
 547 |                     for i, v in enumerate(all_larger_units)]
 548 |     # chinese number units of '十, 百, 千, 万'
 549 |     all_smaller_units = zip(
 550 |         SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
 551 |     smaller_units = [CNU.create(i, v, small_unit=True)
 552 |                      for i, v in enumerate(all_smaller_units)]
 553 |     # digis
 554 |     chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
 555 |                         BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
 556 |     digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
 557 |     digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
 558 |     digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
 559 |     digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
 560 | 
 561 |     # symbols
 562 |     positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
 563 |     negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
 564 |     point_cn = CM(POINT[0], POINT[1], '.', lambda x,
 565 |                   y: float(str(x) + '.' + str(y)))
 566 |     # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
 567 |     system = NumberSystem()
 568 |     system.units = smaller_units + larger_units
 569 |     system.digits = digits
 570 |     system.math = MathSymbol(positive_cn, negative_cn, point_cn)
 571 |     # system.symbols = OtherSymbol(sil_cn)
 572 |     return system
 573 | 
 574 | 
 575 | def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
 576 | 
 577 |     def get_symbol(char, system):
 578 |         for u in system.units:
 579 |             if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
 580 |                 return u
 581 |         for d in system.digits:
 582 |             if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
 583 |                 return d
 584 |         for m in system.math:
 585 |             if char in [m.traditional, m.simplified]:
 586 |                 return m
 587 | 
 588 |     def string2symbols(chinese_string, system):
 589 |         int_string, dec_string = chinese_string, ''
 590 |         for p in [system.math.point.simplified, system.math.point.traditional]:
 591 |             if p in chinese_string:
 592 |                 int_string, dec_string = chinese_string.split(p)
 593 |                 break
 594 |         return [get_symbol(c, system) for c in int_string], \
 595 |                [get_symbol(c, system) for c in dec_string]
 596 | 
 597 |     def correct_symbols(integer_symbols, system):
 598 |         """
 599 |         一百八 to 一百八十
 600 |         一亿一千三百万 to 一亿 一千万 三百万
 601 |         """
 602 | 
 603 |         if integer_symbols and isinstance(integer_symbols[0], CNU):
 604 |             if integer_symbols[0].power == 1:
 605 |                 integer_symbols = [system.digits[1]] + integer_symbols
 606 | 
 607 |         if len(integer_symbols) > 1:
 608 |             if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
 609 |                 integer_symbols.append(
 610 |                     CNU(integer_symbols[-2].power - 1, None, None, None, None))
 611 | 
 612 |         result = []
 613 |         unit_count = 0
 614 |         for s in integer_symbols:
 615 |             if isinstance(s, CND):
 616 |                 result.append(s)
 617 |                 unit_count = 0
 618 |             elif isinstance(s, CNU):
 619 |                 current_unit = CNU(s.power, None, None, None, None)
 620 |                 unit_count += 1
 621 | 
 622 |             if unit_count == 1:
 623 |                 result.append(current_unit)
 624 |             elif unit_count > 1:
 625 |                 for i in range(len(result)):
 626 |                     if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
 627 |                         result[-i - 1] = CNU(result[-i - 1].power +
 628 |                                              current_unit.power, None, None, None, None)
 629 |         return result
 630 | 
 631 |     def compute_value(integer_symbols):
 632 |         """
 633 |         Compute the value.
 634 |         When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
 635 |         e.g. '两千万' = 2000 * 10000 not 2000 + 10000
 636 |         """
 637 |         value = [0]
 638 |         last_power = 0
 639 |         for s in integer_symbols:
 640 |             if isinstance(s, CND):
 641 |                 value[-1] = s.value
 642 |             elif isinstance(s, CNU):
 643 |                 value[-1] *= pow(10, s.power)
 644 |                 if s.power > last_power:
 645 |                     value[:-1] = list(map(lambda v: v *
 646 |                                           pow(10, s.power), value[:-1]))
 647 |                     last_power = s.power
 648 |                 value.append(0)
 649 |         return sum(value)
 650 | 
 651 |     system = create_system(numbering_type)
 652 |     int_part, dec_part = string2symbols(chinese_string, system)
 653 |     int_part = correct_symbols(int_part, system)
 654 |     int_str = str(compute_value(int_part))
 655 |     dec_str = ''.join([str(d.value) for d in dec_part])
 656 |     if dec_part:
 657 |         return '{0}.{1}'.format(int_str, dec_str)
 658 |     else:
 659 |         return int_str
 660 | 
 661 | 
 662 | def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
 663 |             traditional=False, alt_zero=False, alt_one=False, alt_two=True,
 664 |             use_zeros=True, use_units=True):
 665 | 
 666 |     def get_value(value_string, use_zeros=True):
 667 | 
 668 |         striped_string = value_string.lstrip('0')
 669 | 
 670 |         # record nothing if all zeros
 671 |         if not striped_string:
 672 |             return []
 673 | 
 674 |         # record one digits
 675 |         elif len(striped_string) == 1:
 676 |             if use_zeros and len(value_string) != len(striped_string):
 677 |                 return [system.digits[0], system.digits[int(striped_string)]]
 678 |             else:
 679 |                 return [system.digits[int(striped_string)]]
 680 | 
 681 |         # recursively record multiple digits
 682 |         else:
 683 |             result_unit = next(u for u in reversed(
 684 |                 system.units) if u.power < len(striped_string))
 685 |             result_string = value_string[:-result_unit.power]
 686 |             return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])
 687 | 
 688 |     system = create_system(numbering_type)
 689 | 
 690 |     int_dec = number_string.split('.')
 691 |     if len(int_dec) == 1:
 692 |         int_string = int_dec[0]
 693 |         dec_string = ""
 694 |     elif len(int_dec) == 2:
 695 |         int_string = int_dec[0]
 696 |         dec_string = int_dec[1]
 697 |     else:
 698 |         raise ValueError(
 699 |             "invalid input num string with more than one dot: {}".format(number_string))
 700 | 
 701 |     if use_units and len(int_string) > 1:
 702 |         result_symbols = get_value(int_string)
 703 |     else:
 704 |         result_symbols = [system.digits[int(c)] for c in int_string]
 705 |     dec_symbols = [system.digits[int(c)] for c in dec_string]
 706 |     if dec_string:
 707 |         result_symbols += [system.math.point] + dec_symbols
 708 | 
 709 |     if alt_two:
 710 |         liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
 711 |                     system.digits[2].big_s, system.digits[2].big_t)
 712 |         for i, v in enumerate(result_symbols):
 713 |             if isinstance(v, CND) and v.value == 2:
 714 |                 next_symbol = result_symbols[i +
 715 |                                              1] if i < len(result_symbols) - 1 else None
 716 |                 previous_symbol = result_symbols[i - 1] if i > 0 else None
 717 |                 if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
 718 |                     if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
 719 |                         result_symbols[i] = liang
 720 | 
 721 |     # if big is True, '两' will not be used and `alt_two` has no impact on output
 722 |     if big:
 723 |         attr_name = 'big_'
 724 |         if traditional:
 725 |             attr_name += 't'
 726 |         else:
 727 |             attr_name += 's'
 728 |     else:
 729 |         if traditional:
 730 |             attr_name = 'traditional'
 731 |         else:
 732 |             attr_name = 'simplified'
 733 | 
 734 |     result = ''.join([getattr(s, attr_name) for s in result_symbols])
 735 | 
 736 |     # if not use_zeros:
 737 |     #     result = result.strip(getattr(system.digits[0], attr_name))
 738 | 
 739 |     if alt_zero:
 740 |         result = result.replace(
 741 |             getattr(system.digits[0], attr_name), system.digits[0].alt_s)
 742 | 
 743 |     if alt_one:
 744 |         result = result.replace(
 745 |             getattr(system.digits[1], attr_name), system.digits[1].alt_s)
 746 | 
 747 |     for i, p in enumerate(POINT):
 748 |         if result.startswith(p):
 749 |             return CHINESE_DIGIS[0] + result
 750 | 
 751 |     # ^10, 11, .., 19
 752 |     if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
 753 |                                           SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
 754 |             result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
 755 |         result = result[1:]
 756 | 
 757 |     return result
 758 | 
 759 | 
 760 | # ================================================================================ #
 761 | #                          different types of rewriters
 762 | # ================================================================================ #
 763 | class Cardinal:
 764 |     """
 765 |     CARDINAL类
 766 |     """
 767 | 
 768 |     def __init__(self, cardinal=None, chntext=None):
 769 |         self.cardinal = cardinal
 770 |         self.chntext = chntext
 771 | 
 772 |     def chntext2cardinal(self):
 773 |         return chn2num(self.chntext)
 774 | 
 775 |     def cardinal2chntext(self):
 776 |         return num2chn(self.cardinal)
 777 | 
 778 | 
 779 | class Digit:
 780 |     """
 781 |     DIGIT类
 782 |     """
 783 | 
 784 |     def __init__(self, digit=None, chntext=None):
 785 |         self.digit = digit
 786 |         self.chntext = chntext
 787 | 
 788 |     # def chntext2digit(self):
 789 |     #     return chn2num(self.chntext)
 790 | 
 791 |     def digit2chntext(self):
 792 |         return num2chn(self.digit, alt_two=False, use_units=False)
 793 | 
 794 | 
 795 | class TelePhone:
 796 |     """
 797 |     TELEPHONE类
 798 |     """
 799 | 
 800 |     def __init__(self, telephone=None, raw_chntext=None, chntext=None):
 801 |         self.telephone = telephone
 802 |         self.raw_chntext = raw_chntext
 803 |         self.chntext = chntext
 804 | 
 805 |     # def chntext2telephone(self):
 806 |     #     sil_parts = self.raw_chntext.split('<SIL>')
 807 |     #     self.telephone = '-'.join([
 808 |     #         str(chn2num(p)) for p in sil_parts
 809 |     #     ])
 810 |     #     return self.telephone
 811 | 
 812 |     def telephone2chntext(self, fixed=False):
 813 | 
 814 |         if fixed:
 815 |             sil_parts = self.telephone.split('-')
 816 |             self.raw_chntext = '<SIL>'.join([
 817 |                 num2chn(part, alt_two=False, use_units=False) for part in sil_parts
 818 |             ])
 819 |             self.chntext = self.raw_chntext.replace('<SIL>', '')
 820 |         else:
 821 |             sp_parts = self.telephone.strip('+').split()
 822 |             self.raw_chntext = '<SP>'.join([
 823 |                 num2chn(part, alt_two=False, use_units=False) for part in sp_parts
 824 |             ])
 825 |             self.chntext = self.raw_chntext.replace('<SP>', '')
 826 |         return self.chntext
 827 | 
 828 | 
 829 | class Fraction:
 830 |     """
 831 |     FRACTION类
 832 |     """
 833 | 
 834 |     def __init__(self, fraction=None, chntext=None):
 835 |         self.fraction = fraction
 836 |         self.chntext = chntext
 837 | 
 838 |     def chntext2fraction(self):
 839 |         denominator, numerator = self.chntext.split('分之')
 840 |         return chn2num(numerator) + '/' + chn2num(denominator)
 841 | 
 842 |     def fraction2chntext(self):
 843 |         numerator, denominator = self.fraction.split('/')
 844 |         return num2chn(denominator) + '分之' + num2chn(numerator)
 845 | 
 846 | 
 847 | class Date:
 848 |     """
 849 |     DATE类
 850 |     """
 851 | 
 852 |     def __init__(self, date=None, chntext=None):
 853 |         self.date = date
 854 |         self.chntext = chntext
 855 | 
 856 |     # def chntext2date(self):
 857 |     #     chntext = self.chntext
 858 |     #     try:
 859 |     #         year, other = chntext.strip().split('年', maxsplit=1)
 860 |     #         year = Digit(chntext=year).digit2chntext() + '年'
 861 |     #     except ValueError:
 862 |     #         other = chntext
 863 |     #         year = ''
 864 |     #     if other:
 865 |     #         try:
 866 |     #             month, day = other.strip().split('月', maxsplit=1)
 867 |     #             month = Cardinal(chntext=month).chntext2cardinal() + '月'
 868 |     #         except ValueError:
 869 |     #             day = chntext
 870 |     #             month = ''
 871 |     #         if day:
 872 |     #             day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
 873 |     #     else:
 874 |     #         month = ''
 875 |     #         day = ''
 876 |     #     date = year + month + day
 877 |     #     self.date = date
 878 |     #     return self.date
 879 | 
 880 |     def date2chntext(self):
 881 |         date = self.date
 882 |         try:
 883 |             year, other = date.strip().split('年', 1)
 884 |             year = Digit(digit=year).digit2chntext() + '年'
 885 |         except ValueError:
 886 |             other = date
 887 |             year = ''
 888 |         if other:
 889 |             try:
 890 |                 month, day = other.strip().split('月', 1)
 891 |                 month = Cardinal(cardinal=month).cardinal2chntext() + '月'
 892 |             except ValueError:
 893 |                 day = date
 894 |                 month = ''
 895 |             if day:
 896 |                 day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
 897 |         else:
 898 |             month = ''
 899 |             day = ''
 900 |         chntext = year + month + day
 901 |         self.chntext = chntext
 902 |         return self.chntext
 903 | 
 904 | 
 905 | class Money:
 906 |     """
 907 |     MONEY类
 908 |     """
 909 | 
 910 |     def __init__(self, money=None, chntext=None):
 911 |         self.money = money
 912 |         self.chntext = chntext
 913 | 
 914 |     # def chntext2money(self):
 915 |     #     return self.money
 916 | 
 917 |     def money2chntext(self):
 918 |         money = self.money
 919 |         pattern = re.compile(r'(\d+(\.\d+)?)')
 920 |         matchers = pattern.findall(money)
 921 |         if matchers:
 922 |             for matcher in matchers:
 923 |                 money = money.replace(matcher[0], Cardinal(
 924 |                     cardinal=matcher[0]).cardinal2chntext())
 925 |         self.chntext = money
 926 |         return self.chntext
 927 | 
 928 | 
 929 | class Percentage:
 930 |     """
 931 |     PERCENTAGE类
 932 |     """
 933 | 
 934 |     def __init__(self, percentage=None, chntext=None):
 935 |         self.percentage = percentage
 936 |         self.chntext = chntext
 937 | 
 938 |     def chntext2percentage(self):
 939 |         return chn2num(self.chntext.strip().strip('百分之')) + '%'
 940 | 
 941 |     def percentage2chntext(self):
 942 |         return '百分之' + num2chn(self.percentage.strip().strip('%'))
 943 | 
 944 | 
 945 | def normalize_nsw(raw_text):
 946 |     text = '^' + raw_text + '$'
 947 | 
 948 |     # 规范化日期
 949 |     pattern = re.compile(
 950 |         r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
 951 |     matchers = pattern.findall(text)
 952 |     if matchers:
 953 |         # print('date')
 954 |         for matcher in matchers:
 955 |             text = text.replace(matcher[0], Date(
 956 |                 date=matcher[0]).date2chntext(), 1)
 957 | 
 958 |     # 规范化金钱
 959 |     pattern = re.compile(
 960 |         r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
 961 |     matchers = pattern.findall(text)
 962 |     if matchers:
 963 |         # print('money')
 964 |         for matcher in matchers:
 965 |             text = text.replace(matcher[0], Money(
 966 |                 money=matcher[0]).money2chntext(), 1)
 967 | 
 968 |     # 规范化固话/手机号码
 969 |     # 手机
 970 |     # http://www.jihaoba.com/news/show/13680
 971 |     # 移动：139、138、137、136、135、134、159、158、157、150、151、152、188、187、182、183、184、178、198
 972 |     # 联通：130、131、132、156、155、186、185、176
 973 |     # 电信：133、153、189、180、181、177
 974 |     pattern = re.compile(
 975 |         r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
 976 |     matchers = pattern.findall(text)
 977 |     if matchers:
 978 |         # print('telephone')
 979 |         for matcher in matchers:
 980 |             text = text.replace(matcher[0], TelePhone(
 981 |                 telephone=matcher[0]).telephone2chntext(), 1)
 982 |     # 固话
 983 |     pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
 984 |     matchers = pattern.findall(text)
 985 |     if matchers:
 986 |         # print('fixed telephone')
 987 |         for matcher in matchers:
 988 |             text = text.replace(matcher[0], TelePhone(
 989 |                 telephone=matcher[0]).telephone2chntext(fixed=True), 1)
 990 | 
 991 |     # 规范化分数
 992 |     pattern = re.compile(r"(\d+/\d+)")
 993 |     matchers = pattern.findall(text)
 994 |     if matchers:
 995 |         # print('fraction')
 996 |         for matcher in matchers:
 997 |             text = text.replace(matcher, Fraction(
 998 |                 fraction=matcher).fraction2chntext(), 1)
 999 | 
1000 |     # 规范化百分数
1001 |     text = text.replace('％', '%')
1002 |     pattern = re.compile(r"(\d+(\.\d+)?%)")
1003 |     matchers = pattern.findall(text)
1004 |     if matchers:
1005 |         # print('percentage')
1006 |         for matcher in matchers:
1007 |             text = text.replace(matcher[0], Percentage(
1008 |                 percentage=matcher[0]).percentage2chntext(), 1)
1009 | 
1010 |     # 规范化纯数+量词
1011 |     pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
1012 |     matchers = pattern.findall(text)
1013 |     if matchers:
1014 |         # print('cardinal+quantifier')
1015 |         for matcher in matchers:
1016 |             text = text.replace(matcher[0], Cardinal(
1017 |                 cardinal=matcher[0]).cardinal2chntext(), 1)
1018 | 
1019 |     # 规范化数字编号
1020 |     pattern = re.compile(r"(\d{4,32})")
1021 |     matchers = pattern.findall(text)
1022 |     if matchers:
1023 |         # print('digit')
1024 |         for matcher in matchers:
1025 |             text = text.replace(matcher, Digit(
1026 |                 digit=matcher).digit2chntext(), 1)
1027 | 
1028 |     # 规范化纯数
1029 |     pattern = re.compile(r"(\d+(\.\d+)?)")
1030 |     matchers = pattern.findall(text)
1031 |     if matchers:
1032 |         # print('cardinal')
1033 |         for matcher in matchers:
1034 |             text = text.replace(matcher[0], Cardinal(
1035 |                 cardinal=matcher[0]).cardinal2chntext(), 1)
1036 | 
1037 |     # restore P2P, O2O, B2C, B2B etc
1038 |     pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
1039 |     matchers = pattern.findall(text)
1040 |     if matchers:
1041 |         # print('particular')
1042 |         for matcher in matchers:
1043 |             text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1)
1044 | 
1045 |     return text.lstrip('^').rstrip('$')
1046 | 
1047 | 
1048 | def remove_erhua(text):
1049 |     """
1050 |     去除儿化音词中的儿:
1051 |     他女儿在那边儿 -> 他女儿在那边
1052 |     """
1053 | 
1054 |     new_str = ''
1055 |     while re.search('儿', text):
1056 |         a = re.search('儿', text).span()
1057 |         remove_er_flag = 0
1058 | 
1059 |         if ER_WHITELIST_PATTERN.search(text):
1060 |             b = ER_WHITELIST_PATTERN.search(text).span()
1061 |             if b[0] <= a[0]:
1062 |                 remove_er_flag = 1
1063 | 
1064 |         if remove_er_flag == 0:
1065 |             new_str = new_str + text[0:a[0]]
1066 |             text = text[a[1]:]
1067 |         else:
1068 |             new_str = new_str + text[0:b[1]]
1069 |             text = text[b[1]:]
1070 | 
1071 |     text = new_str + text
1072 |     return text
1073 | 
1074 | 
1075 | def remove_space(text):
1076 |     tokens = text.split()
1077 |     new = []
1078 |     for k, t in enumerate(tokens):
1079 |         if k != 0:
1080 |             if IN_EN_CHARS.get(tokens[k-1][-1]) and IN_EN_CHARS.get(t[0]):
1081 |                 new.append(' ')
1082 |         new.append(t)
1083 |     return ''.join(new)
1084 | 
1085 | 
1086 | class TextNorm:
1087 |     def __init__(self,
1088 |                  to_banjiao: bool = False,
1089 |                  to_upper: bool = False,
1090 |                  to_lower: bool = False,
1091 |                  remove_fillers: bool = False,
1092 |                  remove_erhua: bool = False,
1093 |                  check_chars: bool = False,
1094 |                  remove_space: bool = False,
1095 |                  cc_mode: str = '',
1096 |                  ):
1097 |         self.to_banjiao = to_banjiao
1098 |         self.to_upper = to_upper
1099 |         self.to_lower = to_lower
1100 |         self.remove_fillers = remove_fillers
1101 |         self.remove_erhua = remove_erhua
1102 |         self.check_chars = check_chars
1103 |         self.remove_space = remove_space
1104 | 
1105 |         self.cc = None
1106 |         if cc_mode:
1107 |             from opencc import OpenCC  # Open Chinese Convert: pip install opencc
1108 |             self.cc = OpenCC(cc_mode)
1109 | 
1110 |     def __call__(self, text):
1111 |         if self.cc:
1112 |             text = self.cc.convert(text)
1113 | 
1114 |         if self.to_banjiao:
1115 |             text = text.translate(QJ2BJ_TRANSFORM)
1116 | 
1117 |         if self.to_upper:
1118 |             text = text.upper()
1119 | 
1120 |         if self.to_lower:
1121 |             text = text.lower()
1122 | 
1123 |         if self.remove_fillers:
1124 |             for c in FILLER_CHARS:
1125 |                 text = text.replace(c, '')
1126 | 
1127 |         if self.remove_erhua:
1128 |             text = remove_erhua(text)
1129 | 
1130 |         text = normalize_nsw(text)
1131 | 
1132 |         # text = text.translate(PUNCS_TRANSFORM)
1133 | 
1134 |         if self.check_chars:
1135 |             for c in text:
1136 |                 if not IN_VALID_CHARS.get(c):
1137 |                     print(
1138 |                         f'WARNING: illegal char {c} in: {text}', file=sys.stderr)
1139 |                     return ''
1140 | 
1141 |         if self.remove_space:
1142 |             text = remove_space(text)
1143 | 
1144 |         return text
1145 | 
1146 | 
1147 | if __name__ == '__main__':
1148 |     p = argparse.ArgumentParser()
1149 | 
1150 |     # normalizer options
1151 |     p.add_argument('--to_banjiao', action='store_true',
1152 |                    help='convert quanjiao chars to banjiao')
1153 |     p.add_argument('--to_upper', action='store_true',
1154 |                    help='convert to upper case')
1155 |     p.add_argument('--to_lower', action='store_true',
1156 |                    help='convert to lower case')
1157 |     p.add_argument('--remove_fillers', action='store_true',
1158 |                    help='remove filler chars such as "呃, 啊"')
1159 |     p.add_argument('--remove_erhua', action='store_true',
1160 |                    help='remove erhua chars such as "他女儿在那边儿 -> 他女儿在那边"')
1161 |     p.add_argument('--check_chars', action='store_true',
1162 |                    help='skip sentences containing illegal chars')
1163 |     p.add_argument('--remove_space', action='store_true',
1164 |                    help='remove whitespace')
1165 |     p.add_argument('--cc_mode', choices=['', 't2s', 's2t'],
1166 |                    default='', help='convert between traditional to simplified')
1167 | 
1168 |     # I/O options
1169 |     p.add_argument('--log_interval', type=int, default=10000,
1170 |                    help='log interval in number of processed lines')
1171 |     p.add_argument('--has_key', action='store_true',
1172 |                    help="will be deprecated, set --format ark instead")
1173 |     p.add_argument('--format', type=str,
1174 |                    choices=['txt', 'ark', 'tsv'], default='txt', help='input format')
1175 |     p.add_argument('ifile', help='input filename, assume utf-8 encoding')
1176 |     p.add_argument('ofile', help='output filename')
1177 | 
1178 |     args = p.parse_args()
1179 | 
1180 |     if args.has_key:
1181 |         args.format = 'ark'
1182 | 
1183 |     normalizer = TextNorm(
1184 |         to_banjiao=args.to_banjiao,
1185 |         to_upper=args.to_upper,
1186 |         to_lower=args.to_lower,
1187 |         remove_fillers=args.remove_fillers,
1188 |         remove_erhua=args.remove_erhua,
1189 |         check_chars=args.check_chars,
1190 |         remove_space=args.remove_space,
1191 |         cc_mode=args.cc_mode,
1192 |     )
1193 | 
1194 |     ndone = 0
1195 |     with open(args.ifile, 'r', encoding='utf8') as istream, open(args.ofile, 'w+', encoding='utf8') as ostream:
1196 |         if args.format == 'tsv':
1197 |             reader = csv.DictReader(istream, delimiter='\t')
1198 |             assert ('TEXT' in reader.fieldnames)
1199 |             print('\t'.join(reader.fieldnames), file=ostream)
1200 | 
1201 |             for item in reader:
1202 |                 text = item['TEXT']
1203 | 
1204 |                 if text:
1205 |                     text = normalizer(text)
1206 | 
1207 |                 if text:
1208 |                     item['TEXT'] = text
1209 |                     print('\t'.join([item[f]
1210 |                           for f in reader.fieldnames]), file=ostream)
1211 | 
1212 |                 ndone += 1
1213 |                 if ndone % args.log_interval == 0:
1214 |                     print(f'text norm: {ndone} lines done.',
1215 |                           file=sys.stderr, flush=True)
1216 |         else:
1217 |             for l in istream:
1218 |                 key, text = '', ''
1219 |                 if args.format == 'ark':  # KALDI archive, line format: "key text"
1220 |                     cols = l.strip().split(maxsplit=1)
1221 |                     key, text = cols[0], cols[1] if len(cols) == 2 else ''
1222 |                 else:
1223 |                     text = l.strip()
1224 | 
1225 |                 if text:
1226 |                     text = normalizer(text)
1227 | 
1228 |                 if text:
1229 |                     if args.format == 'ark':
1230 |                         print(key + '\t' + text, file=ostream)
1231 |                     else:
1232 |                         print(text, file=ostream)
1233 | 
1234 |                 ndone += 1
1235 |                 if ndone % args.log_interval == 0:
1236 |                     print(f'text norm: {ndone} lines done.',
1237 |                           file=sys.stderr, flush=True)
1238 |     print(f'text norm: {ndone} lines done in total.',
1239 |           file=sys.stderr, flush=True)
1240 | 


--------------------------------------------------------------------------------
/youdub/demucs_vr.py:
--------------------------------------------------------------------------------
 1 | from demucs.api import Separator
 2 | import os
 3 | import numpy as np
 4 | from scipy.io import wavfile
 5 | 
 6 | class Demucs:
 7 |     def __init__(self, model="htdemucs_ft", device='cuda', progress=True, shifts=5) -> None:
 8 |         print(f'Loading Demucs model {model}...')
 9 |         self.separator = Separator(
10 |             model=model, device=device, progress=progress, shifts=shifts)
11 |         print('Demucs model loaded.')
12 | 
13 |     def inference(self, audio_path: str, output_folder: str) -> None:
14 |         print(f'Demucs separating {audio_path}...')
15 |         origin, separated = self.separator.separate_audio_file(audio_path)
16 |         print(f'Demucs separated {audio_path}.')
17 |         vocals = separated['vocals'].numpy().T
18 |         # vocals.to_file(os.path.join(output_folder, 'en_Vocals.wav'))
19 |         instruments = (separated['drums'] + separated['bass'] + separated['other']).numpy().T
20 |         
21 |         vocal_output_path = os.path.join(output_folder, 'en_Vocals.wav')
22 |         self.save_wav(vocals, vocal_output_path)
23 |         print(f'Demucs saved vocal to {vocal_output_path}.')
24 |         
25 |         instruments_output_path = os.path.join(output_folder, 'en_Instruments.wav')
26 |         self.save_wav(instruments, instruments_output_path)
27 |         print(f'Demucs saved instruments to {instruments_output_path}.')
28 | 
29 |     def save_wav(self, wav: np.ndarray, output_path:str):
30 |         # wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
31 |         wav_norm = wav * 32767
32 |         wavfile.write(output_path, 44100, wav_norm.astype(np.int16))
33 | 
34 | if __name__ == '__main__':
35 |     # demucs = Demucs(model='htdemucs_ft')
36 |     demucs = Demucs(model='hdemucs_mmi')
37 |     demucs.inference(r'output\TwoMinutePapers\10000 Of These Train ChatGPT In 4 Minutes\en.wav',
38 |                      r'output\TwoMinutePapers\10000 Of These Train ChatGPT In 4 Minutes')
39 | 


--------------------------------------------------------------------------------
/youdub/translation.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | import openai
  4 | from dotenv import load_dotenv
  5 | import time
  6 | 
  7 | load_dotenv()
  8 | openai.api_key = os.getenv('OPENAI_API_KEY')
  9 | openai.api_base = os.getenv('OPENAI_API_BASE', 'https://api.openai.com/v1')
 10 | 
 11 | system_message = \
 12 | """请你扮演科普专家的角色。这是一个为视频配音设计的翻译任务，将各种语言精准而优雅地转化为尽量简短的中文。请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。在这个过程中，请特别注意维护中文特有的语序和句式结构，使翻译文本既忠于原意又符合中文的表达习惯。
 13 | **注意事项**：
 14 | - 鼓励用自己的话重新诠释文本，避免逐字逐句的直译。采用意译而非直译的方式，用你的话语表达原文的精髓。
 15 | - 长句子可以分成多个短句子，便于观众理解。
 16 | - 保留专有名词的原文，如人名、地名、机构名等。
 17 | - 化学式用中文表示，例如CO2说二氧化碳，H2O说水。
 18 | - 数学公式用中文表示，例如x2或x^2或x²说x的平方，a+b说a加b。
 19 | - 严格遵循回答格式，将翻译结果放入"引号"中。
 20 | """   
 21 | 
 22 | caution = """请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。请特别注意维护中文特有的语序和句式结构，使翻译文本既忠于原意又符合中文的表达习惯。翻译尽量简短且正确。特别注意，数学公式用中文表示，例如x2或x^2说x的平方，a+b说a加b。"""
 23 | 
 24 | prefix = '中文：'
 25 | class Translator:
 26 |     def __init__(self):
 27 |         self.system_message = system_message
 28 |         self.messages = []
 29 |         
 30 |     def translate(self, transcript):
 31 |         print('总结中...')
 32 |         retry = 10
 33 |         while retry:
 34 |             try:
 35 |                 response = openai.ChatCompletion.create(
 36 |                     model="gpt-3.5-turbo",
 37 |                     messages=[{"role": "system", "content": '你是一个科普专家。你的目的是总结文本中的主要科学知识。'}] + [{"role": "user", "content": f"让我们深呼吸，一步一步地思考。你的总结应该是信息丰富且真实的，涵盖主题的最重要方面。概括这个视频的主要科学内容: {''.join(transcript)}。使用中文简单总结。"},], timeout=240)
 38 |                 summary = response.choices[0].message.content
 39 |                 print(summary)
 40 |                 retry = 0
 41 |             except Exception as e:
 42 |                 retry -= 1
 43 |                 print('总结失败')
 44 |                 print(e)
 45 |                 print('重新总结')
 46 |                 time.sleep(1)
 47 |                 if retry == 0:
 48 |                     raise Exception('总结失败')
 49 |         self.fixed_messages = [{'role': 'user', 'content': 'Hello!'}, {
 50 |             'role': 'assistant', 'content': f'{prefix}"你好！"'}, {'role': 'user', 'content': 'Animation videos explaining things with optimistic nihilism since 2,013.'}, {
 51 |             'role': 'assistant', 'content': f'{prefix}"从2013年开始，我们以乐观的虚无主义制作动画，进行科普。"'}]
 52 |         # self.fixed_messages = []
 53 |         self.messages = []
 54 |         final_result = []
 55 |         print('\n翻译中...')
 56 |         for sentence in transcript:
 57 |             if not sentence:
 58 |                 continue
 59 |             success = False
 60 |             retry_message = ''
 61 |             
 62 |             # print(messages)
 63 |             while not success:
 64 |                 messages = [{"role": "system", "content": summary + '\n' + self.system_message}] + self.fixed_messages + \
 65 |                     self.messages[-20:] + [{"role": "user",
 66 |                                             "content": f'{caution}{retry_message}\n请按照```"回答格式"```翻译成中文："{sentence}"\n'},]
 67 |                 try:
 68 |                     response = openai.ChatCompletion.create(
 69 |                     model="gpt-3.5-turbo",
 70 |                     messages=messages,
 71 |                     temperature=0.2,
 72 |                     timeout=60,
 73 |                     )
 74 |                     response = response.choices[0].message.content
 75 |                     matches = re.findall(r'"((.|\n)*?)"', response)
 76 |                     if matches:
 77 |                         result = matches[-1][0].replace("'", '"').strip()
 78 |                         result = re.sub(r'\（[^)]*\）', '', result)
 79 |                         result = result.replace('...', '，')
 80 |                         result = re.sub(r'(?<=\d),(?=\d)', '', result)
 81 |                         result = result.replace('²', '的平方').replace('————', '：').replace('——', '：')
 82 |                     else:
 83 |                         result = None
 84 |                         raise Exception('没有找到相应格式的翻译结果')
 85 |                     if result:
 86 |                         self.messages.append(
 87 |                             {'role': 'user', 'content': f"{sentence}"})
 88 |                         self.messages.append(
 89 |                             {'role': 'assistant', 'content': f'{prefix}"{result}"'})
 90 |                         print(sentence)
 91 |                         print(response)
 92 |                         print(f'最终结果：{result}')
 93 |                         print('='*50)
 94 |                         final_result.append(result)
 95 |                         success = True
 96 |                 except Exception as e:
 97 |                     print(response)
 98 |                     print(e)
 99 |                     print('翻译失败')
100 |                     retry_message += f'严格遵循回答格式，将翻译结果放入"引号"中，例如，请翻译成中文："Hello!"，你的回答应该是```{prefix}"你好！"```\n'
101 |                     time.sleep(0.5)
102 |         return final_result
103 | 
104 | 
105 |         
106 | if __name__ == '__main__':
107 |     import json
108 |     output_folder = r"output\z_Others\1hr_Talk_Intro_to_Large_Language_Models"
109 |     with open(os.path.join(output_folder, 'en.json'), 'r', encoding='utf-8') as f:
110 |         transcript = json.load(f)
111 |     transcript = [sentence['text'] for sentence in transcript if sentence['text']]
112 |     # transcript = ['毕达哥拉斯的公式是a2+b2=c2']
113 |     # transcript = ["Humans are apes with smartphones, living on a tiny moist rock, which is speeding around a burning sphere a million times bigger than itself.", "But our star is only one in billions in a milky way, which itself is only one in billions of galaxies.", "Everything around us is filled with complexity, but usually we don't notice, because being a human takes up a lot of time.", "So we try to explain the universe and our existence one video at a time.", "What is life? Are there aliens? What happens if you step on a black hole?", "If you want to find out, you should click here and subscribe to the Kurzgesagt In A Nutshell YouTube channel."]
114 |     print(transcript)
115 |     translator = Translator()
116 |     result = translator.translate(transcript)
117 |     print(result)
118 |     # translate_from_folder(r'output\Kurzgesagt Channel Trailer', translator)
119 |     
120 | 


--------------------------------------------------------------------------------
/youdub/translation_json.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | import openai
  4 | from dotenv import load_dotenv
  5 | import time
  6 | import json
  7 | 
  8 | load_dotenv()
  9 | openai.api_key = os.getenv('OPENAI_API_KEY')
 10 | openai.api_base = os.getenv('OPENAI_API_BASE', 'https://api.openai.com/v1')
 11 | 
 12 | system_message = \
 13 | """请你扮演科普专家的角色。这是一个为视频配音设计的翻译任务，将各种语言精准而优雅地转化为尽量简短的正确的翻译。请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。在这个过程中，请特别注意维护正确的翻译特有的语序和句式结构，使翻译文本既忠于原意又符合正确的翻译的表达习惯。
 14 | **注意事项**：
 15 | - 紧密关注上下文的逻辑关系，确保翻译的连贯性和准确性。
 16 | - 遵循正确的翻译的语序原则，即定语放在被修饰的名词前，状语放在谓语前，以保持正确的翻译的自然语感。
 17 | - 鼓励用自己的话重新诠释文本，避免逐字逐句的直译。采用意译而非直译的方式，用你的话语表达原文的精髓。
 18 | - 保留专有名词的原文，如人名、地名、机构名等。
 19 | - 化学式用正确的翻译表示，例如CO2说二氧化碳，H2O说水。
 20 | - 长句子可以分成多个短句子，便于观众理解。
 21 | - 使用正确的翻译字符。
 22 | - 严格遵循回答格式。
 23 | 回答格式：
 24 | ```json
 25 | {
 26 |     "原文": "重复需要翻译的内容",
 27 |     "分析与思考": "首先，对原文进行理解；其次分析上下文语境；然后根据正确的翻译的语序和句式结构，对原文进行修改；最后进行Sanity Check。",
 28 |     "正确的翻译": "最终经过修改后的正确的翻译。"
 29 | }
 30 | ```
 31 | """ 
 32 | caution = """请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。在这个过程中，请特别注意维护正确的翻译特有的语序和句式结构，使翻译文本既忠于原意又符合正确的翻译的表达习惯。"""
 33 | 
 34 | prefix = '正确的翻译：'
 35 | class Translator:
 36 |     def __init__(self):
 37 |         self.system_message = system_message
 38 |         self.messages = []
 39 |         
 40 |     def translate(self, transcript):
 41 |         print('总结中...')
 42 |         retry = 10
 43 |         while retry:
 44 |             try:
 45 |                 response = openai.ChatCompletion.create(
 46 |                     model="gpt-3.5-turbo",
 47 |                     messages=[{"role": "system", "content": '你是一个科普专家。你的目的是总结文本中的主要科学知识。'}] + [{"role": "user", "content": f"让我们深呼吸，一步一步地思考。你的总结应该是信息丰富且真实的，涵盖主题的最重要方面。概括这个视频的主要科学内容: {''.join(transcript)}。使用正确的翻译总结。"},], timeout=240)
 48 |                 summary = response.choices[0].message.content
 49 |                 print(summary)
 50 |                 retry = 0
 51 |             except Exception as e:
 52 |                 retry -= 1
 53 |                 print('总结失败')
 54 |                 print(e)
 55 |                 print('重新总结')
 56 |                 time.sleep(1)
 57 |                 if retry == 0:
 58 |                     raise Exception('总结失败')
 59 |         hello = {
 60 |             "原文": "Hello, this is kurzgesagt's YouTube Channel.",
 61 |             "分析与思考": "首先，这句话是一个简单的介绍，告诉观众这是kurzgesagt的YouTube频道。在正确的翻译中，我们通常会将'这是'放在句子的开头，然后是频道的名称，最后是'的YouTube频道'。",
 62 |             "正确的翻译": "你好，这是kurzgesagt的YouTube频道。"
 63 |         }
 64 |         intro = {
 65 |             "原文": "We started making animation videos explaining things with optimistic nihilism since 12,013.",
 66 |             "详细思考": "首先，这句话的主要意思是说，他们从12013年开始制作一种用乐观的虚无主义来解释事物的动画视频。但是，我们知道现在是2023年，还没有到12013年，所以这里的12013可能是一个错误。在正确的翻译的语序中，我们通常会将时间状语放在句首。",
 67 |             "正确的翻译": "自2013年以来，我们开始制作用乐观的虚无主义来解释事物的动画视频。"
 68 |         }
 69 | 
 70 |         self.fixed_messages = [{'role': 'user', 'content': '```json\n{"原文": "Hello, this is kurzgesagt\' YouTube Channel."}\n```'}, {
 71 |             'role': 'assistant', 'content': '```json' + json.dumps(hello, ensure_ascii=False)+'```'}, {'role': 'user', 'content': '```json\n{"原文": "We started making animation videos explaining things with optimistic nihilism since 2,013."}\n```'}, {
 72 |             'role': 'assistant', 'content': '```json'+ json.dumps(intro, ensure_ascii=False)+'```'}]
 73 |         # self.fixed_messages = []
 74 |         self.messages = []
 75 |         final_result = []
 76 |         print('\n翻译中...')
 77 |         for sentence in transcript:
 78 |             if not sentence:
 79 |                 continue
 80 |             success = False
 81 |             retry_message = ''
 82 |             prompt = {
 83 |                 "原文": sentence
 84 |             }
 85 |             # print(messages)
 86 |             while not success:
 87 |                 time.sleep(0.1)
 88 |                 messages = [{"role": "system", "content": summary + '\n' + self.system_message}] + self.fixed_messages + \
 89 |                     self.messages[-20:] + [{"role": "user",
 90 |                                             "content": retry_message + '```json' + json.dumps(prompt, ensure_ascii=False)+'```'},]
 91 |                 try:
 92 |                     response = openai.ChatCompletion.create(
 93 |                     model="gpt-3.5-turbo",
 94 |                     messages=messages,
 95 |                     temperature=0.2,
 96 |                     timeout=60,
 97 |                     )
 98 |                     response = response.choices[0].message.content
 99 |                     matches = re.findall(r'```json((.|\n)*?)```', response)
100 |                     if matches:
101 |                         result = matches[-1][0].strip()
102 |                         result = json.loads(result)
103 |                         if result['分析与思考'] and result['正确的翻译'] and result['原文'] == sentence:
104 |                             result = result['正确的翻译'].replace("'", '"')
105 |                             result = re.sub(r'\（[^)]*\）', '', result)
106 |                             result = result.replace('...', '，')
107 |                         else:
108 |                             result = None
109 |                             raise Exception('没有找到相应格式的正确的翻译')
110 |                     else:
111 |                         result = None
112 |                         raise Exception('没有找到相应格式的正确的翻译')
113 |                     if result:
114 |                         self.messages.append(
115 |                             {'role': 'user', 'content': '```json' + json.dumps(prompt, ensure_ascii=False)+'```'})
116 |                         self.messages.append(
117 |                             {'role': 'assistant', 'content': response})
118 |                         print(sentence)
119 |                         print(response)
120 |                         print('='*50)
121 |                         final_result.append(result)
122 |                         success = True
123 |                 except Exception as e:
124 |                     print(response)
125 |                     print(e)
126 |                     print('翻译失败')
127 |                     retry_message += """严格遵循回答格式，放在```json```中：```json
128 | {
129 |     "原文": "重复需要翻译的内容",
130 |     "分析与思考": "首先，对原文进行理解；其次分析上下文语境；然后根据正确的翻译的语序和句式结构，对原文进行修改；最后进行Sanity Check。",
131 |     "正确的翻译": "最终经过修改后的正确的翻译。"
132 | }
133 | ```"""
134 |                     time.sleep(0.5)
135 |                 finally:
136 |                     time.sleep(0.5)
137 |         return final_result
138 | 
139 | 
140 |         
141 | if __name__ == '__main__':
142 |     output_folder = r"output\Can You Upload Your Mind & Live Forever-"
143 |     with open(os.path.join(output_folder, 'en.json'), 'r', encoding='utf-8') as f:
144 |         transcript = json.load(f)
145 |     transcript = [sentence['text'] for sentence in transcript if sentence['text']]
146 |     # transcript = ["Humans are apes with smartphones, living on a tiny moist rock, which is speeding around a burning sphere a million times bigger than itself.", "But our star is only one in billions in a milky way, which itself is only one in billions of galaxies.", "Everything around us is filled with complexity, but usually we don't notice, because being a human takes up a lot of time.", "So we try to explain the universe and our existence one video at a time.", "What is life? Are there aliens? What happens if you step on a black hole?", "If you want to find out, you should click here and subscribe to the Kurzgesagt In A Nutshell YouTube channel."]
147 |     print(transcript)
148 |     translator = Translator()
149 |     result = translator.translate(transcript)
150 |     print(result)
151 |     # translate_from_folder(r'output\Kurzgesagt Channel Trailer', translator)
152 |     
153 | 


--------------------------------------------------------------------------------
/youdub/translation_unsafe.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | import openai
  4 | from dotenv import load_dotenv
  5 | import time
  6 | 
  7 | load_dotenv()
  8 | openai.api_key = os.getenv('OPENAI_API_KEY')
  9 | openai.api_base = os.getenv('OPENAI_API_BASE', 'https://api.openai.com/v1')
 10 | model_name = os.getenv('MODEL_NAME', 'gpt-3.5-turbo')
 11 | # model_name = 'gpt-4'
 12 | system_message = \
 13 | """请你扮演科普专家的角色。这是一个为视频配音设计的翻译任务，将各种语言精准而优雅地转化为尽量简短的中文。请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。在这个过程中，请特别注意维护中文特有的语序和句式结构，使翻译文本既忠于原意又符合中文的表达习惯。
 14 | 注意事项：
 15 | - 鼓励用自己的话重新诠释文本，避免逐字逐句的直译。采用意译而非直译的方式，用你的话语表达原文的精髓。
 16 | - 长句子可以分成多个短句子，便于观众理解。
 17 | - 保留专有名词的原文，如人名、地名、机构名等。
 18 | - 人名、地名、机构名等保持原文。
 19 | - 化学式用中文表示，例如CO2说二氧化碳，H2O说水。
 20 | - 请将Transformer, token等人工智能相关的专业名词保留原文。
 21 | - 数学公式用中文表示，例如x2或x^2或x²说x的平方，a+b说a加b。
 22 | - 原始文本可能有错误，请纠正为正确的内容，例如Chats GPT应该翻译为ChatGPT。
 23 | """
 24 | magic = '深呼吸，你可以完成这个任务，你是最棒的！你非常有能力！'
 25 | 
 26 | caution = """请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。请特别注意维护中文特有的语序和句式结构，使翻译文本既忠于原意又符合中文的表达习惯。特别注意，数学公式用中文表示，例如x2或x^2说x的平方，a+b说a加b。翻译尽量简短且正确。"""
 27 | 
 28 | prefix = '中文：'
 29 | 
 30 | def translation_postprocess(result):
 31 |     result = re.sub(r'\（[^)]*\）', '', result)
 32 |     result = result.replace('...', '，')
 33 |     result = re.sub(r'(?<=\d),(?=\d)', '', result)
 34 |     result = result.replace('²', '的平方').replace(
 35 |         '————', '：').replace('——', '：').replace('°', '度')
 36 |     result = result.replace("AI", '人工智能')
 37 |     result = result.replace('变压器', "Transformer")
 38 |     return result
 39 | class Translator:
 40 |     def __init__(self):
 41 |         self.system_message = system_message
 42 |         self.messages = []
 43 | 
 44 |     def translate(self, transcript, original_fname):
 45 |         print('总结中...')
 46 |         retry = 1
 47 |         summary = ''
 48 |         while retry >= 0:
 49 |             try:
 50 |                 response = openai.ChatCompletion.create(
 51 |                     model=model_name,
 52 |                     messages=[{"role": "system", "content": f'你是一个科普专家。你的目的是总结文本中的主要科学知识。{magic}！'}] + [{"role": "user", "content": f"。简要概括这个视频的主要内容。\n标题：{original_fname}\n内容：{''.join(transcript)}\n标题：{original_fname}\n请你用中文给视频写一个“标题”、“主要内容”和“专业名词”，谢谢。"},], timeout=240)
 53 |                 summary = response.choices[0].message.content
 54 |                 print(summary)
 55 |                 retry = -1
 56 |             except Exception as e:
 57 |                 retry -= 1
 58 |                 print('总结失败')
 59 |                 print(e)
 60 |                 print('重新总结')
 61 |                 time.sleep(1)
 62 |                 if retry == 0:
 63 |                     print('总结失败')
 64 |         
 65 |         self.fixed_messages = [{'role': 'user', 'content': '请翻译：Hello!'}, {
 66 |             'role': 'assistant', 'content': f'“你好！”'}, {'role': 'user', 'content': '请翻译：Animation videos explaining things with optimistic nihilism since 2,013.'}, {
 67 |             'role': 'assistant', 'content': f'“从2013年开始，我们以乐观的虚无主义制作动画，进行科普。”'}]
 68 |         # self.fixed_messages = []
 69 |         self.messages = []
 70 |         final_result = []
 71 |         print('\n翻译中...')
 72 |         for sentence in transcript:
 73 |             if not sentence:
 74 |                 continue
 75 |             retry = 20
 76 |             retry_message = ''
 77 | 
 78 |             # print(messages)
 79 |             # [{"role": "system", "content": summary + '\n' + self.system_message}] + self.fixed_messages + \
 80 |             history = " ".join(final_result[-30:])
 81 |             while retry > 0:
 82 |                 retry -= 1
 83 |                 messages = [
 84 |                     {"role": "system", "content": f'请你扮演科普专家的角色。这是一个为视频配音设计的翻译任务，将各种语言精准而优雅地转化为尽量简短的中文。请在翻译时避免生硬的直译，而是追求自然流畅、贴近原文而又不失文学韵味的表达。在这个过程中，请特别注意维护中文特有的语序和句式结构，使翻译文本既忠于原意又符合中文的表达习惯。{magic}'}] + self.fixed_messages + [{"role": "user", "content": f'{summary}\n{self.system_message}\n请将Transformer, token等人工智能相关的专业名词保留原文。长句分成几个短句。\n历史内容：\n{history}\n以上为参考的历史内容。\n{retry_message}\n深呼吸，请正确翻译这句英文:“{sentence}”翻译成简洁中文。'},]
 85 |                 try:
 86 |                     response = openai.ChatCompletion.create(
 87 |                         model=model_name,
 88 |                         messages=messages,
 89 |                         temperature=0.3,
 90 |                         timeout=60,
 91 |                     )
 92 |                     response = response.choices[0].message.content
 93 |                     result = response.strip()
 94 |                     if retry != 0:
 95 |                         if '\n' in result:
 96 |                             retry_message += '无视前面的内容，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
 97 |                             raise Exception('存在换行')
 98 |                         if '翻译' in result:
 99 |                             retry_message += '无视前面的内容，请不要出现“翻译”字样，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
100 |                             raise Exception('存在"翻译"字样')
101 |                         if '这句话的意思是' in result:
102 |                             retry_message += '无视前面的内容，请不要出现“这句话的意思是”字样，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
103 |                             raise Exception('存在"这句话的意思是"字样')
104 |                         if '这句话的意译是' in result:
105 |                             retry_message += '无视前面的内容，请不要出现“这句话的意译是”字样，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
106 |                             raise Exception('存在"这句话的意译是"字样')
107 |                         if '这句' in result:
108 |                             retry_message += '无视前面的内容，请不要出现“这句话”字样，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
109 |                             raise Exception('存在"这句"字样')
110 |                         if '深呼吸' in result:
111 |                             retry_message += '无视前面的内容，请不要出现“深呼吸”字样，仅仅只翻译下面的英文，请简短翻译，只输出翻译结果。'
112 |                             raise Exception('存在"深呼吸"字样')
113 |                         if (result.startswith('“') and result.endswith('”')) or (result.startswith('"') and result.endswith('"')):
114 |                             result = result[1:-1]
115 |                         if len(sentence) <= 10:
116 |                             if len(result) > 20:
117 |                                 retry_message += '注意：仅仅只翻译下面的内容，请简短翻译，只输出翻译结果。'
118 |                                 raise Exception('翻译过长')
119 |                         elif len(result) > len(sentence)*0.75:
120 |                             retry_message += '注意：仅仅只翻译下面的内容，请简短翻译，只输出翻译结果。'
121 |                             raise Exception('翻译过长')
122 |                     result = translation_postprocess(result)
123 |                     
124 |                     if result:
125 |                         self.messages.append(
126 |                             {'role': 'user', 'content': f"{sentence}"})
127 |                         self.messages.append(
128 |                             {'role': 'assistant', 'content': f'{result}'})
129 |                         print(sentence)
130 |                         print(response)
131 |                         print(f'最终结果：{result}')
132 |                         print('='*50)
133 |                         final_result.append(result)
134 |                         retry = 0
135 |                 except Exception as e:
136 |                     print(sentence)
137 |                     print(response)
138 |                     print(e)
139 |                     print('翻译失败')
140 |                     retry_message += f''
141 |                     time.sleep(0.5)
142 |         return final_result, summary
143 | 
144 | 
145 | if __name__ == '__main__':
146 |     import json
147 |     output_folder = r"output\test\Blood concrete and dynamite Building the Hoover Dam Alex Gendler"
148 |     with open(os.path.join(output_folder, 'en.json'), 'r', encoding='utf-8') as f:
149 |         transcript = json.load(f)
150 |     transcript = [sentence['text']
151 |                   for sentence in transcript if sentence['text']]
152 |     # transcript = ['毕达哥拉斯的公式是a2+b2=c2']
153 |     # transcript = ["Humans are apes with smartphones, living on a tiny moist rock, which is speeding around a burning sphere a million times bigger than itself.", "But our star is only one in billions in a milky way, which itself is only one in billions of galaxies.", "Everything around us is filled with complexity, but usually we don't notice, because being a human takes up a lot of time.", "So we try to explain the universe and our existence one video at a time.", "What is life? Are there aliens? What happens if you step on a black hole?", "If you want to find out, you should click here and subscribe to the Kurzgesagt In A Nutshell YouTube channel."]
154 |     print(transcript)
155 |     translator = Translator()
156 |     result = translator.translate(
157 |         transcript, original_fname='Blood concrete and dynamite Building the Hoover Dam Alex Gendler')
158 |     print(result)
159 |     with open(os.path.join(output_folder, 'zh.json'), 'w', encoding='utf-8') as f:
160 |         json.dump(result, f, ensure_ascii=False, indent=4)
161 |     # translate_from_folder(r'output\Kurzgesagt Channel Trailer', translator)
162 | 


--------------------------------------------------------------------------------
/youdub/tts_bytedance.py:
--------------------------------------------------------------------------------
  1 | # coding=utf-8
  2 | from datetime import datetime
  3 | import librosa
  4 | from requests.exceptions import ConnectionError, Timeout, RequestException
  5 | import os
  6 | import sys
  7 | import time
  8 | 
  9 | import numpy as np
 10 | sys.path.append(os.getcwd())
 11 | from youdub.utils import save_wav, adjust_audio_length, split_text, tts_preprocess_text
 12 | import logging
 13 | from loguru import logger
 14 | import requests
 15 | import uuid
 16 | import os
 17 | import json
 18 | import re
 19 | 
 20 | import base64
 21 | from dotenv import load_dotenv
 22 | load_dotenv()
 23 | 
 24 | # [700, 705, 701, 001, 406, 407, 002, 701, 123, 120, 119, 115, 107, 100, 104, 004, 113, 102, 405]
 25 |     
 26 | class TTS_Clone:
 27 |     def __init__(self):
 28 |         self.appid = os.getenv('APPID')
 29 |         self.access_token = os.getenv('ACCESS_TOKEN')
 30 |         self.cluster = "volcano_tts"
 31 |         self.host = "openspeech.bytedance.com"
 32 |         self.api_url = f"https://{self.host}/api/v1/tts"
 33 |         self.header = {"Authorization": f"Bearer;{self.access_token}"}
 34 |         self.request_json = {
 35 |             "app": {
 36 |                 "appid": self.appid,
 37 |                 "token": "access_token",
 38 |                 "cluster": self.cluster
 39 |             },
 40 |             "user": {
 41 |                 "uid": "388808087185088"
 42 |             },
 43 |             "audio": {
 44 |                 "voice_type": '',
 45 |                 "encoding": "wav",
 46 |                 "speed_ratio": 1,
 47 |                 "volume_ratio": 1.0,
 48 |                 "pitch_ratio": 1.0,
 49 |             },
 50 |             "request": {
 51 |                 "reqid": str(uuid.uuid4()),
 52 |                 "text": "字节跳动语音合成",
 53 |                 "text_type": "plain",
 54 |                 "operation": "query",
 55 |                 "with_frontend": 1,
 56 |                 "frontend_type": "unitTson"
 57 | 
 58 |             }
 59 |         }
 60 |         self.output_path = r'.'
 61 |         if not os.path.exists(self.output_path):
 62 |             os.mkdir(self.output_path)
 63 | 
 64 |     def inference(self, text, output_wav_path, speaker='SPEAKER_00', speaker_to_voice_type={'SPEAKER_00': 'BV701_streaming'}):
 65 |         self.request_json['request']['text'] = text
 66 |         self.request_json['request']['reqid'] = str(uuid.uuid4())
 67 |         self.request_json['audio']['voice_type'] = speaker_to_voice_type.get(
 68 |                 speaker, 'BV701_streaming')
 69 |         max_retries = 5
 70 |         timeout_seconds = 10  # Set your desired timeout in seconds
 71 | 
 72 |         for attempt in range(max_retries):
 73 |             try:
 74 |                 resp = requests.post(self.api_url, json.dumps(
 75 |                     self.request_json), headers=self.header, timeout=timeout_seconds)
 76 |                 if resp.status_code == 200:
 77 |                     data = resp.json()["data"]
 78 |                     data = base64.b64decode(data)
 79 |                     with open(output_wav_path, "wb") as f:
 80 |                         f.write(data)
 81 |                     print(f'{output_wav_path}: {text}')
 82 |                     return np.frombuffer(data, dtype=np.int16)
 83 |                 else:
 84 |                     print(f"Request failed with status code: {resp.status_code}")
 85 |                     if resp.status_code == 500:
 86 |                         return None
 87 |                     raise Exception(f"Request failed with status code: {resp.status_code}")
 88 |             except Exception as e:
 89 |                 print(f"Request failed: {e}, retrying ({attempt+1}/{max_retries})")
 90 |                 time.sleep(2)  # Wait 2 seconds before retrying
 91 | 
 92 |         print("Max retries reached, request failed")
 93 |         return None
 94 | 
 95 | 
 96 | def audio_process_folder(folder, tts: TTS_Clone, speaker_to_voice_type, vocal_only=False):
 97 |     logging.info(f'TTS processing folder {folder}...')
 98 |     logging.info(f'speaker_to_voice_type: {speaker_to_voice_type}')
 99 |     with open(os.path.join(folder, 'zh.json'), 'r', encoding='utf-8') as f:
100 |         transcript = json.load(f)
101 |     full_wav = np.zeros((0,))
102 |     if not os.path.exists(os.path.join(folder, 'temp')):
103 |         os.makedirs(os.path.join(folder, 'temp'))
104 | 
105 |     for i, line in enumerate(transcript):
106 |         text = line['text']
107 |         # start = line['start']
108 |         start = line['start']
109 |         last_end = len(full_wav)/24000
110 |         if start > last_end:
111 |             full_wav = np.concatenate(
112 |                 (full_wav, np.zeros((int(24000 * (start - last_end)),))))
113 |         start = len(full_wav)/24000
114 |         line['start'] = start
115 |         end = line['end']
116 |         if os.path.exists(os.path.join(folder, 'temp', f'zh_{str(i).zfill(3)}.wav')):
117 |             wav = librosa.load(os.path.join(
118 |                 folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), sr=24000)[0]
119 |         else:
120 |             wav = tts.inference(tts_preprocess_text(text), os.path.join(
121 |                 folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), speaker=line.get('speaker', 'SPEAKER_00'), speaker_to_voice_type=speaker_to_voice_type)
122 |             time.sleep(0.1)
123 |         # save_wav(wav, )
124 |         wav_adjusted, adjusted_length = adjust_audio_length(wav, os.path.join(folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), os.path.join(
125 |             folder, 'temp',  f'zh_{str(i).zfill(3)}_adjusted.wav'), end - start)
126 | 
127 |         wav_adjusted /= wav_adjusted.max()
128 |         line['end'] = line['start'] + adjusted_length
129 |         full_wav = np.concatenate(
130 |             (full_wav, wav_adjusted))
131 |     # load os.path.join(folder, 'en_Instruments.wav')
132 |     # combine with full_wav (the length of the two audio might not be equal)
133 |     transcript = split_text(transcript, punctuations=[
134 |                             '，', '；', '：', '。', '？', '！', '\n', '”'])
135 |     with open(os.path.join(folder, 'transcript.json'), 'w', encoding='utf-8') as f:
136 |         json.dump(transcript, f, ensure_ascii=False, indent=4)
137 |     instruments_wav, sr = librosa.load(
138 |         os.path.join(folder, 'en_Instruments.wav'), sr=24000)
139 | 
140 |     len_full_wav = len(full_wav)
141 |     len_instruments_wav = len(instruments_wav)
142 | 
143 |     if len_full_wav > len_instruments_wav:
144 |         # 如果 full_wav 更长，将 instruments_wav 延伸到相同长度
145 |         instruments_wav = np.pad(
146 |             instruments_wav, (0, len_full_wav - len_instruments_wav), mode='constant')
147 |     elif len_instruments_wav > len_full_wav:
148 |         # 如果 instruments_wav 更长，将 full_wav 延伸到相同长度
149 |         full_wav = np.pad(
150 |             full_wav, (0, len_instruments_wav - len_full_wav), mode='constant')
151 |     # 合并两个音频
152 |     full_wav /= np.max(np.abs(full_wav))
153 |     save_wav(full_wav, os.path.join(folder, f'zh_Vocals.wav'))
154 |     # instruments_wav /= np.max(np.abs(instruments_wav))
155 |     instrument_coefficient = 1
156 |     if vocal_only:
157 |         instrument_coefficient = 0
158 |     combined_wav = full_wav + instruments_wav*instrument_coefficient
159 |     combined_wav /= np.max(np.abs(combined_wav))
160 |     save_wav(combined_wav, os.path.join(folder, f'zh.wav'))
161 | 
162 | 
163 | if __name__ == '__main__':
164 |     tts = TTS_Clone()
165 |     # process_folder(
166 |     #     r'output\test\Blood concrete and dynamite Building the Hoover Dam Alex Gendler', tts)
167 |     from tqdm import tqdm
168 |     voice_type_folder = r'voice_type'
169 |     voice_type_lst = []
170 |     for fname in os.listdir(voice_type_folder):
171 |         voice_type_lst.append(fname.split('.')[0])
172 |     for voice_type in tqdm(voice_type_lst):
173 |         # voice_type = f'BV{str(i).zfill(3)}_streaming'
174 |         output_wav = f'voice_type/{voice_type}.wav'
175 |         # if os.path.exists(output_wav):
176 |         #     continue
177 |         try:
178 |             tts.inference(
179 |                 'YouDub 是一个创新的开源工具，专注于将 YouTube 等平台的优质视频翻译和配音为中文版本。此工具融合了先进的 AI 技术，包括语音识别、大型语言模型翻译以及 AI 声音克隆技术，为中文用户提供具有原始 YouTuber 音色的中文配音视频。更多示例和信息，欢迎访问我的bilibili视频主页。你也可以加入我们的微信群，扫描下方的二维码即可。', output_wav, voice_type=voice_type)
180 |         except:
181 |             print(f'voice {voice_type} failed.')
182 |         time.sleep(0.1)
183 | 
184 | 


--------------------------------------------------------------------------------
/youdub/tts_paddle.py:
--------------------------------------------------------------------------------
 1 | 
 2 | import os, sys
 3 | sys.path.append(os.getcwd())
 4 | from paddlespeech.cli.tts import TTSExecutor
 5 | import numpy as np
 6 | import json
 7 | import logging
 8 | 
 9 | from youdub.utils import save_wav, adjust_audio_length
10 | 
11 | 
12 | 
13 | class TTS_Clone:
14 |     def __init__(self, model_path="fastspeech2_male", voc='pwgan_male',device='gpu:0', language='mix'):
15 |         logging.info(f'Loading TTS model {model_path}...')
16 |         self.am = model_path
17 |         self.voc = voc
18 |         self.tts = TTSExecutor()
19 |         self.language = language
20 |         logging.info('Model TTS loaded.')
21 |         
22 |     def inference(self, text, output) -> np.ndarray:
23 |         self.tts(
24 |             text=text,
25 |             am=self.am,
26 |             voc=self.voc,
27 |             lang=self.language,
28 |             output=output,
29 |             use_onnx=True)
30 |         print(f'{output}: {text}')
31 |         
32 |         return self.tts._outputs['wav']
33 | 
34 | def process_folder(folder, tts: TTS_Clone):
35 |     logging.info(f'TTS processing folder {folder}...')
36 |     with open(os.path.join(folder, 'zh.json'), 'r', encoding='utf-8') as f:
37 |         transcript = json.load(f)
38 |     full_wav = []
39 |     if not os.path.exists(os.path.join(folder, 'temp')):
40 |         os.makedirs(os.path.join(folder, 'temp'))
41 |     
42 |     previous_end = 0
43 |     for i, line in enumerate(transcript):
44 |         text = line['text']
45 |         start = line['start']
46 |         end = line['end']
47 |         
48 |         wav = tts.inference(text, os.path.join(folder, 'temp', f'zh_{i}.wav'))
49 |         wav_adjusted = adjust_audio_length(wav, os.path.join(folder, 'temp', f'zh_{i}.wav'), os.path.join(
50 |             folder, 'temp',  f'zh_{i}_adjusted.wav'), end - start)
51 |         length = len(wav_adjusted)/24000
52 |         end = start + length
53 |         if start > previous_end:
54 |             full_wav.append(np.zeros(( int(24000 * (start - previous_end)),)))
55 |         full_wav.append(wav_adjusted)
56 |         previous_end = end
57 |     full_wav = np.concatenate(full_wav)
58 |     save_wav(full_wav, os.path.join(folder, f'zh.wav'))
59 |         
60 | 
61 | if __name__ == '__main__':
62 |     tts = TTS_Clone()
63 |     process_folder(r'output\Why_The_War_on_Drugs_Is_a_Huge_Failure', tts)
64 |     
65 |     
66 | 


--------------------------------------------------------------------------------
/youdub/tts_xttsv2.py:
--------------------------------------------------------------------------------
  1 | 
  2 | import os, sys
  3 | import time
  4 | sys.path.append(os.getcwd())
  5 | import re
  6 | from TTS.api import TTS
  7 | import librosa
  8 | from tqdm import tqdm
  9 | import numpy as np
 10 | import json
 11 | import logging
 12 | from youdub.utils import save_wav, adjust_audio_length, split_text, tts_preprocess_text
 13 | from youdub.cn_tx import TextNorm
 14 | # Get device
 15 | # import torch
 16 | # device = 'cuda' if torch.cuda.is_available() else 'cpu'
 17 | device = 'cuda'
 18 | 
 19 | class TTS_Clone:
 20 |     def __init__(self, model_path="tts_models/multilingual/multi-dataset/xtts_v2", device='cuda', language='zh-cn'):
 21 |         logging.info(f'Loading TTS model {model_path}...')
 22 |         self.tts = TTS(model_path).to(device)
 23 |         self.language = language
 24 |         logging.info('Model TTS loaded.')
 25 |         
 26 |     def inference(self, text, output_path, speaker_wav) -> np.ndarray:
 27 |         wav = self.tts.tts(
 28 |                 text=text, speaker_wav=speaker_wav, language=self.language)
 29 |         wav = np.array(wav)
 30 |         save_wav(wav, output_path)
 31 |         # wav /= np.max(np.abs(wav))
 32 |         return wav
 33 | 
 34 | 
 35 | 
 36 | 
 37 | 
 38 | def audio_process_folder(folder, tts: TTS_Clone, speaker_to_voice_type=None, vocal_only=False):
 39 |     logging.info(f'TTS processing folder {folder}...')
 40 |     logging.info(f'speaker_to_voice_type: {speaker_to_voice_type}')
 41 |     with open(os.path.join(folder, 'zh.json'), 'r', encoding='utf-8') as f:
 42 |         transcript = json.load(f)
 43 |     full_wav = np.zeros((0,))
 44 |     if not os.path.exists(os.path.join(folder, 'temp')):
 45 |         os.makedirs(os.path.join(folder, 'temp'))
 46 | 
 47 |     for i, line in enumerate(transcript):
 48 |         text = line['text']
 49 |         # start = line['start']
 50 |         start = line['start']
 51 |         last_end = len(full_wav)/24000
 52 |         if start > last_end:
 53 |             full_wav = np.concatenate(
 54 |                 (full_wav, np.zeros((int(24000 * (start - last_end)),))))
 55 |         start = len(full_wav)/24000
 56 |         line['start'] = start
 57 |         end = line['end']
 58 |         if os.path.exists(os.path.join(folder, 'temp', f'zh_{str(i).zfill(3)}.wav')):
 59 |             wav = librosa.load(os.path.join(
 60 |                 folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), sr=24000)[0]
 61 |         else:
 62 |             speaker = line.get('speaker', 'SPEAKER_00')
 63 |             speaker_wav = os.path.join(folder, 'SPEAKER', f'{speaker}.wav')
 64 |             wav = tts.inference(tts_preprocess_text(text), os.path.join(
 65 |                 folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), speaker_wav)
 66 |             time.sleep(0.1)
 67 |         # save_wav(wav, )
 68 |         wav_adjusted, adjusted_length = adjust_audio_length(wav, os.path.join(folder, 'temp', f'zh_{str(i).zfill(3)}.wav'), os.path.join(
 69 |             folder, 'temp',  f'zh_{str(i).zfill(3)}_adjusted.wav'), end - start)
 70 | 
 71 |         wav_adjusted /= wav_adjusted.max()
 72 |         line['end'] = line['start'] + adjusted_length
 73 |         full_wav = np.concatenate(
 74 |             (full_wav, wav_adjusted))
 75 |     # load os.path.join(folder, 'en_Instruments.wav')
 76 |     # combine with full_wav (the length of the two audio might not be equal)
 77 |     transcript = split_text(transcript, punctuations=[
 78 |                             '，', '；', '：', '。', '？', '！', '\n','”'])
 79 |     with open(os.path.join(folder, 'transcript.json'), 'w', encoding='utf-8') as f:
 80 |         json.dump(transcript, f, ensure_ascii=False, indent=4)
 81 |     instruments_wav, sr = librosa.load(
 82 |         os.path.join(folder, 'en_Instruments.wav'), sr=24000)
 83 | 
 84 |     len_full_wav = len(full_wav)
 85 |     len_instruments_wav = len(instruments_wav)
 86 | 
 87 |     if len_full_wav > len_instruments_wav:
 88 |         # 如果 full_wav 更长，将 instruments_wav 延伸到相同长度
 89 |         instruments_wav = np.pad(
 90 |             instruments_wav, (0, len_full_wav - len_instruments_wav), mode='constant')
 91 |     elif len_instruments_wav > len_full_wav:
 92 |         # 如果 instruments_wav 更长，将 full_wav 延伸到相同长度
 93 |         full_wav = np.pad(
 94 |             full_wav, (0, len_instruments_wav - len_full_wav), mode='constant')
 95 |     # 合并两个音频
 96 |     full_wav /= np.max(np.abs(full_wav))
 97 |     save_wav(full_wav, os.path.join(folder, f'zh_Vocals.wav'))
 98 |     # instruments_wav /= np.max(np.abs(instruments_wav))
 99 |     instrument_coefficient = 1
100 |     if vocal_only:
101 |         instrument_coefficient = 0
102 |     combined_wav = full_wav + instruments_wav*instrument_coefficient
103 |     combined_wav /= np.max(np.abs(combined_wav))
104 |     save_wav(combined_wav, os.path.join(folder, f'zh.wav'))
105 |         
106 | 
107 | if __name__ == '__main__':
108 |     folder = r'output\test\Elon Musk on Sam Altman and ChatGPT I am the reason OpenAI exists'
109 |     tts = TTS_Clone("tts_models/multilingual/multi-dataset/xtts_v2", language='zh-cn')
110 |     audio_process_folder(folder, tts)
111 |     
112 |     
113 | 


--------------------------------------------------------------------------------
/youdub/utils.py:
--------------------------------------------------------------------------------
  1 | import re
  2 | import numpy as np
  3 | import librosa
  4 | from audiostretchy.stretch import stretch_audio
  5 | from scipy.io import wavfile
  6 | from youdub.cn_tx import TextNorm
  7 | normalizer = TextNorm()
  8 | 
  9 | def tts_preprocess_text(text):
 10 |     # 使用正则表达式查找所有的大写字母，并在它们前面加上空格
 11 |     # 正则表达式说明：(?<!^) 表示如果不是字符串开头，则匹配，[A-Z] 匹配任何大写字母
 12 |     text = text.replace('AI', '人工智能')
 13 |     text = re.sub(r'(?<!^)([A-Z])', r' \1', text)
 14 |     text = normalizer(text)
 15 |     # 使用正则表达式在字母和数字之间插入空格
 16 |     text = re.sub(r'(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])', ' ', text)
 17 |     return text
 18 | 
 19 | def split_text(input_data,
 20 |                punctuations=['。', '？', '！', '\n', "”"]):
 21 |     # Chinese punctuation marks for sentence ending
 22 | 
 23 |     # Function to check if a character is a Chinese ending punctuation
 24 |     def is_punctuation(char):
 25 |         return char in punctuations
 26 | 
 27 |     # Process each item in the input data
 28 |     output_data = []
 29 |     for item in input_data:
 30 |         start = item["start"]
 31 |         text = item["text"]
 32 |         speaker = item.get("speaker", "SPEAKER_00")
 33 |         sentence_start = 0
 34 | 
 35 |         # Calculate the duration for each character
 36 |         duration_per_char = (item["end"] - item["start"]) / len(text)
 37 |         for i, char in enumerate(text):
 38 |             # If the character is a punctuation, split the sentence
 39 |             if not is_punctuation(char) and i != len(text) - 1:
 40 |                 continue
 41 |             if i - sentence_start < 5 and i != len(text) - 1:
 42 |                 continue
 43 |             if i < len(text) - 1 and is_punctuation(text[i+1]):
 44 |                 continue
 45 |             sentence = text[sentence_start:i+1]
 46 |             sentence_end = start + duration_per_char * len(sentence)
 47 | 
 48 |             # Append the new item
 49 |             output_data.append({
 50 |                 "start": round(start, 3),
 51 |                 "end": round(sentence_end, 3),
 52 |                 "text": sentence,
 53 |                 "speaker": speaker
 54 |             })
 55 | 
 56 |             # Update the start for the next sentence
 57 |             start = sentence_end
 58 |             sentence_start = i + 1
 59 | 
 60 |     return output_data
 61 | 
 62 | def adjust_audio_length(wav, src_path, dst_path,  desired_length: float, sample_rate: int = 24000) -> np.ndarray:
 63 |     """Adjust the length of the audio.
 64 | 
 65 |     Args:
 66 |         wav (np.ndarray): Original waveform.
 67 |         sample_rate (int): Sampling rate of the audio.
 68 |         desired_length (float): Desired length of the audio in seconds.
 69 | 
 70 |     Returns:
 71 |         np.ndarray: Waveform with adjusted length.
 72 |     """
 73 |     current_length = wav.shape[0] / sample_rate
 74 |     speed_factor = max(min(desired_length / current_length, 1.1), 2/3)
 75 |     desired_length = current_length * speed_factor
 76 |     stretch_audio(src_path, dst_path, ratio=speed_factor,
 77 |                   sample_rate=sample_rate)
 78 |     y, sr = librosa.load(dst_path, sr=sample_rate)
 79 |     return y[:int(desired_length * sr)], desired_length
 80 | 
 81 | 
 82 | def save_wav(wav: np.ndarray, path: str, sample_rate: int = 24000) -> None:
 83 |     """Save float waveform to a file using Scipy.
 84 | 
 85 |     Args:
 86 |         wav (np.ndarray): Waveform with float values in range [-1, 1] to save.
 87 |         path (str): Path to a output file.
 88 |         sample_rate (int, optional): Sampling rate used for saving to the file. Defaults to 24000.
 89 |     """
 90 |     # wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
 91 |     wav_norm = wav * 32767
 92 |     wavfile.write(path, sample_rate, wav_norm.astype(np.int16))
 93 | 
 94 | def load_wav(wav_path: str, sample_rate: int = 24000) -> np.ndarray:
 95 |     """Load waveform from a file using librosa.
 96 | 
 97 |     Args:
 98 |         wav_path (str): Path to a file to load.
 99 |         sample_rate (int, optional): Sampling rate used for loading the file. Defaults to 24000.
100 | 
101 |     Returns:
102 |         np.ndarray: Waveform with float values in range [-1, 1].
103 |     """
104 |     return librosa.load(wav_path, sr=sample_rate)[0]


--------------------------------------------------------------------------------
/youdub/video_postprocess.py:
--------------------------------------------------------------------------------
  1 | import json
  2 | from moviepy.editor import VideoFileClip, AudioFileClip
  3 | import os
  4 | import subprocess
  5 | def format_timestamp(seconds):
  6 |     """Converts seconds to the SRT time format."""
  7 |     millisec = int((seconds - int(seconds)) * 1000)
  8 |     hours, seconds = divmod(int(seconds), 3600)
  9 |     minutes, seconds = divmod(seconds, 60)
 10 |     return f"{hours:02}:{minutes:02}:{seconds:02},{millisec:03}"
 11 | 
 12 | 
 13 | def convert_json_to_srt(json_file, srt_file, max_line_char=30):
 14 |     print(f'Converting {json_file} to {srt_file}...')
 15 |     with open(json_file, 'r', encoding='utf-8') as f:
 16 |         subtitles = json.load(f)
 17 | 
 18 |     with open(srt_file, 'w', encoding='utf-8') as f:
 19 |         for i, subtitle in enumerate(subtitles, 1):
 20 |             start = format_timestamp(subtitle['start'])
 21 |             end = format_timestamp(subtitle['end'])
 22 |             text = subtitle['text']
 23 |             line = len(text)//(max_line_char+1) + 1
 24 |             avg = min(round(len(text)/line), max_line_char)
 25 |             text = '\n'.join([text[i*avg:(i+1)*avg]
 26 |                              for i in range(line)])
 27 | 
 28 |             f.write(f"{i}\n")
 29 |             f.write(f"{start} --> {end}\n")
 30 |             f.write(f"{text}\n\n")
 31 | 
 32 | 
 33 | def replace_audio_ffmpeg(input_video: str, input_audio: str, input_subtitles: str, output_path: str, fps=30) -> None:
 34 |     input_folder = os.path.dirname(input_video)
 35 |     dst_folder = os.path.join(input_folder, '0_finished')
 36 |     if not os.path.exists(dst_folder):
 37 |         os.mkdir(dst_folder)
 38 |     
 39 |     if os.path.exists(output_path):
 40 |         command = f'move "{input_video}" "{dst_folder}"'
 41 |         subprocess.Popen(command, shell=True)
 42 |         return
 43 | 
 44 |     # Extract the video name from the input video path
 45 |     video_name = os.path.basename(input_video)
 46 | 
 47 |     # Replace video file extension with '.srt' for subtitles
 48 |     srt_name = video_name.replace('.mp4', '.srt').replace(
 49 |         '.mkv', '.srt').replace('.avi', '.srt').replace('.flv', '.srt')
 50 | 
 51 |     # Construct the path for the subtitles file
 52 |     srt_path = os.path.join(os.path.dirname(input_audio), srt_name)
 53 | 
 54 |     # Convert subtitles from JSON to SRT format
 55 |     convert_json_to_srt(input_subtitles, srt_path)
 56 | 
 57 |     # Determine the output folder and define a temporary file path
 58 |     output_folder = os.path.dirname(output_path)
 59 |     tmp = os.path.join(output_folder, 'tmp.mp4')
 60 | 
 61 |     # Prepare a list to hold FFmpeg commands
 62 |     commands = []
 63 | 
 64 |     # FFmpeg command to speed up the video by 1.05 times
 65 |     speed_up = 1.05
 66 |     
 67 |     if speed_up == 1:
 68 |         tmp = output_path
 69 |     commands.append(f'ffmpeg -i "{input_video}" -i "{input_audio}" -vf "subtitles={srt_path}:force_style=\'FontName=Arial,FontSize=20,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,WrapStyle=2\'" -c:v libx264 -r {fps} -c:a aac -map 0:v:0 -map 1:a:0 "{tmp}" -y'.replace('\\', '/'))
 70 |     
 71 |     # commands.append(f'ffmpeg -i "{input_video}" -i "{input_audio}" -c:v libx264 -r {fps} -c:a aac -map 0:v:0 -map 1:a:0 "{tmp}" -y'.replace('\\', '/'))
 72 |     
 73 |     if speed_up != 1:
 74 |         commands.append(
 75 |             f'ffmpeg -i "{tmp}" -vf "setpts={1/speed_up}*PTS" -af "atempo={speed_up}" -c:v libx264 -c:a aac "{output_path}" -y'.replace('\\', '/'))
 76 | 
 77 |     # Command to delete the temporary file
 78 |     commands.append(f'del "{tmp}"')
 79 |     
 80 |     # move input video to dst folder
 81 |     commands.append(f'move "{input_video}" "{dst_folder}"')
 82 | 
 83 |     # Add an 'exit' command to close the command prompt window after execution
 84 |     commands.append('exit')
 85 | 
 86 |     # Join the commands with '&&' to ensure sequential execution
 87 |     command = ' && '.join(commands)
 88 | 
 89 |     # Execute the combined FFmpeg command
 90 |     print(command)
 91 |     subprocess.Popen(command, shell=True)
 92 |     
 93 | def replace_audio(video_path: str, audio_path: str, subtitle_path: str, output_path: str, fontsize=64, font='SimHei', color='white') -> None:
 94 |     """Replace the audio of the video file with the provided audio file.
 95 | 
 96 |     Args:
 97 |         video_path (str): Path to the video file.
 98 |         audio_path (str): Path to the audio file to replace the original audio.
 99 |         output_path (str): Path to save the output video file.
100 |     """
101 |     video_name = os.path.basename(video_path)
102 |     srt_name = video_name.replace('.mp4', '.srt').replace(
103 |         '.mkv', '.srt').replace('.avi', '.srt').replace('.flv', '.srt')
104 |     srt_path = os.path.join(os.path.dirname(audio_path), srt_name)
105 |     convert_json_to_srt(subtitle_path, srt_path)
106 | 
107 |     video = VideoFileClip(video_path)
108 |     audio = AudioFileClip(audio_path)
109 |     new_video = video.set_audio(audio)
110 | 
111 |     new_video.write_videofile(output_path, codec='libx264', threads=16, fps=30)
112 | 
113 | if __name__ == '__main__':
114 |     # file_name = r"This Virus Shouldn't Exist (But it Does)"
115 |     file_name = "Kurzgesagt Channel Trailer"
116 |     input_folder = 'test'
117 |     output_folder = os.path.join('output', file_name)
118 |     input_video = os.path.join(input_folder, file_name + '.mp4')
119 |     input_audio = os.path.join(output_folder, 'zh.wav')
120 |     input_subtitles = os.path.join(output_folder, 'zh.json')
121 |     srt_path = os.path.join(output_folder, file_name+'.srt')
122 |     output_path = os.path.join(output_folder, file_name + '.mp4')
123 |     replace_audio_ffmpeg(input_video, input_audio,
124 |                          input_subtitles, output_path)
125 | 


--------------------------------------------------------------------------------
/开发.md:
--------------------------------------------------------------------------------
 1 | 设计这样的Python架构确实需要考虑高效和可扩展性，特别是处理多视频文件的情况。以下是一个建议的架构设计，它将整个流程分解为几个主要步骤，每个步骤都可以并行处理以提高效率。
 2 | 
 3 | ### 1. 初始化和设置
 4 | 
 5 | 首先，初始化项目并设置一些基本参数，如输入/输出文件夹的路径、各种工具的配置等。
 6 | 
 7 | ### 2. 视频文件扫描和队列管理
 8 | 
 9 | - **扫描输入文件夹**：检查输入文件夹中的所有视频文件。
10 | - **创建任务队列**：为每个视频文件创建一个任务，并将这些任务添加到队列中。可以使用Python的队列（如`queue.Queue`）来管理这些任务。
11 | 
12 | ### 3. 并行处理
13 | 
14 | 此步骤将根据可用资源（CPU核心数、内存等）启动多个并行进程或线程来处理任务队列中的视频。可以使用Python的`concurrent.futures`模块来实现并行处理。
15 | 
16 | #### 对于每个视频文件的处理流程包括：
17 | 
18 | - **创建输出文件夹**：为每个视频创建对应的输出文件夹。
19 | - **音频提取和处理**：
20 |   - 使用`moviepy`提取音频。
21 |   - 使用`demucs`进行人声分离。
22 |   - 使用`whisperX`进行语音识别和时间戳对齐。
23 | 
24 | - **翻译和语音合成**：
25 |   - 使用ChatGPT进行翻译。
26 |   - 使用PaddleSpeech进行语音合成。
27 |   - 调整合成语音的时间长度。
28 | 
29 | - **音频合成和视频合成**：
30 |   - 合成中文语音和背景音乐。
31 |   - 将合成的音频与原视频结合生成最终视频。
32 | 
33 | ### 4. 同步和最终输出
34 | 
35 | - **监控和同步**：确保所有的子任务完成后，进行同步操作。
36 | - **输出最终视频**：将处理好的视频移动到最终的输出文件夹。
37 | 
38 | ### 5. 错误处理和日志记录
39 | 
40 | - 在每个步骤中添加错误处理和异常捕捉机制。
41 | - 记录处理过程中的日志，以便于问题排查和性能优化。
42 | 
43 | ### 6. 可选：用户界面/UI
44 | 
45 | - 如果需要，可以开发一个简单的用户界面来管理这些任务，显示进度和日志。
46 | 
47 | ### 技术选型
48 | 
49 | - **多进程/多线程**：根据任务的CPU密集型或IO密集型特性选择合适的并行策略。
50 | - **队列管理**：使用Python内置的队列管理任务。
51 | - **日志记录**：使用`logging`模块进行详细的日志记录。
52 | 
53 | ### 注意事项
54 | 
55 | - 资源管理：确保并行处理不会耗尽系统资源。
56 | - 错误处理：为每个步骤添加充分的错误处理和异常捕捉。
57 | - 性能优化：定期回顾日志和性能指标，以优化处理流程。
58 | 
59 | 这个架构提供了一个基本的框架，你可以根据实际情况进行调整和优化。


--------------------------------------------------------------------------------