├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SECURITY.md ├── assets ├── E2E_intel.png └── E2E_stock.png ├── data ├── data.txt ├── pretrained_models.txt └── product_description.csv ├── env ├── intel │ └── intel-voice.yml └── stock │ └── stock-voice.yml └── src ├── evaluation.py ├── inference.py ├── optim_eval.py └── training.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual 10 | identity and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the overall 26 | community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or advances of 31 | any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email address, 35 | without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | CommunityCodeOfConduct AT intel DOT com. 64 | All complaints will be reviewed and investigated promptly and fairly. 
65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series of 86 | actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or permanent 93 | ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within the 113 | community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.1, available at 119 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 120 | 121 | Community Impact Guidelines were inspired by 122 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 123 | 124 | For answers to common questions about this code of conduct, see the FAQ at 125 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 126 | [https://www.contributor-covenant.org/translations][translations]. 127 | 128 | [homepage]: https://www.contributor-covenant.org 129 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 130 | [Mozilla CoC]: https://github.com/mozilla/diversity 131 | [FAQ]: https://www.contributor-covenant.org/faq -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | ### License 4 | 5 | The Synthetic Voice/Audio Generation reference kit is licensed under the terms in [LICENSE](https://github.com/oneapi-src/voice-data-generation/blob/main/LICENSE).
By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms. 6 | 7 | ### Sign your work 8 | 9 | Please use the sign-off line at the end of the patch. Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: if you can certify 10 | the below (from [developercertificate.org](http://developercertificate.org/)): 11 | 12 | ``` 13 | Developer Certificate of Origin 14 | Version 1.1 15 | 16 | Copyright (C) 2004, 2006 The Linux Foundation and its contributors. 17 | 660 York Street, Suite 102, 18 | San Francisco, CA 94110 USA 19 | 20 | Everyone is permitted to copy and distribute verbatim copies of this 21 | license document, but changing it is not allowed. 22 | 23 | Developer's Certificate of Origin 1.1 24 | 25 | By making a contribution to this project, I certify that: 26 | 27 | (a) The contribution was created in whole or in part by me and I 28 | have the right to submit it under the open source license 29 | indicated in the file; or 30 | 31 | (b) The contribution is based upon previous work that, to the best 32 | of my knowledge, is covered under an appropriate open source 33 | license and I have the right under that license to submit that 34 | work with modifications, whether created in whole or in part 35 | by me, under the same open source license (unless I am 36 | permitted to submit under a different license), as indicated 37 | in the file; or 38 | 39 | (c) The contribution was provided directly to me by some other 40 | person who certified (a), (b) or (c) and I have not modified 41 | it. 42 | 43 | (d) I understand and agree that this project and the contribution 44 | are public and that a record of the contribution (including all 45 | personal information I submit with it, including my sign-off) is 46 | maintained indefinitely and may be redistributed consistent with 47 | this project or the open source license(s) involved. 48 | ``` 49 | 50 | Then you just add a line to every git commit message: 51 | 52 | Signed-off-by: Joe Smith 53 | 54 | Use your real name (sorry, no pseudonyms or anonymous contributions.) 55 | 56 | If you set your `user.name` and `user.email` git configs, you can sign your 57 | commit automatically with `git commit -s`. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023, Intel Corporation 2 | 3 | Redistribution and use in source and binary forms, with or without 4 | modification, are permitted provided that the following conditions are met: 5 | 6 | * Redistributions of source code must retain the above copyright notice, 7 | this list of conditions and the following disclaimer. 8 | * Redistributions in binary form must reproduce the above copyright 9 | notice, this list of conditions and the following disclaimer in the 10 | documentation and/or other materials provided with the distribution. 11 | * Neither the name of Intel Corporation nor the names of its contributors 12 | may be used to endorse or promote products derived from this software 13 | without specific prior written permission. 14 | 15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 16 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 17 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 18 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE 19 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 20 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 21 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 22 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 23 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 24 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | PROJECT NOT UNDER ACTIVE MANAGEMENT 2 | 3 | This project will no longer be maintained by Intel. 4 | 5 | Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. 6 | 7 | Intel no longer accepts patches to this project. 8 | 9 | If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project. 10 | 11 | Contact: webadmin@linux.intel.com 12 | # Applications of Synthetic Voice/Audio Generation Using PyTorch 13 | ## Introduction 14 | Synthetic voice is computer-generated speech. [More Info](https://en.wikipedia.org/wiki/Speech_synthesis) 15 | ## Table of Contents 16 | - [Purpose](#purpose) 17 | - [Reference Solution](#reference-solution) 18 | - [Reference Implementation](#reference-implementation) 19 | 20 | ## Purpose 21 | With the surge in AI implementation in modern industrial solutions, the demand for datasets has increased significantly to build robust and reliable AI models. The challenges associated with data privacy, the higher cost of purchasing real datasets, limited data availability, the accuracy of data labeling, and the lack of scalability and variety are driving the use of synthetic data to fulfill the high demand for AI solutions across industries. 22 | 23 | Synthetic voice has wide applications in virtual assistants, education, healthcare, multimedia and entertainment. Text-to-Speech (TTS) is one method of generating synthetic voice. It creates human speech artificially. One of the main benefits of voice synthesis is to make information easily accessible to a wider audience. For example, people with visual impairments or reading disabilities can use this technology to read written content aloud. This can help differently-abled people access a wider range of information and communicate more easily with others. 24 | 25 | Voice synthesis technology is increasingly used to create more natural-sounding virtual assistants and chatbots, which can improve the user experience and engagement through personalized communication based on voice and language preferences. 26 | 27 | TTS is evolving in a variety of ways. For example, voice cloning can capture your brand essence and express it through a machine. Speech cloning allows you to use TTS in conjunction with voice recording datasets to reproduce the voices of known persons such as executives and celebrities, which can be valuable for businesses in industries such as entertainment. 28 | ## Reference Solution 29 | An AI-enabled synthetic voice generator takes simple generated text (a context or story) as input and produces a purely system-generated synthetic voice.
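The paragraphs below describe the approach in detail. As a preview of how the two stages fit together, the following minimal sketch condenses the inference flow found later in this repository's `src/inference.py` and `src/optim_eval.py`; it assumes the WaveRNN repository has been cloned and the pretrained weights unpacked as described in `data/pretrained_models.txt`, and is an illustration rather than a drop-in replacement for those scripts.

```python
# Minimal sketch of the two-stage text-to-speech flow in this reference kit,
# condensed from src/inference.py and src/optim_eval.py. Paths to the
# hyperparameter file and pretrained weights follow the setup steps described
# later in this README; run it from inside the cloned WaveRNN folder.
import os
import torch

from utils import hparams as hp
from utils.optim_eval import load_model
from utils.text import text_to_sequence

hp.configure('hparams.py')                      # load hyperparameters
device = torch.device('cpu')
os.makedirs('saved_audio', exist_ok=True)

# Stage 1: Tacotron turns a text sequence into a mel spectrogram
tts_model = load_model('pretrained/tts_weights/latest_weights.pyt', 't', device)
# Stage 2: the WaveRNN vocoder turns the mel spectrogram into a waveform
voc_model = load_model('pretrained/voc_weights/latest_weights.pyt', 'w', device)

text = "From fairest creatures we desire increase"
x = text_to_sequence(text.strip(), hp.tts_cleaner_names)    # text -> symbol IDs

_, mel, _ = tts_model.generate(x)               # predict the mel spectrogram
mel = (torch.tensor(mel).unsqueeze(0) + 4) / 8  # rescale to the range WaveRNN expects

# Render and save the waveform (batched generation, as in the scripts)
voc_model.generate(mel, 'saved_audio/sample.wav', True,
                   hp.voc_target, hp.voc_overlap, hp.mu_law)
```

The full scripts add argument parsing, optional Intel® Extension for PyTorch* optimizations, warm-up runs, and word-error-rate evaluation on top of this core flow.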
30 | 31 | The goal of this reference kit is to translate the input text data into speech. A transfer learning approach is applied to advanced PyTorch-based pre-trained Tacotron and WaveRNN (VOCODER) models. This model combination is known to be a promising method to synthesize voice data from the corresponding input text data. The LJ Speech dataset, after pre-processing using NumPy, is used to further train the mentioned pre-trained models. From the input text data, the model generates speech that mimics the voice of the LJ Speech dataset that was used to train the AI model. 32 | 33 | GPUs are typically the choice for deep learning and AI processing to achieve a higher performance rate. To offer a more cost-effective option leveraging a CPU, the quantization technique can be used with the Intel® oneAPI AI Analytics Toolkit to achieve a higher performance rate by performing vectorized operations on the CPU itself. 34 | 35 | Quantizing/compressing the model (from a floating-point to an integer model), while maintaining a similar level of accuracy to the floating-point model, demonstrates efficient utilization of underlying resources when deployed on edge devices with low processing and memory capabilities. 36 | 37 | ## Reference Implementation 38 | ### Use Case End-To-End flow 39 | ![Use_case_flow](assets/E2E_stock.png) 40 | 41 | **Description:** The open-source LJ Speech voice dataset is first preprocessed with NumPy before being used to train advanced pre-trained Tacotron and WaveRNN (VOCODER) models with Stock PyTorch v1.13.0. Following training, the trained Stock PyTorch v1.13.0 models are used to generate synthetic voice data from the input text sentence. 42 | 43 | ### Expected Input-Output 44 | 45 | 46 | | **Input** | **Output** | 47 | |:-------------------------------------:|-----| 48 | | Text Sentence | Synthesized Audio data | 49 | 50 | 51 | ### Reference Sources 52 | 53 | *DataSet*: https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 (2.6 GB dataset for this use case)
54 | *Case Study & Repo*: https://github.com/fatchord/WaveRNN 55 | 56 | > ***Please see this data set's applicable license for terms and conditions. Intel® Corporation does not own the rights to this data set and does not confer any rights to it.*** 57 | 58 | ### Repository clone and Anaconda installation 59 | 60 | ``` 61 | git clone https://github.com/oneapi-src/voice-data-generation 62 | cd voice-data-generation 63 | ``` 64 | 65 | > **Note**: If you are beginning to explore the reference kits on client machines such as a Windows laptop, go to the [Running on Windows](#running-on-windows) section to ensure you are all set, then come back here. 66 | 67 | > **Note**: The performance measurements were captured on Xeon-based processors. The instructions will work on WSL; however, some portions of the reference kits may run slower on a client machine, so utilize the supported flags to modify the epochs/batch size to run the training or inference faster. Additionally, the reported performance claims may not be observed on a Windows-based client machine. 68 | 69 | > **Note**: This reference kit implementation already provides the necessary conda environment configurations to set up the software requirements. To utilize these environment scripts, first install Anaconda/Miniconda by following the instructions at the following link: 70 | > [Anaconda installation](https://docs.anaconda.com/anaconda/install/linux/) 71 | 72 | ### Usage and Instructions 73 | 74 | Below are the steps to reproduce the benchmarking results given in this repository: 75 | 1. Creating the execution environment 76 | 2. Dataset preparation 77 | 3. Training Tacotron & WaveRNN models 78 | 4. Evaluation 79 | 5. Model Inference 80 | 81 | ### Software Requirements 82 | | **Package** | **Stock Python** 83 | |:-------------------------| :--- 84 | | Python | python==3.8.15 85 | | PyTorch | torch==1.13.0 86 | 87 | ### Environment 88 | Below is the developer environment used for this module on Azure. All the observations captured are based on this environment setup. 89 | 90 | 91 | | **Size** | **CPU Cores** | **Memory** | **Intel® CPU Family** | 92 | |----------|:-------------:|:----------:|:---------------------:| 93 | | NA | 8 | 32GB | ICELAKE | 94 | 95 | ### Solution setup 96 | The below file is used to create an environment as follows: 97 | 98 | 99 | | **YAML file** | **Environment Name** | **Configuration** | 100 | |:---------------------:|----------------------|:---------------------------------------:| 101 | | `env/stock/stock-voice.yml` | `stock-voice` | Python=3.8.15 with stock PyTorch v1.13.0 | 102 | 103 | ### Dataset 104 | This is a public domain dataset containing short audio clips of a single speaker reading passages from 7 different non-fiction books. 105 | 106 | | **Use case** | Speech Generation 107 | |:-------------------------| :--- 108 | | **Data Format** | Audio File in ".wav" format 109 | | **Size** | Total 13100 short audio files
110 | **Source** | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 111 | 112 | > **Note**: Please refer to the data.txt file in the "data" folder for instructions on downloading the dataset. 113 | 114 | ### Training 115 | 116 | We train the Tacotron model on the preprocessed data, which generates Ground Truth Aligned (GTA) mel spectrograms that are later fed to the WaveRNN model to generate a clean synthesized audio file. 117 | 118 | 119 | | **Input** | String (number of words) 120 | |:------------------------| :--- 121 | | **Output Model format** | PyTorch 122 | | **Output** | Audio file 123 | 124 | ### Inference 125 | Inference is performed on the trained model using Stock PyTorch v1.13.0. 126 | 127 | #### 1. Environment Creation 128 | **Setting up the environment for Stock PyTorch**
Follow the below conda installation commands to set up the Stock PyTorch environment for the model training and prediction. 129 | ```sh 130 | conda env create -f env/stock/stock-voice.yml 131 | ``` 132 | *Activate stock conda environment* 133 | Use the following command to activate the environment that was created: 134 | ```sh 135 | conda activate stock-voice 136 | ``` 137 | 138 | >Note: Please refer to the Known Issues section at the end in case of any library issues. 139 | 140 | #### 2. Data preparation & Pre-trained models 141 | 142 | ##### 2.1 Preparing Pre-trained models 143 | > **Note**: For instructions on getting the pretrained models and setting up the repository, refer to the **pretrained_models.txt** file inside the "data" folder and follow the steps. 144 | 145 | ##### 2.2 Data preparation 146 | > The LJSpeech audio files dataset is downloaded and extracted into a folder before running the training python module. 147 | 148 | The folder structure looks as below after extraction of the dataset. 149 | ``` 150 | - data 151 | - LJSpeech-1.1 152 | - wavs 153 | - metadata 154 | - readme 155 | ``` 156 | > **Note**: For instructions on downloading the dataset, refer to the **data.txt** file inside the "data" folder. 157 | 158 | > **Now the data folder contains the below structure** 159 |
data="data/LJSpeech-1.1/{wavs/metadata/readme}" 160 | 161 | > **Note**: Please be in "WaveRNN" folder to continue benchmarking. the below step is optional, if user already followed the instructions provided above in data preperation. 162 | ``` 163 | cd WaveRNN 164 | ``` 165 | Run the preprocess module as given below to start data preprocessing using the active environment. 166 |
This module takes the following options to run the preprocessing: 167 | ``` 168 | usage: preprocess.py [-h] [--path PATH] [--extension EXT] [--num_workers N] [--hp_file FILE] 169 | 170 | Preprocessing for WaveRNN and Tacotron 171 | 172 | optional arguments: 173 | -h, --help show this help message and exit 174 | --path PATH, -p PATH directly point to dataset path (overrides hparams.wav_path 175 | --extension EXT, -e EXT 176 | file extension to search for in dataset folder 177 | --num_workers N, -w N 178 | The number of worker threads to use for preprocessing 179 | --hp_file FILE The file to use for the hyperparameters 180 | ``` 181 | **Command to do data preprocessing** 182 | ```sh 183 | python preprocess.py --path '../../data/LJSpeech-1.1' -e '.wav' -w 8 --hp_file 'hparams.py' 184 | ``` 185 | > **Note**: Preprocessed data will be stored inside the "data" folder of the cloned "WaveRNN" repository. 186 | 187 | #### 3. Training model 188 | Run the training module as given below to start training using the active environment. 189 | 190 |
This module takes the following options to run the training: 191 | ``` 192 | usage: training.py [-h] [--force_gta] [--force_cpu] [--lr LR] [--batch_size BATCH_SIZE] [--hp_file FILE] [--epochs EPOCHS] 193 | 194 | Train Tacotron TTS & WaveRNN Voc 195 | 196 | optional arguments: 197 | -h, --help show this help message and exit 198 | --force_gta, -g Force the model to create GTA features 199 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 200 | --lr LR, -l LR [float] override hparams.py learning rate 201 | --batch_size BATCH_SIZE, -b BATCH_SIZE 202 | [int] override hparams.py batch size 203 | --hp_file FILE The file to use for the hyper parameters 204 | --epochs EPOCHS, -e EPOCHS 205 | [int] number of epochs for training 206 | ``` 207 | **Command to run training** 208 | ```sh 209 | python training.py --hp_file 'hparams.py' --epochs 100 210 | ``` 211 | 212 | >Note: Training is optional as this reference kit provides pretrained models to run evaluation and inference as given below. 213 | 214 | **Expected Output**
215 | >The output trained model will be saved in `WaveRNN/pretrained/tts_weights` & `WaveRNN/pretrained/voc_weights` as `latest_weights.pyt` for Tacotron model & WaveRNN model respectively. 216 | 217 | #### 4. Evaluating the model 218 | 219 | Run the evaluation module to find out the word error rate and accuracy of the model. 220 | ``` 221 | usage: evaluation.py [-h] [--input_text INPUT_TEXT] [--batched] [--unbatched] [--force_cpu] [--hp_file FILE] [-ipx INTEL] [--save_path SAVE_PATH] 222 | [--voc_weights VOC_WEIGHTS] [--tts_weights TTS_WEIGHTS] 223 | 224 | Evaluation 225 | 226 | optional arguments: 227 | -h, --help show this help message and exit 228 | --input_text INPUT_TEXT, -i INPUT_TEXT 229 | [string] Type in something here and TTS will generate it! 230 | --batched, -b Fast Batched Generation (lower quality) 231 | --unbatched, -u Slower Unbatched Generation (better quality) 232 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 233 | --hp_file FILE The file to use for the hyper parameters 234 | -ipx INTEL, --intel INTEL 235 | use 1 for enabling intel pytorch optimizations, default is 0 236 | --save_path SAVE_PATH 237 | [string/path] where to store the speech files generated for the input text, default saved_audio folder 238 | --voc_weights VOC_WEIGHTS 239 | [string/path] Load in different WaveRNN weights 240 | --tts_weights TTS_WEIGHTS 241 | [string/path] Load in different Tacotron weights 242 | ``` 243 | 244 | **Command to run evaluation** 245 | 246 | > **Note**: Users can evaluate the models in two ways 247 | 1. Single text sentence 248 | 2. Multiple text sentences using csv file. 249 | 250 | ```sh 251 | # Evaluating on the single input string 252 | python evaluation.py --input_text "From fairest creatures we desire increase, That thereby beauty's rose" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 253 | ``` 254 | ```sh 255 | # Evaluating on multiple text sentences using csv file 256 | python evaluation.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 257 | ``` 258 | 259 | **Expected Output**
260 | >Average Word Error Rate: 36.36430090377459%, accuracy=63.635699096225416% 261 | 262 | The user can collect the logs by redirecting the output to a file as illustrated below. 263 | 264 | ```shell 265 | python evaluation.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' | tee 266 | ``` 267 | 268 | The output of the python script evaluation.py will be collected in the file 269 | 270 | #### 5. Inference 271 | *Running inference using PyTorch* 272 | 273 | ``` 274 | usage: inference.py [-h] [--input_text INPUT_TEXT] [--batched] [--unbatched] [--force_cpu] [--hp_file FILE] [-ipx INTEL] [--save_path SAVE_PATH] 275 | [--voc_weights VOC_WEIGHTS] [--tts_weights TTS_WEIGHTS] 276 | 277 | Inference 278 | 279 | optional arguments: 280 | -h, --help show this help message and exit 281 | --input_text INPUT_TEXT, -i INPUT_TEXT 282 | [string/csv file] Type in something here and TTS will generate it! 283 | --batched, -b Fast Batched Generation (lower quality) 284 | --unbatched, -u Slower Unbatched Generation (better quality) 285 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 286 | --hp_file FILE The file to use for the hyper parameters 287 | -ipx INTEL, --intel INTEL 288 | use 1 for enabling intel pytorch optimizations, default is 0 289 | --save_path SAVE_PATH 290 | [string/path] where to store the speech files generated for the input text, default saved_audio folder 291 | --voc_weights VOC_WEIGHTS 292 | [string/path] Load in different WaveRNN weights 293 | --tts_weights TTS_WEIGHTS 294 | [string/path] Load in different Tacotron weights 295 | ``` 296 | **Command to run inference** 297 | 298 | > **Note**: Users can inference the models in two ways 299 | 1. Single text sentence 300 | 2. Multiple text sentences using csv file. 301 | 302 | ```sh 303 | # Batch inferencing on the single input string 304 | python inference.py --input_text "From fairest creatures we desire increase, That thereby beauty's rose" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 305 | ``` 306 | ```sh 307 | # Batch inferencing on multiple text sentences using csv file 308 | python inference.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 309 | ``` 310 | 311 | **Expected Output**
312 | >Total time for inference is 10.439789295196533 313 | 314 | >Generated audio samples can be found in "saved_audio" folder by default. 315 | 316 | The user can collect the logs by redirecting the output to a file as illustrated below. 317 | 318 | ```shell 319 | python inference.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' | tee 320 | ``` 321 | 322 | The output of the python script inference.py will be collected in the file 323 | 324 | ## Optimizing the End To End solution with Intel® oneAPI components 325 | 326 | #Coming Soon... 327 | 328 | This reference solution can be optimized with Intel® oneAPI components to achieve a performance boost, This section will be added soon. 329 | 330 | ## Conclusion 331 | To build a synthetic voice data generator model for audio synthesis using the Deep-learning approach, machine learning engineers will need to train models with a large dataset and run inference more frequently. 332 | 333 | ### Notices & Disclaimers 334 | Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/). 335 | Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. 336 | Your costs and results may vary. 337 | Intel technologies may require enabled hardware, software or service activation. 338 | © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. 339 | 340 | To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license. 341 | 342 | Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content. 343 | 344 | ## Appendix 345 | 346 | ### **Running on Windows** 347 | 348 | The reference kits commands are linux based, in order to run this on Windows, goto Start and open WSL and follow the same steps as running on a linux machine starting from git clone instructions. If WSL is not installed you can [install WSL](https://learn.microsoft.com/en-us/windows/wsl/install). 349 | 350 | > **Note** If WSL is installed and not opening, goto Start ---> Turn Windows feature on or off and make sure Windows Subsystem for Linux is checked. Restart the system after enabling it for the changes to reflect. 
351 | 352 | 353 | ### **Experiment Setup** 354 | - Testing performed on: March 2023 355 | - Testing performed by: Intel Corporation 356 | - Configuration Details: Azure Standard_D8_V5 (Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz), 1 Socket, 4 Cores per Socket, 2 Threads per Core, Turbo:On, Total Memory: 32 GB, OS: Ubuntu 20.04, Kernel: Linux 5.13.0-1031-azure , Software: Intel® Extension for PyTorch* v1.13.0, Intel® Neural Compressor v1.14.2 357 | 358 | | Platform | Ubuntu 20.04 359 | | :--- | :--- 360 | | Hardware | Azure Standard_D8_V5 (Icelake) 361 | | Software | Intel® Extension for PyTorch*, Intel® Neural Compressor. 362 | | What you will learn | Advantage of using components in Intel® oneAPI AI Analytics Toolkit over the stock version for the computer vision-based model build and inferencing. 363 | 364 | ### Known Issues 365 | 366 | 1. Common prerequisites required to run python scripts in linux system. 367 | Install gcc and curl. For Ubuntu, this will be: 368 | 369 | ```bash 370 | apt install gcc 371 | sudo apt install libglib2.0-0 372 | sudo apt install curl 373 | ``` 374 | 375 | 2. ImportError: libGL.so.1: cannot open shared object file: No such file or directory 376 | 377 | **Issue:** 378 | ``` 379 | ImportError: libGL.so.1: cannot open shared object file: No such file or directory 380 | or 381 | libgthread-2.0.so.0: cannot open shared object file: No such file or directory 382 | ``` 383 | 384 | **Solution:** 385 | 386 | Install the libgl11-mesa-glx and libglib2.0-0 libraries. For Ubuntu this will be: 387 | 388 | ```bash 389 | sudo apt install libgl1-mesa-glx 390 | sudo apt install libglib2.0-0 391 | ``` 392 | 3. OSError: cannot load library 'libsndfile.so' 393 | 394 | **Issue:** 395 | ``` 396 | OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory 397 | ``` 398 | 399 | **Solution:** 400 | 401 | Install the libsndfile1-dev library. For Ubuntu this will be: 402 | 403 | ```bash 404 | sudo apt-get install libsndfile1-dev 405 | ``` 406 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Report a Vulnerability 4 | 5 | Please report security issues or vulnerabilities to the [Intel® Security Center]. 6 | 7 | For more information on how Intel® works to resolve security issues, see 8 | [Vulnerability Handling Guidelines]. 
9 | 10 | [Intel® Security Center]:https://www.intel.com/content/www/us/en/security-center/default.html 11 | 12 | [Vulnerability Handling Guidelines]:https://www.intel.com/content/www/us/en/security-center/vulnerability-handling-guidelines.html 13 | -------------------------------------------------------------------------------- /assets/E2E_intel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/voice-data-generation/012e833f68dc53d321b2dbcd454f7a4de7cc4b80/assets/E2E_intel.png -------------------------------------------------------------------------------- /assets/E2E_stock.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/voice-data-generation/012e833f68dc53d321b2dbcd454f7a4de7cc4b80/assets/E2E_stock.png -------------------------------------------------------------------------------- /data/data.txt: -------------------------------------------------------------------------------- 1 | *make sure of the directory you are in is WaveRNN* 2 | 3 | cd ../../data 4 | 5 | 6 | *now to dowload the data set in tar format from LJSpeech* 7 | 8 | wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 9 | 10 | 11 | *Extract the LJSpeech-1.1.tar.bz2 file that was downloaded to have readMe.txt, metadata.csv & wav Folder which contains the audio files * 12 | 13 | tar -xf LJSpeech-1.1.tar.bz2 14 | 15 | 16 | *remove the tar file * 17 | 18 | rm LJSpeech-1.1.tar.bz2 19 | 20 | 21 | *Change back to the WaveRNN directory* 22 | 23 | cd ../src/WaveRNN 24 | 25 | 26 | *** Start preprocessing of the data by running preprocess.py, as mentioned in ReadMe.md*** -------------------------------------------------------------------------------- /data/pretrained_models.txt: -------------------------------------------------------------------------------- 1 | cd src 2 | git clone https://github.com/fatchord/WaveRNN 3 | 4 | cp ./*.py WaveRNN 5 | cp ./deploy.yaml WaveRNN 6 | 7 | mv WaveRNN/optim_eval.py WaveRNN/utils 8 | 9 | cd WaveRNN 10 | 11 | unzip pretrained/ljspeech.tacotron.r2.180k.zip -d pretrained/tts_weights 12 | unzip pretrained/ljspeech.wavernn.mol.800k.zip -d pretrained/voc_weights 13 | 14 | 15 | *** follow the data_download_manual.txt *** 16 | 17 | 18 | 19 | -------------------------------------------------------------------------------- /data/product_description.csv: -------------------------------------------------------------------------------- 1 | Product,Description 2 | Intel Xeon Scalable,3rd Gen Intel Xeon Scalable processors offer a balanced architecture that delivers built-in AI acceleration and advanced security capabilities. This allows you to place your workloads where they perform best - from edge to cloud. 3 | Intel Xeon Scalable,"The 3rd Gen Intel Xeon Scalable processor benefits from decades of innovation for the most common workload requirements. Supported by close partnerships with the world’s software leaders and solution providers. 3rd Gen Intel Xeon Scalable processors are optimized for many workload types and performance levels, all with the consistent, open, Intel architecture you know and trust." 4 | Intel Core i9 Processors,"These processors feature a performance hybrid architecture designed for intelligent performance, optimized creating, and enhanced tuning to allow gamers to game with up to 5.8 GHz clock speed." 
5 | Intel Mobile Chipsets,"These powerful and feature-rich chipsets, are purpose built for portable, mobile, and 2 in 1 devices. Users can watch UHD videos with crisp imagery, view and edit photos in perfect detail, and smoothly play today’s modern games." 6 | Intel Desktop Chipsets,"Mainstream chipsets run popular applications, support UHD video, audio, and image editing, and run today’s modern games without lag. Performance chipsets deliver superior audio and digital video, and ultimate power for content creation, advanced applications." 7 | Intel Pentium Processors,"Discover an amazing balance of performance, experience, and value with systems powered by Intel Pentium processors." 8 | Intel Pentium Processors,"These processors power more devices, from notebooks to convertibles to desktops and mini PCs—Supports Windows, Chrome and Linux OS—giving you flexibility to choose the best device for your needs, while knowing it will give you the performance, experiences, and security features you deserve" 9 | Intel Pentium Gold Processors,"Intel Pentium Gold processors provide great value and performance to do daily activities plus power to do light photo editing, video editing, and multitasking." 10 | Intel Pentium Gold Processors,Computers with Intel Pentium Gold processors provide quick processing and vivid graphics. Choose from a range of form factors. 11 | IoT and Embedded Processors,"Deploy edge applications quickly with Intel's portfolio of edge-ready compute and connectivity technologies. Enhanced for IoT, they enable processing at the edge to get critical insights and business value from your data with compute resources where you need them most." 12 | -------------------------------------------------------------------------------- /env/intel/intel-voice.yml: -------------------------------------------------------------------------------- 1 | name: intel-voice 2 | channels: 3 | - pytorch 4 | - defaults 5 | dependencies: 6 | - cpuonly=2.0 7 | - pip=22.2.2 8 | - python=3.8 9 | - pytorch=1.13.0 10 | - torchvision=0.14.0 11 | - pip: 12 | - numba==0.48.0 13 | - librosa==0.6.3 14 | - numpy==1.22.0 15 | - intel-extension-for-pytorch==1.13.0 16 | - neural-compressor==1.14.2 17 | - matplotlib==3.6.2 18 | - unidecode==1.3.6 19 | - inflect==6.0.2 20 | - SpeechRecognition==3.9.0 21 | - soundfile==0.11.0 22 | 23 | 24 | -------------------------------------------------------------------------------- /env/stock/stock-voice.yml: -------------------------------------------------------------------------------- 1 | name: stock-voice 2 | channels: 3 | - pytorch 4 | - defaults 5 | dependencies: 6 | - cpuonly=2.0 7 | - pip=22.2.2 8 | - python=3.8 9 | - pytorch=1.13.0 10 | - torchvision=0.14.0 11 | - pip: 12 | - numba==0.48.0 13 | - librosa==0.6.3 14 | - numpy==1.22.0 15 | - matplotlib==3.6.2 16 | - unidecode==1.3.6 17 | - inflect==6.0.2 18 | - SpeechRecognition==3.9.0 19 | - soundfile==0.11.0 20 | 21 | 22 | -------------------------------------------------------------------------------- /src/evaluation.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | Evaluation code 5 | below code is adopted from https://github.com/fatchord/WaveRNN 6 | """ 7 | # pylint: disable=C0103,C0301,C0413,E0401,R0914,R0915 8 | 9 | # !pip install SpeechRecognition pydub 10 | 11 | import argparse 12 | import os 13 | import torch 14 | import speech_recognition as sr 15 | import soundfile 16 | from 
utils.optim_eval import word_error_rate, ipex_optimization, load_model, run_inference 17 | from utils import hparams as hp 18 | from utils.text import text_to_sequence 19 | from utils.display import simple_table 20 | 21 | 22 | def process_evaluation_epoch(global_vars: dict): 23 | """ 24 | Processes results from each worker at the end of evaluation and combine to final result 25 | Args: 26 | global_vars: dictionary containing information of entire evaluation 27 | Return: 28 | wer: final word error rate 29 | loss: final loss 30 | """ 31 | hypotheses = global_vars['predictions'] 32 | references = global_vars['transcripts'] 33 | 34 | # wer, scores, num_words 35 | wer, _, _ = word_error_rate( 36 | hypotheses=hypotheses, references=references) 37 | return wer 38 | 39 | 40 | def main(): 41 | """ 42 | Main function 43 | """ 44 | # Parse Arguments 45 | parser = argparse.ArgumentParser(description='Evaluation') 46 | parser.add_argument('--input_text', '-i', type=str, default=None, 47 | help='[string] Type in something here and TTS will generate it!') 48 | parser.add_argument('--batched', '-b', dest='batched', action='store_true', default=False, 49 | help='Fast Batched Generation (lower quality)') 50 | parser.add_argument('--unbatched', '-u', dest='batched', action='store_false', 51 | help='Slower Unbatched Generation (better quality)') 52 | parser.add_argument('--force_cpu', '-c', action='store_true', 53 | help='Forces CPU-only training, even when in CUDA capable environment') 54 | parser.add_argument('--hp_file', metavar='FILE', default='hparams.py', 55 | help='The file to use for the hyper parameters') 56 | parser.add_argument('-ipx', '--intel', type=int, required=False, default=0, 57 | help='use 1 for enabling intel pytorch optimizations, default is 0') 58 | parser.add_argument('--save_path', type=str, default='saved_audio/evaluation', 59 | help='[string/path] where to store the speech files generated for the input text, ' 60 | 'default saved_audio folder') 61 | parser.add_argument('--voc_weights', type=str, default='pretrained/voc_weights/latest_weights.pyt', 62 | help='[string/path] Load in different WaveRNN weights') 63 | parser.add_argument('--tts_weights', type=str, default='pretrained/tts_weights/latest_weights.pyt', 64 | help='[string/path] Load in different Tacotron weights') 65 | args = parser.parse_args() 66 | 67 | hp.configure(args.hp_file) # Load hparams from file 68 | 69 | # parser.set_defaults(batched=False) 70 | parser.set_defaults(input_text=None) 71 | 72 | batched = args.batched 73 | input_text = args.input_text 74 | intel_flag = args.intel 75 | tts_weights = args.tts_weights 76 | voc_weights = args.voc_weights 77 | save_path = args.save_path 78 | # creating the save path directory to store the output generated audio file if it does not exist 79 | os.makedirs(save_path, exist_ok=True) 80 | 81 | if not args.force_cpu and torch.cuda.is_available(): 82 | device = torch.device('cuda') 83 | else: 84 | device = torch.device('cpu') 85 | print('Using device:', device) 86 | 87 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 88 | 89 | inputs = [text_to_sequence(input_text.strip(), hp.tts_cleaner_names)] 90 | inp_txt = input_text 91 | else: 92 | with open(input_text) as f: 93 | inputs = [] 94 | inp_txt = [] 95 | cnt = 0 96 | for line in f: 97 | split = line.split(',') 98 | sentence = split[-1][:-1] 99 | # adding "hi " here because the speech to text conversion sometimes misses the 1st word 100 | sentence = "hi " + sentence 101 | if cnt > 0: 102 | 
inp_txt.append(sentence) 103 | inputs.append(text_to_sequence(sentence.strip(), hp.tts_cleaner_names)) 104 | cnt += 1 105 | 106 | voc_model = load_model(voc_weights, 'w', device) 107 | tts_model = load_model(tts_weights, 't', device) 108 | 109 | tts_model, voc_model = ipex_optimization(tts_model, voc_model, intel_flag) 110 | 111 | voc_k = voc_model.get_step() // 1000 112 | tts_k = tts_model.get_step() // 1000 113 | 114 | r = tts_model.r 115 | 116 | simple_table([('WaveRNN', str(voc_k) + 'k'), 117 | (f'Tacotron(r={r})', str(tts_k) + 'k'), 118 | ('Generation Mode', 'Batched' if batched else 'Unbatched'), 119 | ('Target Samples', 11_000 if batched else 'N/A'), 120 | ('Overlap Samples', 550 if batched else 'N/A')]) 121 | 122 | wer = 0. 123 | itr = 1 124 | for i, x in enumerate(inputs, 1): 125 | 126 | print(f'\n\nGenerating speech for the input passed line {i}...\n') 127 | 128 | _, m, _ = tts_model.generate(x) 129 | 130 | if batched: 131 | sav_path = f'{save_path}/__input_batched{str(batched)}_{tts_k}k_{len(inputs)}_{i}.wav' 132 | 133 | else: 134 | sav_path = f'{save_path}/__input_{"un" + str(batched)}__{tts_k}k_{len(inputs)}_{i}.wav' 135 | 136 | m = torch.tensor(m).unsqueeze(0) 137 | m = (m + 4) / 8 138 | voc_model.generate(m, sav_path, batched, hp.voc_target, hp.voc_overlap, hp.mu_law) 139 | 140 | data, samplerate = soundfile.read(sav_path) 141 | soundfile.write('new.wav', data, samplerate, subtype='PCM_16') 142 | filename = 'new.wav' 143 | 144 | # initialize the recognizer 145 | r = sr.Recognizer() 146 | with sr.AudioFile(filename) as source: 147 | # listen for the data (load audio to memory) 148 | audio_data = r.record(source) 149 | # recognize (convert from speech to text) 150 | text_pre = r.recognize_google(audio_data) 151 | text_pre = "".join(letter for letter in text_pre if letter.isalnum() or letter == " ") 152 | # making the input text case-insensitive 153 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 154 | 155 | text_gt = "".join(letter for letter in inp_txt if letter.isalnum() or letter == " ") 156 | 157 | else: 158 | text_gt = ''.join(letter for letter in inp_txt[i - 1] if letter.isalnum() or letter == " ") 159 | # dropping "hi " here because we added in speech gen 160 | text_gt = text_gt[2:] 161 | 162 | text_pre = text_pre.lower().strip() 163 | text_gt = text_gt.lower().strip() 164 | 165 | if len(text_gt) != len(text_pre): 166 | if len(text_gt) > len(text_pre): 167 | for _ in range(len(text_gt) - len(text_pre)): 168 | text_pre += " " 169 | else: 170 | text_pre = text_pre[:len(text_gt)] 171 | 172 | references = [text_gt] 173 | hypotheses = [text_pre] 174 | 175 | d = dict(predictions=hypotheses, 176 | transcripts=references) 177 | wer += process_evaluation_epoch(d) 178 | itr += 1 179 | print("Input sentence passed::\n", text_gt) 180 | print("Predicted sentence of the model::\n", text_pre) 181 | 182 | wer /= itr - 1 183 | print("Number of sentences:", itr - 1) 184 | print("\nAverage Word Error Rate: {:}%, accuracy={:}%".format(wer * 100, (1 - wer) * 100), "\n") 185 | 186 | 187 | if __name__ == '__main__': 188 | main() 189 | -------------------------------------------------------------------------------- /src/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | Inference code for trained models 5 | below code is adopted from https://github.com/fatchord/WaveRNN 6 | """ 7 | # pylint: 
disable=C0103,C0301,C0413,E0401,E1101,C0415, 8 | 9 | import time 10 | import argparse 11 | import os 12 | import sys 13 | import torch 14 | 15 | from utils.optim_eval import ipex_optimization, load_model, run_inference 16 | from utils import hparams as hp 17 | from utils.text import text_to_sequence 18 | from utils.display import simple_table 19 | from utils.files import get_files 20 | 21 | 22 | def main(): 23 | """ 24 | Main Function 25 | """ 26 | # Parse Arguments 27 | parser = argparse.ArgumentParser(description='Inference') 28 | parser.add_argument('--input_text', '-i', type=str, default=None, 29 | help='[string/csv file] Type in something here and TTS will generate it!') 30 | parser.add_argument('--batched', '-b', dest='batched', action='store_true', default=False, 31 | help='Fast Batched Generation (lower quality)') 32 | parser.add_argument('--unbatched', '-u', dest='batched', action='store_false', 33 | help='Slower Unbatched Generation (better quality)') 34 | parser.add_argument('--force_cpu', '-c', action='store_true', 35 | help='Forces CPU-only training, even when in CUDA capable environment') 36 | parser.add_argument('--hp_file', metavar='FILE', default='hparams.py', 37 | help='The file to use for the hyper parameters') 38 | parser.add_argument('-ipx', '--intel', type=int, required=False, default=0, 39 | help='use 1 for enabling intel pytorch optimizations, default is 0') 40 | parser.add_argument('--save_path', type=str, default='saved_audio', 41 | help='[string/path] where to store the speech files generated for the input text, ' 42 | 'default saved_audio folder') 43 | parser.add_argument('--voc_weights', type=str, default='pretrained/voc_weights/latest_weights.pyt', 44 | help='[string/path] Load in different WaveRNN weights') 45 | parser.add_argument('--tts_weights', type=str, default='pretrained/tts_weights/latest_weights.pyt', 46 | help='[string/path] Load in different Tacotron weights') 47 | args = parser.parse_args() 48 | 49 | hp.configure(args.hp_file) # Load hparams from file 50 | 51 | parser.set_defaults(input_text=None) 52 | 53 | batched = args.batched 54 | input_text = args.input_text 55 | intel_flag = args.intel 56 | tts_weights = args.tts_weights 57 | voc_weights = args.voc_weights 58 | save_path = args.save_path 59 | 60 | # creating the save path directory to store the output generated audio file if it does not exist 61 | os.makedirs(save_path, exist_ok=True) 62 | 63 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 64 | inputs = [text_to_sequence(input_text.strip(), hp.tts_cleaner_names)] 65 | else: 66 | with open(input_text) as f: 67 | 68 | inputs = [] 69 | cnt = 0 70 | for line in f: 71 | split = line.split(',') 72 | sentence = split[-1][:-1].strip() 73 | if cnt > 0: 74 | inputs.append(text_to_sequence(sentence.strip(), hp.tts_cleaner_names)) 75 | cnt += 1 76 | 77 | if not args.force_cpu and torch.cuda.is_available(): 78 | device = torch.device('cuda') 79 | else: 80 | device = torch.device('cpu') 81 | print('Using device:', device) 82 | 83 | voc_model = load_model(voc_weights, 'w', device) 84 | tts_model = load_model(tts_weights, 't', device) 85 | 86 | tts_model, voc_model = ipex_optimization(tts_model, voc_model, intel_flag) 87 | 88 | voc_k = voc_model.get_step() // 1000 89 | tts_k = tts_model.get_step() // 1000 90 | 91 | r = tts_model.r 92 | 93 | simple_table([('WaveRNN', str(voc_k) + 'k'), 94 | (f'Tacotron(r={r})', str(tts_k) + 'k'), 95 | ('Generation Mode', 'Batched' if batched else 'Unbatched'), 96 | ('Target Samples', 11_000 if batched 
else 'N/A'), 97 | ('Overlap Samples', 550 if batched else 'N/A')]) 98 | 99 | print("\nWarming Up the models for inference.....") 100 | for _ in range(0, 10): 101 | wr_up = True 102 | warm_up_ip = "This is an input used for warming up our models" 103 | wr_up_ip = [text_to_sequence(warm_up_ip.strip(), hp.tts_cleaner_names)] 104 | run_inference(wr_up_ip, tts_model, voc_model, batched, wr_up, save_path) 105 | 106 | print('\nFinished warmup.\nGenerating speech for the input passed...\n') 107 | 108 | wr_up = False 109 | run_inference(inputs, tts_model, voc_model, batched, wr_up, save_path) 110 | 111 | print('\n\nDone.\n') 112 | 113 | 114 | if __name__ == "__main__": 115 | main() 116 | -------------------------------------------------------------------------------- /src/optim_eval.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | below code is adopted from https://github.com/fatchord/WaveRNN 5 | """ 6 | 7 | import time 8 | from typing import List 9 | from pathlib import Path 10 | import torch 11 | from utils.text.symbols import symbols 12 | from models.fatchord_version import WaveRNN 13 | from models.tacotron import Tacotron 14 | from utils import hparams as hp 15 | 16 | 17 | def ipex_optimization(t_model, v_model, i_flag): 18 | """ 19 | 20 | Args: 21 | t_model: Tacotron model 22 | v_model: WaveRNN model 23 | i_flag: Intel flag to enable ipex 24 | 25 | Returns: optimized models if intel ipex enabled 26 | 27 | """ 28 | t_model.eval() 29 | v_model.eval() 30 | if i_flag: 31 | import intel_extension_for_pytorch as ipex 32 | t_model = ipex.optimize(t_model) 33 | v_model = ipex.optimize(v_model) 34 | print('\nINTEL IPEX Optimizations Enabled.\n') 35 | return t_model, v_model 36 | 37 | 38 | def load_model(model_weights, typ, dv, return_model=False): 39 | """ 40 | 41 | Args: 42 | return_model: Returns without weights 43 | model_weights: weights to load the model 44 | typ: model type (Tacotron / WaveRNN) 45 | dv: device 46 | 47 | Returns: loaded model 48 | 49 | """ 50 | 51 | if typ == 'w': 52 | 53 | print('\nInitialising WaveRNN Model...\n') 54 | 55 | # Instantiate WaveRNN Model 56 | model = WaveRNN(rnn_dims=hp.voc_rnn_dims, 57 | fc_dims=hp.voc_fc_dims, 58 | bits=hp.bits, 59 | pad=hp.voc_pad, 60 | upsample_factors=hp.voc_upsample_factors, 61 | feat_dims=hp.num_mels, 62 | compute_dims=hp.voc_compute_dims, 63 | res_out_dims=hp.voc_res_out_dims, 64 | res_blocks=hp.voc_res_blocks, 65 | hop_length=hp.hop_length, 66 | sample_rate=hp.sample_rate, 67 | mode='MOL').to(dv) 68 | 69 | if not return_model: 70 | model.load(model_weights) 71 | 72 | else: 73 | 74 | print('\nInitialising Tacotron Model...\n') 75 | 76 | # Instantiate Tacotron Model 77 | model = Tacotron(embed_dims=hp.tts_embed_dims, 78 | num_chars=len(symbols), 79 | encoder_dims=hp.tts_encoder_dims, 80 | decoder_dims=hp.tts_decoder_dims, 81 | n_mels=hp.num_mels, 82 | fft_bins=hp.num_mels, 83 | postnet_dims=hp.tts_postnet_dims, 84 | encoder_K=hp.tts_encoder_K, 85 | lstm_dims=hp.tts_lstm_dims, 86 | postnet_K=hp.tts_postnet_K, 87 | num_highways=hp.tts_num_highways, 88 | dropout=hp.tts_dropout, 89 | stop_threshold=hp.tts_stop_threshold).to(dv) 90 | 91 | if not return_model: 92 | model.load(model_weights) 93 | 94 | return model 95 | 96 | 97 | def run_inference(text, taco_model, wave_model, batch, wrm_up, sv_path): 98 | """ 99 | Performs the inference by taking input text and then give speech 100 | """ 101 | 102 | sav_path = None 
103 | tts = taco_model.get_step() // 1000 104 | inference_time = time.time() 105 | voc_time = 0 106 | tac_time = 0 107 | for i, x in enumerate(text, 1): 108 | 109 | if not wrm_up: 110 | 111 | tac_time = time.time() 112 | 113 | _, m, _ = taco_model.generate(x) 114 | 115 | if not wrm_up: 116 | print("\nTime taken by Tacotron model for inference is ", (time.time() - tac_time)) 117 | voc_time = time.time() 118 | 119 | if batch: 120 | sav_path = f'{sv_path}/__input_batched{str(batch)}_{tts}k_{len(text)}_{i}.wav' 121 | 122 | else: 123 | sav_path = f'{sv_path}/__input_{"un" + str(batch)}__{tts}k_{len(text)}_{i}.wav' 124 | 125 | m = torch.tensor(m).unsqueeze(0) 126 | m = (m + 4) / 8 127 | # import pdb; pdb.set_trace() 128 | wave_model.generate(m, sav_path, batch, hp.voc_target, hp.voc_overlap, hp.mu_law) 129 | if not wrm_up: 130 | print("\nTime taken by WaveRNN model for inference is ", (time.time() - voc_time)) 131 | 132 | if not wrm_up: 133 | print("\nTotal time for inference is ", (time.time() - inference_time)) 134 | 135 | return sav_path 136 | 137 | 138 | def levenshtein(a: List, b: List) -> int: 139 | """Calculates the Levenshtein distance between a and b. 140 | """ 141 | n, m = len(a), len(b) 142 | if n > m: 143 | # Make sure n <= m, to use O(min(n,m)) space 144 | a, b = b, a 145 | n, m = m, n 146 | 147 | current = list(range(n + 1)) 148 | for i in range(1, m + 1): 149 | previous, current = current, [i] + [0] * n 150 | for j in range(1, n + 1): 151 | add, delete = previous[j] + 1, current[j - 1] + 1 152 | change = previous[j - 1] 153 | if a[j - 1] != b[i - 1]: 154 | change = change + 1 155 | current[j] = min(add, delete, change) 156 | 157 | return current[n] 158 | 159 | 160 | def word_error_rate(hypotheses: List[str], references: List[str]): 161 | """ 162 | Computes Average Word Error rate between two texts represented as 163 | corresponding lists of string. Hypotheses and references must have same length. 164 | Args: 165 | hypotheses: list of hypotheses / predictions 166 | references: list of references / ground truth 167 | """ 168 | scores = 0 169 | words = 0 170 | if len(hypotheses) != len(references): 171 | raise ValueError("In word error rate calculation, hypotheses and reference" 172 | " lists must have the same number of elements. 
174 |                          " {0} and {1} respectively.".format(len(hypotheses), len(references)))
175 |     for h, r in zip(hypotheses, references):
176 |         h_list = h.split()
177 |         r_list = r.split()
178 |         words += len(r_list)
179 |         scores += levenshtein(h_list, r_list)
180 |     if words != 0:
181 |         wer = (1.0 * scores) / words
182 |     else:
183 |         wer = float('inf')
184 |     return wer, scores, words
185 | 
186 | 
187 | def gen_testset(model: WaveRNN, test_set, samples, batched, target, overlap, save_path: Path):
188 |     k = model.get_step() // 1000
189 |     # Generate a few test-set utterances with the vocoder so training progress can be checked by ear
190 |     for i, (m, x) in enumerate(test_set, 1):
191 | 
192 |         if i > samples:
193 |             break
194 | 
195 |         print('\n| Generating: %i/%i' % (i, samples))
196 | 
197 |         x = x[0].numpy()
198 | 
199 |         bits = 16 if hp.voc_mode == 'MOL' else hp.bits
200 | 
201 |         if hp.mu_law and hp.voc_mode != 'MOL':
202 |             x = decode_mu_law(x, 2 ** bits, from_labels=True)
203 |         else:
204 |             x = label_2_float(x, bits)
205 | 
206 |         save_wav(x, save_path / f'{k}k_steps_{i}_target.wav')
207 | 
208 |         batch_str = f'gen_batched_target{target}_overlap{overlap}' if batched else 'gen_NOT_BATCHED'
209 |         save_str = str(save_path / f'{k}k_steps_{i}_{batch_str}.wav')
210 | 
211 |         _ = model.generate(m, save_str, batched, target, overlap, hp.mu_law)
212 | 
--------------------------------------------------------------------------------
/src/training.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2023 Intel Corporation
2 | # SPDX-License-Identifier: BSD-3-Clause
3 | """
4 | Train the Tacotron (TTS) and WaveRNN (vocoder) models
5 | The code below is adapted from https://github.com/fatchord/WaveRNN
6 | """
7 | # pylint: disable=C0103,C0301,C0413,E0401,W0614,C0412,W0401,W0613,R0914,R0913,R0915
8 | 
9 | import time
10 | import argparse
11 | from pathlib import Path
12 | import numpy as np
13 | import torch
14 | from torch import optim
15 | import torch.nn.functional as F
16 | 
17 | from utils import hparams as hp
18 | from utils.display import *
19 | from utils.dataset import get_tts_datasets, get_vocoder_datasets
20 | from utils.text.symbols import symbols
21 | from utils.distribution import discretized_mix_logistic_loss
22 | from utils.paths import Paths
23 | from models.tacotron import Tacotron
24 | from models.fatchord_version import WaveRNN
25 | from utils import data_parallel_workaround
26 | from utils.optim_eval import gen_testset
27 | from utils.checkpoints import save_checkpoint, restore_checkpoint
28 | 
29 | 
30 | def np_now(x: torch.Tensor):
31 |     """
32 |     Convert a torch Tensor to a numpy array
33 |     """
34 |     return x.detach().cpu().numpy()
35 | 
36 | 
37 | def main():
38 |     """
39 |     Main method
40 |     """
41 |     # Parse Arguments
42 |     parser = argparse.ArgumentParser(description='Train Tacotron TTS & WaveRNN Voc')
43 |     parser.add_argument('--force_gta', '-g', action='store_true', help='Force the model to create GTA features')
44 |     parser.add_argument('--force_cpu', '-c', action='store_true', help='Forces CPU-only training, even when in a CUDA-'
45 |                                                                        'capable environment')
46 |     parser.add_argument('--lr', '-l', type=float, help='[float] override hparams.py learning rate')
47 |     parser.add_argument('--batch_size', '-b', type=int, help='[int] override hparams.py batch size')
48 |     parser.add_argument('--hp_file', metavar='FILE', default='src/utils/hparams.py',
49 |                         help='The file to use for the hyper parameters')
50 |     parser.add_argument('--epochs', '-e', type=int, default=None,
51 |                         help='[int] number of epochs for training')
52 | 
53 | 
54 |     args = parser.parse_args()
55 | 
56 |     hp.configure(args.hp_file)  # 
Load hparams from file 57 | paths = Paths(hp.data_path, hp.voc_model_id, hp.tts_model_id) 58 | 59 | if args.lr is None: 60 | args.lr = hp.voc_lr 61 | if args.batch_size is None: 62 | args.batch_size = hp.voc_batch_size 63 | 64 | batch_size = args.batch_size 65 | lr = args.lr 66 | train_gta = args.force_gta 67 | epochs = args.epochs 68 | 69 | if not args.force_cpu and torch.cuda.is_available(): 70 | device = torch.device('cuda') 71 | for session in hp.tts_schedule: 72 | _, _, _, batch_size = session 73 | if batch_size % torch.cuda.device_count() != 0: 74 | raise ValueError('`batch_size` must be evenly divisible by n_gpus!') 75 | else: 76 | device = torch.device('cpu') 77 | print('Using device:', device) 78 | 79 | # Instantiate Tacotron Model 80 | print('\nInitialising Tacotron Model...\n') 81 | model = Tacotron(embed_dims=hp.tts_embed_dims, 82 | num_chars=len(symbols), 83 | encoder_dims=hp.tts_encoder_dims, 84 | decoder_dims=hp.tts_decoder_dims, 85 | n_mels=hp.num_mels, 86 | fft_bins=hp.num_mels, 87 | postnet_dims=hp.tts_postnet_dims, 88 | encoder_K=hp.tts_encoder_K, 89 | lstm_dims=hp.tts_lstm_dims, 90 | postnet_K=hp.tts_postnet_K, 91 | num_highways=hp.tts_num_highways, 92 | dropout=hp.tts_dropout, 93 | stop_threshold=hp.tts_stop_threshold).to(device) 94 | 95 | optimizer = optim.Adam(model.parameters()) 96 | 97 | restore_checkpoint('tts', paths, model, optimizer, create_if_missing=True) 98 | 99 | start = time.time() 100 | for _, session in enumerate(hp.tts_schedule): 101 | current_step = model.get_step() 102 | 103 | r, lr, max_step, batch_size = session 104 | 105 | training_steps = max_step - current_step 106 | model.r = r 107 | simple_table([(f'Steps with r={r}', str(training_steps // 1000) + 'k Steps'), 108 | ('Batch Size', batch_size), 109 | ('Learning Rate', lr), 110 | ('Outputs/Step (r)', model.r)]) 111 | 112 | train_set, attn_example = get_tts_datasets(paths.data, batch_size, r) 113 | 114 | tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example, epochs) 115 | 116 | print("Total training time is ", (time.time() - start)) 117 | print('Training Tacotron model is Completed.') 118 | 119 | 120 | print('Creating Ground Truth Aligned Dataset...\n') 121 | 122 | train_set, attn_example = get_tts_datasets(paths.data, 32, model.r) 123 | create_gta_features(model, train_set, paths.gta) 124 | 125 | print('\n\nWe can now train WaveRNN on GTA features\n') 126 | 127 | print('\nInitialising WaveRNN Model...\n') 128 | 129 | # Instantiate WaveRNN Model 130 | voc_model = WaveRNN(rnn_dims=hp.voc_rnn_dims, 131 | fc_dims=hp.voc_fc_dims, 132 | bits=hp.bits, 133 | pad=hp.voc_pad, 134 | upsample_factors=hp.voc_upsample_factors, 135 | feat_dims=hp.num_mels, 136 | compute_dims=hp.voc_compute_dims, 137 | res_out_dims=hp.voc_res_out_dims, 138 | res_blocks=hp.voc_res_blocks, 139 | hop_length=hp.hop_length, 140 | sample_rate=hp.sample_rate, 141 | mode=hp.voc_mode).to(device) 142 | 143 | # Check to make sure the hop length is correctly factorised 144 | assert np.cumprod(hp.voc_upsample_factors)[-1] == hp.hop_length 145 | 146 | optimizer = optim.Adam(voc_model.parameters()) 147 | 148 | restore_checkpoint('voc', paths, voc_model, optimizer, create_if_missing=True) 149 | 150 | train_set, test_set = get_vocoder_datasets(paths.data, batch_size, train_gta) 151 | 152 | total_steps = hp.voc_total_steps 153 | 154 | simple_table([('Remaining', str((total_steps - voc_model.get_step()) // 1000) + 'k Steps'), 155 | ('Batch Size', batch_size), 156 | ('LR', lr), 157 | ('Sequence Len', hp.voc_seq_len), 
158 |                   ('GTA Train', train_gta)])
159 | 
160 |     loss_func = F.cross_entropy if voc_model.mode == 'RAW' else discretized_mix_logistic_loss
161 | 
162 |     start = time.time()
163 |     voc_train_loop(paths, voc_model, loss_func, optimizer, train_set, test_set, lr, total_steps, epochs)
164 | 
165 |     print("Total training time to train the WaveRNN model is ", (time.time() - start))
166 |     print('\nTraining completed for both the Tacotron and WaveRNN models.')
167 | 
168 | 
169 | def tts_train_loop(paths: Paths, model: Tacotron, optimizer, train_set, lr, train_steps, attn_example, eps=None):
170 |     """
171 |     Training loop for the Tacotron model
172 |     """
173 |     device = next(model.parameters()).device  # use same device as model parameters
174 | 
175 |     for g in optimizer.param_groups:
176 |         g['lr'] = lr
177 | 
178 |     total_iters = len(train_set)
179 | 
180 |     epochs = eps if eps else train_steps // total_iters + 1
181 | 
182 | 
183 |     msg = None
184 |     start = time.time()
185 |     for e in range(1, epochs + 1):
186 | 
187 |         running_loss = 0
188 |         strt = time.time()
189 |         # One training iteration per batch from the training set
190 |         for i, (x, m, ids, _) in enumerate(train_set, 1):
191 | 
192 |             x, m = x.to(device), m.to(device)
193 | 
194 |             # Parallelize model onto GPUs using workaround due to Python bug
195 |             if device.type == 'cuda' and torch.cuda.device_count() > 1:
196 |                 m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m)
197 |             else:
198 |                 m1_hat, m2_hat, attention = model(x, m)
199 | 
200 |             m1_loss = F.l1_loss(m1_hat, m)
201 |             m2_loss = F.l1_loss(m2_hat, m)
202 | 
203 |             loss = m1_loss + m2_loss
204 | 
205 |             optimizer.zero_grad()
206 |             loss.backward()
207 |             if hp.tts_clip_grad_norm is not None:
208 |                 grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.tts_clip_grad_norm)
209 |                 if np.isnan(grad_norm):
210 |                     print('grad_norm was NaN!')
211 | 
212 |             optimizer.step()
213 | 
214 |             running_loss += loss.item()
215 |             avg_loss = running_loss / i
216 | 
217 |             speed = i / (time.time() - strt)
218 | 
219 |             step = model.get_step()
220 |             k = step // 1000
221 | 
222 |             if step % hp.tts_checkpoint_every == 0:
223 |                 ckpt_name = f'taco_step{k}K'
224 |                 save_checkpoint('tts', paths, model, optimizer,
225 |                                 name=ckpt_name, is_silent=True)
226 | 
227 |             if attn_example in ids:
228 |                 idx = ids.index(attn_example)
229 |                 save_attention(np_now(attention[idx][:, :160]), paths.tts_attention / f'{step}')
230 |                 save_spectrogram(np_now(m2_hat[idx]), paths.tts_mel_plot / f'{step}', 600)
231 | 
232 |             msg = f'| Epoch: {e}/{epochs} ({i}/{total_iters}) | Loss: ' \
233 |                   f'{avg_loss:#.4} | {speed:#.2} steps/s | Step: {k}k | '
234 |             stream(msg)
235 | 
236 |         save_checkpoint('tts', paths, model, optimizer, is_silent=True)
237 |         model.log(paths.tts_log, msg)
238 |         print(' ')
239 |     print("Total training time for the Tacotron model is ", (time.time() - start))
240 | 
241 | 
242 | def create_gta_features(model: Tacotron, train_set, save_path: Path):
243 |     """
244 |     Creating Ground Truth Aligned features in case we use them for training later
245 |     """
246 |     device = next(model.parameters()).device  # use same device as model parameters
247 | 
248 |     iters = len(train_set)
249 | 
250 |     for i, (x, mels, ids, mel_lens) in enumerate(train_set, 1):
251 | 
252 |         x, mels = x.to(device), mels.to(device)
253 | 
254 |         with torch.no_grad():
255 |             _, gta, _ = model(x, mels)
256 | 
257 |         gta = gta.cpu().numpy()
258 | 
259 |         for j, item_id in enumerate(ids):
260 |             mel = gta[j][:, :mel_lens[j]]
261 |             mel = (mel + 4) / 8
262 |             np.save(save_path / f'{item_id}.npy', mel, allow_pickle=False)
263 | 
264 |         bar1 = 
progbar(i, iters) 265 | msg = f'{bar1} {i}/{iters} Batches ' 266 | stream(msg) 267 | 268 | 269 | def voc_train_loop(paths: Paths, model: WaveRNN, loss_func, optimizer, train_set, test_set, lr, total_steps, eps=None): 270 | """ 271 | Training WaveRNN model 272 | """ 273 | # Use same device as model parameters 274 | device = next(model.parameters()).device 275 | 276 | for g in optimizer.param_groups: 277 | g['lr'] = lr 278 | 279 | total_iters = len(train_set) 280 | 281 | epochs = eps if eps else (total_steps - model.get_step()) // total_iters + 1 282 | for e in range(1, epochs + 1): 283 | 284 | start = time.time() 285 | running_loss = 0. 286 | 287 | for i, (x, y, m) in enumerate(train_set, 1): 288 | x, m, y = x.to(device), m.to(device), y.to(device) 289 | 290 | # Parallelize model onto GPUS using workaround due to python bug 291 | if device.type == 'cuda' and torch.cuda.device_count() > 1: 292 | y_hat = data_parallel_workaround(model, x, m) 293 | else: 294 | y_hat = model(x, m) 295 | 296 | if model.mode == 'RAW': 297 | y_hat = y_hat.transpose(1, 2).unsqueeze(-1) 298 | 299 | elif model.mode == 'MOL': 300 | y = y.float() 301 | 302 | y = y.unsqueeze(-1) 303 | 304 | loss = loss_func(y_hat, y) 305 | 306 | optimizer.zero_grad() 307 | loss.backward() 308 | if hp.voc_clip_grad_norm is not None: 309 | grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.voc_clip_grad_norm) 310 | if np.isnan(grad_norm): 311 | print('grad_norm was NaN!') 312 | optimizer.step() 313 | 314 | running_loss += loss.item() 315 | avg_loss = running_loss / i 316 | 317 | speed = i / (time.time() - start) 318 | 319 | step = model.get_step() 320 | k = step // 1000 321 | 322 | if step % hp.voc_checkpoint_every == 0: 323 | gen_testset(model, test_set, hp.voc_gen_at_checkpoint, hp.voc_gen_batched, 324 | hp.voc_target, hp.voc_overlap, paths.voc_output) 325 | ckpt_name = f'wave_step{k}K' 326 | save_checkpoint('voc', paths, model, optimizer, 327 | name=ckpt_name, is_silent=True) 328 | 329 | msg = f'| Epoch: {e}/{epochs} ({i}/{total_iters}) | Loss: ' \ 330 | f'{avg_loss:.4f} | {speed:.1f} steps/s | Step: {k}k | ' 331 | stream(msg) 332 | 333 | # Must save latest optimizer state to ensure that resuming training 334 | # doesn't produce artifacts 335 | save_checkpoint('voc', paths, model, optimizer, is_silent=True) 336 | model.log(paths.voc_log, msg) 337 | print(' ') 338 | 339 | 340 | if __name__ == "__main__": 341 | main() 342 | --------------------------------------------------------------------------------
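Usage note (illustrative addition, not a file from the repository): the minimal sketch below shows one way the helpers in src/optim_eval.py could be wired together for a single CPU synthesis run, assuming the WaveRNN-style utils/ and models/ packages are importable (for example with src/ on PYTHONPATH). The checkpoint file names, the output directory, and the exact import path of text_to_sequence are placeholders, not names taken from the repository.

    import torch
    from utils import hparams as hp
    from utils.text import text_to_sequence          # assumed location, as in the WaveRNN codebase
    from optim_eval import load_model, ipex_optimization, run_inference

    hp.configure('src/utils/hparams.py')             # same default hparams file as training.py

    device = torch.device('cpu')
    tts_model = load_model('tts_weights.pyt', typ='t', dv=device)   # hypothetical Tacotron checkpoint
    voc_model = load_model('voc_weights.pyt', typ='w', dv=device)   # hypothetical WaveRNN checkpoint

    # i_flag=True would apply IPEX optimizations (requires intel_extension_for_pytorch)
    tts_model, voc_model = ipex_optimization(tts_model, voc_model, i_flag=False)

    # Tokenize the sentence the same way inference.py does for its warm-up input
    sentence = "A short sentence to synthesise."
    inputs = [text_to_sequence(sentence.strip(), hp.tts_cleaner_names)]

    wav_path = run_inference(inputs, tts_model, voc_model,
                             batch=hp.voc_gen_batched, wrm_up=False,
                             sv_path='outputs')      # hypothetical output directory
    print('Generated:', wav_path)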