├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SECURITY.md ├── assets ├── E2E_intel.png └── E2E_stock.png ├── data ├── data.txt ├── pretrained_models.txt └── product_description.csv ├── env ├── intel │ └── intel-voice.yml └── stock │ └── stock-voice.yml └── src ├── evaluation.py ├── inference.py ├── optim_eval.py └── training.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, caste, color, religion, or sexual 10 | identity and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the overall 26 | community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or advances of 31 | any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email address, 35 | without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | CommunityCodeOfConduct AT intel DOT com. 64 | All complaints will be reviewed and investigated promptly and fairly. 
65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series of 86 | actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or permanent 93 | ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within the 113 | community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.1, available at 119 | [https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. 120 | 121 | Community Impact Guidelines were inspired by 122 | [Mozilla's code of conduct enforcement ladder][Mozilla CoC]. 123 | 124 | For answers to common questions about this code of conduct, see the FAQ at 125 | [https://www.contributor-covenant.org/faq][FAQ]. Translations are available at 126 | [https://www.contributor-covenant.org/translations][translations]. 127 | 128 | [homepage]: https://www.contributor-covenant.org 129 | [v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html 130 | [Mozilla CoC]: https://github.com/mozilla/diversity 131 | [FAQ]: https://www.contributor-covenant.org/faq -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | ### License 4 | 5 | The Synthetic Voice/Audio Generation reference kit is licensed under the terms in [LICENSE](https://github.com/oneapi-src/voice-data-generation/blob/main/LICENSE).
By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms. 6 | 7 | ### Sign your work 8 | 9 | Please use the sign-off line at the end of the patch. Your signature certifies that you wrote the patch or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: if you can certify 10 | the below (from [developercertificate.org](http://developercertificate.org/)): 11 | 12 | ``` 13 | Developer Certificate of Origin 14 | Version 1.1 15 | 16 | Copyright (C) 2004, 2006 The Linux Foundation and its contributors. 17 | 660 York Street, Suite 102, 18 | San Francisco, CA 94110 USA 19 | 20 | Everyone is permitted to copy and distribute verbatim copies of this 21 | license document, but changing it is not allowed. 22 | 23 | Developer's Certificate of Origin 1.1 24 | 25 | By making a contribution to this project, I certify that: 26 | 27 | (a) The contribution was created in whole or in part by me and I 28 | have the right to submit it under the open source license 29 | indicated in the file; or 30 | 31 | (b) The contribution is based upon previous work that, to the best 32 | of my knowledge, is covered under an appropriate open source 33 | license and I have the right under that license to submit that 34 | work with modifications, whether created in whole or in part 35 | by me, under the same open source license (unless I am 36 | permitted to submit under a different license), as indicated 37 | in the file; or 38 | 39 | (c) The contribution was provided directly to me by some other 40 | person who certified (a), (b) or (c) and I have not modified 41 | it. 42 | 43 | (d) I understand and agree that this project and the contribution 44 | are public and that a record of the contribution (including all 45 | personal information I submit with it, including my sign-off) is 46 | maintained indefinitely and may be redistributed consistent with 47 | this project or the open source license(s) involved. 48 | ``` 49 | 50 | Then you just add a line to every git commit message: 51 | 52 | Signed-off-by: Joe Smith 53 | 54 | Use your real name (sorry, no pseudonyms or anonymous contributions.) 55 | 56 | If you set your `user.name` and `user.email` git configs, you can sign your 57 | commit automatically with `git commit -s`. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2023, Intel Corporation 2 | 3 | Redistribution and use in source and binary forms, with or without 4 | modification, are permitted provided that the following conditions are met: 5 | 6 | * Redistributions of source code must retain the above copyright notice, 7 | this list of conditions and the following disclaimer. 8 | * Redistributions in binary form must reproduce the above copyright 9 | notice, this list of conditions and the following disclaimer in the 10 | documentation and/or other materials provided with the distribution. 11 | * Neither the name of Intel Corporation nor the names of its contributors 12 | may be used to endorse or promote products derived from this software 13 | without specific prior written permission. 14 | 15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 16 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 17 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 18 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE 19 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 20 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 21 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 22 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 23 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 24 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | PROJECT NOT UNDER ACTIVE MANAGEMENT 2 | 3 | This project will no longer be maintained by Intel. 4 | 5 | Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project. 6 | 7 | Intel no longer accepts patches to this project. 8 | 9 | If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project. 10 | 11 | Contact: webadmin@linux.intel.com 12 | # Applications of Synthetic Voice/Audio Generation Using PyTorch 13 | ## Introduction 14 | Synthetic voice is computer-generated speech. [More Info](https://en.wikipedia.org/wiki/Speech_synthesis) 15 | ## Table of Contents 16 | - [Purpose](#purpose) 17 | - [Reference Solution](#reference-solution) 18 | - [Reference Implementation](#reference-implementation) 19 | 20 | ## Purpose 21 | With the surge in AI implementation in modern industrial solutions, the demand for datasets has increased significantly to build robust and reliable AI models. The challenges associated with data privacy, the higher cost of purchasing real datasets, limited data availability, the accuracy of data labeling, and the lack of scalability and variety are driving the use of synthetic data to fulfill the high demand for AI solutions across industries. 22 | 23 | Synthetic voice has wide applications in virtual assistants, education, healthcare, multimedia and entertainment. Text-to-Speech (TTS) is one method of generating synthetic voice. It creates human speech artificially. One of the main benefits of voice synthesis is to make information easily accessible to a wider audience. For example, people with visual impairments or reading disabilities can use this technology to read written content aloud. This can help differently-abled people access a wider range of information and communicate more easily with others. 24 | 25 | Voice synthesis technology is increasingly used to create more natural-sounding virtual assistants and chatbots, which can improve the user experience and engagement through personalized communication based on voice and language preferences. 26 | 27 | TTS is evolving in a variety of ways. For example, voice cloning can capture your brand essence and express it through a machine. Speech cloning allows you to use TTS in conjunction with voice recording datasets to reproduce the voices of known persons such as executives and celebrities, which can be valuable for businesses in industries such as entertainment. 28 | ## Reference Solution 29 | An AI-enabled synthetic voice generator takes simple generated text (a context or story) as input and produces a purely system-generated synthetic voice.
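The paragraphs below describe the approach in detail. As a preview of how the two stages fit together, the following minimal sketch condenses the inference flow found later in this repository's `src/inference.py` and `src/optim_eval.py`; it assumes the WaveRNN repository has been cloned and the pretrained weights unpacked as described in `data/pretrained_models.txt`, and is an illustration rather than a drop-in replacement for those scripts.

```python
# Minimal sketch of the two-stage text-to-speech flow in this reference kit,
# condensed from src/inference.py and src/optim_eval.py. Paths to the
# hyperparameter file and pretrained weights follow the setup steps described
# later in this README; run it from inside the cloned WaveRNN folder.
import os
import torch

from utils import hparams as hp
from utils.optim_eval import load_model
from utils.text import text_to_sequence

hp.configure('hparams.py')                      # load hyperparameters
device = torch.device('cpu')
os.makedirs('saved_audio', exist_ok=True)

# Stage 1: Tacotron turns a text sequence into a mel spectrogram
tts_model = load_model('pretrained/tts_weights/latest_weights.pyt', 't', device)
# Stage 2: the WaveRNN vocoder turns the mel spectrogram into a waveform
voc_model = load_model('pretrained/voc_weights/latest_weights.pyt', 'w', device)

text = "From fairest creatures we desire increase"
x = text_to_sequence(text.strip(), hp.tts_cleaner_names)    # text -> symbol IDs

_, mel, _ = tts_model.generate(x)               # predict the mel spectrogram
mel = (torch.tensor(mel).unsqueeze(0) + 4) / 8  # rescale to the range WaveRNN expects

# Render and save the waveform (batched generation, as in the scripts)
voc_model.generate(mel, 'saved_audio/sample.wav', True,
                   hp.voc_target, hp.voc_overlap, hp.mu_law)
```

The full scripts add argument parsing, optional Intel® Extension for PyTorch* optimizations, warm-up runs, and word-error-rate evaluation on top of this core flow.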
30 | 31 | The goal of this reference kit is to translate the input text data into speech. A transfer learning approach is applied to advanced PyTorch-based pre-trained Tacotron and WaveRNN (VOCODER) models. This model combination is known to be a promising method to synthesize voice data from the corresponding input text data. The LJ Speech dataset, after pre-processing using NumPy, is used to further train the mentioned pre-trained models. From the input text data, the model generates speech that mimics the voice of the LJ Speech dataset that was used to train the AI model. 32 | 33 | GPUs are typically the choice for deep learning and AI processing to achieve a higher performance rate. To offer a more cost-effective option leveraging a CPU, the quantization technique can be used with the Intel® oneAPI AI Analytics Toolkit to achieve a higher performance rate by performing vectorized operations on the CPU itself. 34 | 35 | Quantizing/compressing the model (from a floating-point to an integer model), while maintaining a similar level of accuracy to the floating-point model, demonstrates efficient utilization of underlying resources when deployed on edge devices with low processing and memory capabilities. 36 | 37 | ## Reference Implementation 38 | ### Use Case End-To-End flow 39 | ![Use_case_flow](assets/E2E_stock.png) 40 | 41 | **Description:** The open-source LJ Speech voice dataset is first preprocessed with NumPy before being used to train advanced pre-trained Tacotron and WaveRNN (VOCODER) models with Stock PyTorch v1.13.0. Following training, the trained Stock PyTorch v1.13.0 models are used to generate synthetic voice data from the input text sentence. 42 | 43 | ### Expected Input-Output 44 | 45 | 46 | | **Input** | **Output** | 47 | |:-------------------------------------:|-----| 48 | | Text Sentence | Synthesized Audio data | 49 | 50 | 51 | ### Reference Sources 52 | 53 | *DataSet*: https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 (2.6 GB dataset for this use case)
54 | *Case Study & Repo*: https://github.com/fatchord/WaveRNN 55 | 56 | > ***Please see this data set's applicable license for terms and conditions. Intel® Corporation does not own the rights to this data set and does not confer any rights to it.*** 57 | 58 | ### Repository clone and Anaconda installation 59 | 60 | ``` 61 | git clone https://github.com/oneapi-src/voice-data-generation 62 | cd voice-data-generation 63 | ``` 64 | 65 | > **Note**: If you are beginning to explore the reference kits on client machines such as a Windows laptop, go to the [Running on Windows](#running-on-windows) section to ensure you are all set, then come back here. 66 | 67 | > **Note**: The performance measurements were captured on Xeon-based processors. The instructions will work on WSL; however, some portions of the reference kits may run slower on a client machine, so utilize the supported flags to modify the epochs/batch size to run the training or inference faster. Additionally, the reported performance claims may not be observed on a Windows-based client machine. 68 | 69 | > **Note**: This reference kit implementation already provides the necessary conda environment configurations to set up the software requirements. To utilize these environment scripts, first install Anaconda/Miniconda by following the instructions at the following link: 70 | > [Anaconda installation](https://docs.anaconda.com/anaconda/install/linux/) 71 | 72 | ### Usage and Instructions 73 | 74 | Below are the steps to reproduce the benchmarking results given in this repository: 75 | 1. Creating the execution environment 76 | 2. Dataset preparation 77 | 3. Training Tacotron & WaveRNN models 78 | 4. Evaluation 79 | 5. Model Inference 80 | 81 | ### Software Requirements 82 | | **Package** | **Stock Python** 83 | |:-------------------------| :--- 84 | | Python | python==3.8.15 85 | | PyTorch | torch==1.13.0 86 | 87 | ### Environment 88 | Below is the developer environment used for this module on Azure. All the observations captured are based on this environment setup. 89 | 90 | 91 | | **Size** | **CPU Cores** | **Memory** | **Intel® CPU Family** | 92 | |----------|:-------------:|:----------:|:---------------------:| 93 | | NA | 8 | 32GB | ICELAKE | 94 | 95 | ### Solution setup 96 | The below file is used to create an environment as follows: 97 | 98 | 99 | | **YAML file** | **Environment Name** | **Configuration** | 100 | |:---------------------:|----------------------|:---------------------------------------:| 101 | | `env/stock/stock-voice.yml` | `stock-voice` | Python=3.8.15 with stock PyTorch v1.13.0 | 102 | 103 | ### Dataset 104 | This is a public domain dataset containing short audio clips of a single speaker reading passages from 7 different non-fiction books. 105 | 106 | | **Use case** | Speech Generation 107 | |:-------------------------| :--- 108 | | **Data Format** | Audio File in ".wav" format 109 | | **Size** | Total 13100 short audio files
110 | **Source** | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 111 | 112 | > **Note**: Please refer to the data.txt file in the "data" folder for instructions on downloading the dataset. 113 | 114 | ### Training 115 | 116 | We train the Tacotron model on the preprocessed data, which generates Ground Truth Aligned (GTA) mel spectrograms that are later fed to the WaveRNN model to generate a clean synthesized audio file. 117 | 118 | 119 | | **Input** | String (number of words) 120 | |:------------------------| :--- 121 | | **Output Model format** | PyTorch 122 | | **Output** | Audio file 123 | 124 | ### Inference 125 | Inference is performed on the trained model using Stock PyTorch v1.13.0. 126 | 127 | #### 1. Environment Creation 128 | **Setting up the environment for Stock PyTorch**
Follow the below conda installation commands to set up the Stock PyTorch environment for the model training and prediction. 129 | ```sh 130 | conda env create -f env/stock/stock-voice.yml 131 | ``` 132 | *Activate stock conda environment* 133 | Use the following command to activate the environment that was created: 134 | ```sh 135 | conda activate stock-voice 136 | ``` 137 | 138 | >Note: Please refer to the Known Issues section at the end in case of any library issues. 139 | 140 | #### 2. Data preparation & Pre-trained models 141 | 142 | ##### 2.1 Preparing Pre-trained models 143 | > **Note**: For instructions on getting the pretrained models and setting up the repository, refer to the **pretrained_models.txt** file inside the "data" folder and follow the steps. 144 | 145 | ##### 2.2 Data preparation 146 | > The LJSpeech audio files dataset is downloaded and extracted into a folder before running the training python module. 147 | 148 | The folder structure looks as below after extraction of the dataset. 149 | ``` 150 | - data 151 | - LJSpeech-1.1 152 | - wavs 153 | - metadata 154 | - readme 155 | ``` 156 | > **Note**: For instructions on downloading the dataset, refer to the **data.txt** file inside the "data" folder. 157 | 158 | > **Now the data folder contains the below structure** 159 |
data="data/LJSpeech-1.1/{wavs/metadata/readme}" 160 | 161 | > **Note**: Please be in "WaveRNN" folder to continue benchmarking. the below step is optional, if user already followed the instructions provided above in data preperation. 162 | ``` 163 | cd WaveRNN 164 | ``` 165 | Run the preprocess module as given below to start data preprocessing using the active environment. 166 |
This module takes the following options to run the preprocessing: 167 | ``` 168 | usage: preprocess.py [-h] [--path PATH] [--extension EXT] [--num_workers N] [--hp_file FILE] 169 | 170 | Preprocessing for WaveRNN and Tacotron 171 | 172 | optional arguments: 173 | -h, --help show this help message and exit 174 | --path PATH, -p PATH directly point to dataset path (overrides hparams.wav_path 175 | --extension EXT, -e EXT 176 | file extension to search for in dataset folder 177 | --num_workers N, -w N 178 | The number of worker threads to use for preprocessing 179 | --hp_file FILE The file to use for the hyperparameters 180 | ``` 181 | **Command to do data preprocessing** 182 | ```sh 183 | python preprocess.py --path '../../data/LJSpeech-1.1' -e '.wav' -w 8 --hp_file 'hparams.py' 184 | ``` 185 | > **Note**: Preprocessed data will be stored inside the "data" folder of the cloned "WaveRNN" repository. 186 | 187 | #### 3. Training model 188 | Run the training module as given below to start training using the active environment. 189 | 190 |
This module takes the following options to run the training: 191 | ``` 192 | usage: training.py [-h] [--force_gta] [--force_cpu] [--lr LR] [--batch_size BATCH_SIZE] [--hp_file FILE] [--epochs EPOCHS] 193 | 194 | Train Tacotron TTS & WaveRNN Voc 195 | 196 | optional arguments: 197 | -h, --help show this help message and exit 198 | --force_gta, -g Force the model to create GTA features 199 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 200 | --lr LR, -l LR [float] override hparams.py learning rate 201 | --batch_size BATCH_SIZE, -b BATCH_SIZE 202 | [int] override hparams.py batch size 203 | --hp_file FILE The file to use for the hyper parameters 204 | --epochs EPOCHS, -e EPOCHS 205 | [int] number of epochs for training 206 | ``` 207 | **Command to run training** 208 | ```sh 209 | python training.py --hp_file 'hparams.py' --epochs 100 210 | ``` 211 | 212 | >Note: Training is optional as this reference kit provides pretrained models to run evaluation and inference as given below. 213 | 214 | **Expected Output**
215 | >The output trained model will be saved in `WaveRNN/pretrained/tts_weights` & `WaveRNN/pretrained/voc_weights` as `latest_weights.pyt` for Tacotron model & WaveRNN model respectively. 216 | 217 | #### 4. Evaluating the model 218 | 219 | Run the evaluation module to find out the word error rate and accuracy of the model. 220 | ``` 221 | usage: evaluation.py [-h] [--input_text INPUT_TEXT] [--batched] [--unbatched] [--force_cpu] [--hp_file FILE] [-ipx INTEL] [--save_path SAVE_PATH] 222 | [--voc_weights VOC_WEIGHTS] [--tts_weights TTS_WEIGHTS] 223 | 224 | Evaluation 225 | 226 | optional arguments: 227 | -h, --help show this help message and exit 228 | --input_text INPUT_TEXT, -i INPUT_TEXT 229 | [string] Type in something here and TTS will generate it! 230 | --batched, -b Fast Batched Generation (lower quality) 231 | --unbatched, -u Slower Unbatched Generation (better quality) 232 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 233 | --hp_file FILE The file to use for the hyper parameters 234 | -ipx INTEL, --intel INTEL 235 | use 1 for enabling intel pytorch optimizations, default is 0 236 | --save_path SAVE_PATH 237 | [string/path] where to store the speech files generated for the input text, default saved_audio folder 238 | --voc_weights VOC_WEIGHTS 239 | [string/path] Load in different WaveRNN weights 240 | --tts_weights TTS_WEIGHTS 241 | [string/path] Load in different Tacotron weights 242 | ``` 243 | 244 | **Command to run evaluation** 245 | 246 | > **Note**: Users can evaluate the models in two ways 247 | 1. Single text sentence 248 | 2. Multiple text sentences using csv file. 249 | 250 | ```sh 251 | # Evaluating on the single input string 252 | python evaluation.py --input_text "From fairest creatures we desire increase, That thereby beauty's rose" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 253 | ``` 254 | ```sh 255 | # Evaluating on multiple text sentences using csv file 256 | python evaluation.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 257 | ``` 258 | 259 | **Expected Output**
260 | >Average Word Error Rate: 36.36430090377459%, accuracy=63.635699096225416% 261 | 262 | The user can collect the logs by redirecting the output to a file as illustrated below. 263 | 264 | ```shell 265 | python evaluation.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' | tee 266 | ``` 267 | 268 | The output of the python script evaluation.py will be collected in the file 269 | 270 | #### 5. Inference 271 | *Running inference using PyTorch* 272 | 273 | ``` 274 | usage: inference.py [-h] [--input_text INPUT_TEXT] [--batched] [--unbatched] [--force_cpu] [--hp_file FILE] [-ipx INTEL] [--save_path SAVE_PATH] 275 | [--voc_weights VOC_WEIGHTS] [--tts_weights TTS_WEIGHTS] 276 | 277 | Inference 278 | 279 | optional arguments: 280 | -h, --help show this help message and exit 281 | --input_text INPUT_TEXT, -i INPUT_TEXT 282 | [string/csv file] Type in something here and TTS will generate it! 283 | --batched, -b Fast Batched Generation (lower quality) 284 | --unbatched, -u Slower Unbatched Generation (better quality) 285 | --force_cpu, -c Forces CPU-only training, even when in CUDA capable environment 286 | --hp_file FILE The file to use for the hyper parameters 287 | -ipx INTEL, --intel INTEL 288 | use 1 for enabling intel pytorch optimizations, default is 0 289 | --save_path SAVE_PATH 290 | [string/path] where to store the speech files generated for the input text, default saved_audio folder 291 | --voc_weights VOC_WEIGHTS 292 | [string/path] Load in different WaveRNN weights 293 | --tts_weights TTS_WEIGHTS 294 | [string/path] Load in different Tacotron weights 295 | ``` 296 | **Command to run inference** 297 | 298 | > **Note**: Users can inference the models in two ways 299 | 1. Single text sentence 300 | 2. Multiple text sentences using csv file. 301 | 302 | ```sh 303 | # Batch inferencing on the single input string 304 | python inference.py --input_text "From fairest creatures we desire increase, That thereby beauty's rose" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 305 | ``` 306 | ```sh 307 | # Batch inferencing on multiple text sentences using csv file 308 | python inference.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' 309 | ``` 310 | 311 | **Expected Output**
312 | >Total time for inference is 10.439789295196533 313 | 314 | >Generated audio samples can be found in "saved_audio" folder by default. 315 | 316 | The user can collect the logs by redirecting the output to a file as illustrated below. 317 | 318 | ```shell 319 | python inference.py --input_text "../../data/product_description.csv" -b -ipx 0 --voc_weights 'pretrained/voc_weights/latest_weights.pyt' --tts_weights 'pretrained/tts_weights/latest_weights.pyt' | tee 320 | ``` 321 | 322 | The output of the python script inference.py will be collected in the file 323 | 324 | ## Optimizing the End To End solution with Intel® oneAPI components 325 | 326 | #Coming Soon... 327 | 328 | This reference solution can be optimized with Intel® oneAPI components to achieve a performance boost, This section will be added soon. 329 | 330 | ## Conclusion 331 | To build a synthetic voice data generator model for audio synthesis using the Deep-learning approach, machine learning engineers will need to train models with a large dataset and run inference more frequently. 332 | 333 | ### Notices & Disclaimers 334 | Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/). 335 | Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. 336 | Your costs and results may vary. 337 | Intel technologies may require enabled hardware, software or service activation. 338 | © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. 339 | 340 | To the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license. 341 | 342 | Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content. 343 | 344 | ## Appendix 345 | 346 | ### **Running on Windows** 347 | 348 | The reference kits commands are linux based, in order to run this on Windows, goto Start and open WSL and follow the same steps as running on a linux machine starting from git clone instructions. If WSL is not installed you can [install WSL](https://learn.microsoft.com/en-us/windows/wsl/install). 349 | 350 | > **Note** If WSL is installed and not opening, goto Start ---> Turn Windows feature on or off and make sure Windows Subsystem for Linux is checked. Restart the system after enabling it for the changes to reflect. 
351 | 352 | 353 | ### **Experiment Setup** 354 | - Testing performed on: March 2023 355 | - Testing performed by: Intel Corporation 356 | - Configuration Details: Azure Standard_D8_V5 (Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz), 1 Socket, 4 Cores per Socket, 2 Threads per Core, Turbo:On, Total Memory: 32 GB, OS: Ubuntu 20.04, Kernel: Linux 5.13.0-1031-azure , Software: Intel® Extension for PyTorch* v1.13.0, Intel® Neural Compressor v1.14.2 357 | 358 | | Platform | Ubuntu 20.04 359 | | :--- | :--- 360 | | Hardware | Azure Standard_D8_V5 (Icelake) 361 | | Software | Intel® Extension for PyTorch*, Intel® Neural Compressor. 362 | | What you will learn | Advantage of using components in Intel® oneAPI AI Analytics Toolkit over the stock version for the computer vision-based model build and inferencing. 363 | 364 | ### Known Issues 365 | 366 | 1. Common prerequisites required to run python scripts in linux system. 367 | Install gcc and curl. For Ubuntu, this will be: 368 | 369 | ```bash 370 | apt install gcc 371 | sudo apt install libglib2.0-0 372 | sudo apt install curl 373 | ``` 374 | 375 | 2. ImportError: libGL.so.1: cannot open shared object file: No such file or directory 376 | 377 | **Issue:** 378 | ``` 379 | ImportError: libGL.so.1: cannot open shared object file: No such file or directory 380 | or 381 | libgthread-2.0.so.0: cannot open shared object file: No such file or directory 382 | ``` 383 | 384 | **Solution:** 385 | 386 | Install the libgl11-mesa-glx and libglib2.0-0 libraries. For Ubuntu this will be: 387 | 388 | ```bash 389 | sudo apt install libgl1-mesa-glx 390 | sudo apt install libglib2.0-0 391 | ``` 392 | 3. OSError: cannot load library 'libsndfile.so' 393 | 394 | **Issue:** 395 | ``` 396 | OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory 397 | ``` 398 | 399 | **Solution:** 400 | 401 | Install the libsndfile1-dev library. For Ubuntu this will be: 402 | 403 | ```bash 404 | sudo apt-get install libsndfile1-dev 405 | ``` 406 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Report a Vulnerability 4 | 5 | Please report security issues or vulnerabilities to the [Intel® Security Center]. 6 | 7 | For more information on how Intel® works to resolve security issues, see 8 | [Vulnerability Handling Guidelines]. 
9 | 10 | [Intel® Security Center]:https://www.intel.com/content/www/us/en/security-center/default.html 11 | 12 | [Vulnerability Handling Guidelines]:https://www.intel.com/content/www/us/en/security-center/vulnerability-handling-guidelines.html 13 | -------------------------------------------------------------------------------- /assets/E2E_intel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/voice-data-generation/012e833f68dc53d321b2dbcd454f7a4de7cc4b80/assets/E2E_intel.png -------------------------------------------------------------------------------- /assets/E2E_stock.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oneapi-src/voice-data-generation/012e833f68dc53d321b2dbcd454f7a4de7cc4b80/assets/E2E_stock.png -------------------------------------------------------------------------------- /data/data.txt: -------------------------------------------------------------------------------- 1 | *make sure of the directory you are in is WaveRNN* 2 | 3 | cd ../../data 4 | 5 | 6 | *now to dowload the data set in tar format from LJSpeech* 7 | 8 | wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 9 | 10 | 11 | *Extract the LJSpeech-1.1.tar.bz2 file that was downloaded to have readMe.txt, metadata.csv & wav Folder which contains the audio files * 12 | 13 | tar -xf LJSpeech-1.1.tar.bz2 14 | 15 | 16 | *remove the tar file * 17 | 18 | rm LJSpeech-1.1.tar.bz2 19 | 20 | 21 | *Change back to the WaveRNN directory* 22 | 23 | cd ../src/WaveRNN 24 | 25 | 26 | *** Start preprocessing of the data by running preprocess.py, as mentioned in ReadMe.md*** -------------------------------------------------------------------------------- /data/pretrained_models.txt: -------------------------------------------------------------------------------- 1 | cd src 2 | git clone https://github.com/fatchord/WaveRNN 3 | 4 | cp ./*.py WaveRNN 5 | cp ./deploy.yaml WaveRNN 6 | 7 | mv WaveRNN/optim_eval.py WaveRNN/utils 8 | 9 | cd WaveRNN 10 | 11 | unzip pretrained/ljspeech.tacotron.r2.180k.zip -d pretrained/tts_weights 12 | unzip pretrained/ljspeech.wavernn.mol.800k.zip -d pretrained/voc_weights 13 | 14 | 15 | *** follow the data_download_manual.txt *** 16 | 17 | 18 | 19 | -------------------------------------------------------------------------------- /data/product_description.csv: -------------------------------------------------------------------------------- 1 | Product,Description 2 | Intel Xeon Scalable,3rd Gen Intel Xeon Scalable processors offer a balanced architecture that delivers built-in AI acceleration and advanced security capabilities. This allows you to place your workloads where they perform best - from edge to cloud. 3 | Intel Xeon Scalable,"The 3rd Gen Intel Xeon Scalable processor benefits from decades of innovation for the most common workload requirements. Supported by close partnerships with the world’s software leaders and solution providers. 3rd Gen Intel Xeon Scalable processors are optimized for many workload types and performance levels, all with the consistent, open, Intel architecture you know and trust." 4 | Intel Core i9 Processors,"These processors feature a performance hybrid architecture designed for intelligent performance, optimized creating, and enhanced tuning to allow gamers to game with up to 5.8 GHz clock speed." 
5 | Intel Mobile Chipsets,"These powerful and feature-rich chipsets, are purpose built for portable, mobile, and 2 in 1 devices. Users can watch UHD videos with crisp imagery, view and edit photos in perfect detail, and smoothly play today’s modern games." 6 | Intel Desktop Chipsets,"Mainstream chipsets run popular applications, support UHD video, audio, and image editing, and run today’s modern games without lag. Performance chipsets deliver superior audio and digital video, and ultimate power for content creation, advanced applications." 7 | Intel Pentium Processors,"Discover an amazing balance of performance, experience, and value with systems powered by Intel Pentium processors." 8 | Intel Pentium Processors,"These processors power more devices, from notebooks to convertibles to desktops and mini PCs—Supports Windows, Chrome and Linux OS—giving you flexibility to choose the best device for your needs, while knowing it will give you the performance, experiences, and security features you deserve" 9 | Intel Pentium Gold Processors,"Intel Pentium Gold processors provide great value and performance to do daily activities plus power to do light photo editing, video editing, and multitasking." 10 | Intel Pentium Gold Processors,Computers with Intel Pentium Gold processors provide quick processing and vivid graphics. Choose from a range of form factors. 11 | IoT and Embedded Processors,"Deploy edge applications quickly with Intel's portfolio of edge-ready compute and connectivity technologies. Enhanced for IoT, they enable processing at the edge to get critical insights and business value from your data with compute resources where you need them most." 12 | -------------------------------------------------------------------------------- /env/intel/intel-voice.yml: -------------------------------------------------------------------------------- 1 | name: intel-voice 2 | channels: 3 | - pytorch 4 | - defaults 5 | dependencies: 6 | - cpuonly=2.0 7 | - pip=22.2.2 8 | - python=3.8 9 | - pytorch=1.13.0 10 | - torchvision=0.14.0 11 | - pip: 12 | - numba==0.48.0 13 | - librosa==0.6.3 14 | - numpy==1.22.0 15 | - intel-extension-for-pytorch==1.13.0 16 | - neural-compressor==1.14.2 17 | - matplotlib==3.6.2 18 | - unidecode==1.3.6 19 | - inflect==6.0.2 20 | - SpeechRecognition==3.9.0 21 | - soundfile==0.11.0 22 | 23 | 24 | -------------------------------------------------------------------------------- /env/stock/stock-voice.yml: -------------------------------------------------------------------------------- 1 | name: stock-voice 2 | channels: 3 | - pytorch 4 | - defaults 5 | dependencies: 6 | - cpuonly=2.0 7 | - pip=22.2.2 8 | - python=3.8 9 | - pytorch=1.13.0 10 | - torchvision=0.14.0 11 | - pip: 12 | - numba==0.48.0 13 | - librosa==0.6.3 14 | - numpy==1.22.0 15 | - matplotlib==3.6.2 16 | - unidecode==1.3.6 17 | - inflect==6.0.2 18 | - SpeechRecognition==3.9.0 19 | - soundfile==0.11.0 20 | 21 | 22 | -------------------------------------------------------------------------------- /src/evaluation.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | Evaluation code 5 | below code is adopted from https://github.com/fatchord/WaveRNN 6 | """ 7 | # pylint: disable=C0103,C0301,C0413,E0401,R0914,R0915 8 | 9 | # !pip install SpeechRecognition pydub 10 | 11 | import argparse 12 | import os 13 | import torch 14 | import speech_recognition as sr 15 | import soundfile 16 | from 
utils.optim_eval import word_error_rate, ipex_optimization, load_model, run_inference 17 | from utils import hparams as hp 18 | from utils.text import text_to_sequence 19 | from utils.display import simple_table 20 | 21 | 22 | def process_evaluation_epoch(global_vars: dict): 23 | """ 24 | Processes results from each worker at the end of evaluation and combine to final result 25 | Args: 26 | global_vars: dictionary containing information of entire evaluation 27 | Return: 28 | wer: final word error rate 29 | loss: final loss 30 | """ 31 | hypotheses = global_vars['predictions'] 32 | references = global_vars['transcripts'] 33 | 34 | # wer, scores, num_words 35 | wer, _, _ = word_error_rate( 36 | hypotheses=hypotheses, references=references) 37 | return wer 38 | 39 | 40 | def main(): 41 | """ 42 | Main function 43 | """ 44 | # Parse Arguments 45 | parser = argparse.ArgumentParser(description='Evaluation') 46 | parser.add_argument('--input_text', '-i', type=str, default=None, 47 | help='[string] Type in something here and TTS will generate it!') 48 | parser.add_argument('--batched', '-b', dest='batched', action='store_true', default=False, 49 | help='Fast Batched Generation (lower quality)') 50 | parser.add_argument('--unbatched', '-u', dest='batched', action='store_false', 51 | help='Slower Unbatched Generation (better quality)') 52 | parser.add_argument('--force_cpu', '-c', action='store_true', 53 | help='Forces CPU-only training, even when in CUDA capable environment') 54 | parser.add_argument('--hp_file', metavar='FILE', default='hparams.py', 55 | help='The file to use for the hyper parameters') 56 | parser.add_argument('-ipx', '--intel', type=int, required=False, default=0, 57 | help='use 1 for enabling intel pytorch optimizations, default is 0') 58 | parser.add_argument('--save_path', type=str, default='saved_audio/evaluation', 59 | help='[string/path] where to store the speech files generated for the input text, ' 60 | 'default saved_audio folder') 61 | parser.add_argument('--voc_weights', type=str, default='pretrained/voc_weights/latest_weights.pyt', 62 | help='[string/path] Load in different WaveRNN weights') 63 | parser.add_argument('--tts_weights', type=str, default='pretrained/tts_weights/latest_weights.pyt', 64 | help='[string/path] Load in different Tacotron weights') 65 | args = parser.parse_args() 66 | 67 | hp.configure(args.hp_file) # Load hparams from file 68 | 69 | # parser.set_defaults(batched=False) 70 | parser.set_defaults(input_text=None) 71 | 72 | batched = args.batched 73 | input_text = args.input_text 74 | intel_flag = args.intel 75 | tts_weights = args.tts_weights 76 | voc_weights = args.voc_weights 77 | save_path = args.save_path 78 | # creating the save path directory to store the output generated audio file if it does not exist 79 | os.makedirs(save_path, exist_ok=True) 80 | 81 | if not args.force_cpu and torch.cuda.is_available(): 82 | device = torch.device('cuda') 83 | else: 84 | device = torch.device('cpu') 85 | print('Using device:', device) 86 | 87 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 88 | 89 | inputs = [text_to_sequence(input_text.strip(), hp.tts_cleaner_names)] 90 | inp_txt = input_text 91 | else: 92 | with open(input_text) as f: 93 | inputs = [] 94 | inp_txt = [] 95 | cnt = 0 96 | for line in f: 97 | split = line.split(',') 98 | sentence = split[-1][:-1] 99 | # adding "hi " here because the speech to text conversion sometimes misses the 1st word 100 | sentence = "hi " + sentence 101 | if cnt > 0: 102 | 
inp_txt.append(sentence) 103 | inputs.append(text_to_sequence(sentence.strip(), hp.tts_cleaner_names)) 104 | cnt += 1 105 | 106 | voc_model = load_model(voc_weights, 'w', device) 107 | tts_model = load_model(tts_weights, 't', device) 108 | 109 | tts_model, voc_model = ipex_optimization(tts_model, voc_model, intel_flag) 110 | 111 | voc_k = voc_model.get_step() // 1000 112 | tts_k = tts_model.get_step() // 1000 113 | 114 | r = tts_model.r 115 | 116 | simple_table([('WaveRNN', str(voc_k) + 'k'), 117 | (f'Tacotron(r={r})', str(tts_k) + 'k'), 118 | ('Generation Mode', 'Batched' if batched else 'Unbatched'), 119 | ('Target Samples', 11_000 if batched else 'N/A'), 120 | ('Overlap Samples', 550 if batched else 'N/A')]) 121 | 122 | wer = 0. 123 | itr = 1 124 | for i, x in enumerate(inputs, 1): 125 | 126 | print(f'\n\nGenerating speech for the input passed line {i}...\n') 127 | 128 | _, m, _ = tts_model.generate(x) 129 | 130 | if batched: 131 | sav_path = f'{save_path}/__input_batched{str(batched)}_{tts_k}k_{len(inputs)}_{i}.wav' 132 | 133 | else: 134 | sav_path = f'{save_path}/__input_{"un" + str(batched)}__{tts_k}k_{len(inputs)}_{i}.wav' 135 | 136 | m = torch.tensor(m).unsqueeze(0) 137 | m = (m + 4) / 8 138 | voc_model.generate(m, sav_path, batched, hp.voc_target, hp.voc_overlap, hp.mu_law) 139 | 140 | data, samplerate = soundfile.read(sav_path) 141 | soundfile.write('new.wav', data, samplerate, subtype='PCM_16') 142 | filename = 'new.wav' 143 | 144 | # initialize the recognizer 145 | r = sr.Recognizer() 146 | with sr.AudioFile(filename) as source: 147 | # listen for the data (load audio to memory) 148 | audio_data = r.record(source) 149 | # recognize (convert from speech to text) 150 | text_pre = r.recognize_google(audio_data) 151 | text_pre = "".join(letter for letter in text_pre if letter.isalnum() or letter == " ") 152 | # making the input text case-insensitive 153 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 154 | 155 | text_gt = "".join(letter for letter in inp_txt if letter.isalnum() or letter == " ") 156 | 157 | else: 158 | text_gt = ''.join(letter for letter in inp_txt[i - 1] if letter.isalnum() or letter == " ") 159 | # dropping "hi " here because we added in speech gen 160 | text_gt = text_gt[2:] 161 | 162 | text_pre = text_pre.lower().strip() 163 | text_gt = text_gt.lower().strip() 164 | 165 | if len(text_gt) != len(text_pre): 166 | if len(text_gt) > len(text_pre): 167 | for _ in range(len(text_gt) - len(text_pre)): 168 | text_pre += " " 169 | else: 170 | text_pre = text_pre[:len(text_gt)] 171 | 172 | references = [text_gt] 173 | hypotheses = [text_pre] 174 | 175 | d = dict(predictions=hypotheses, 176 | transcripts=references) 177 | wer += process_evaluation_epoch(d) 178 | itr += 1 179 | print("Input sentence passed::\n", text_gt) 180 | print("Predicted sentence of the model::\n", text_pre) 181 | 182 | wer /= itr - 1 183 | print("Number of sentences:", itr - 1) 184 | print("\nAverage Word Error Rate: {:}%, accuracy={:}%".format(wer * 100, (1 - wer) * 100), "\n") 185 | 186 | 187 | if __name__ == '__main__': 188 | main() 189 | -------------------------------------------------------------------------------- /src/inference.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | Inference code for trained models 5 | below code is adopted from https://github.com/fatchord/WaveRNN 6 | """ 7 | # pylint: 
disable=C0103,C0301,C0413,E0401,E1101,C0415, 8 | 9 | import time 10 | import argparse 11 | import os 12 | import sys 13 | import torch 14 | 15 | from utils.optim_eval import ipex_optimization, load_model, run_inference 16 | from utils import hparams as hp 17 | from utils.text import text_to_sequence 18 | from utils.display import simple_table 19 | from utils.files import get_files 20 | 21 | 22 | def main(): 23 | """ 24 | Main Function 25 | """ 26 | # Parse Arguments 27 | parser = argparse.ArgumentParser(description='Inference') 28 | parser.add_argument('--input_text', '-i', type=str, default=None, 29 | help='[string/csv file] Type in something here and TTS will generate it!') 30 | parser.add_argument('--batched', '-b', dest='batched', action='store_true', default=False, 31 | help='Fast Batched Generation (lower quality)') 32 | parser.add_argument('--unbatched', '-u', dest='batched', action='store_false', 33 | help='Slower Unbatched Generation (better quality)') 34 | parser.add_argument('--force_cpu', '-c', action='store_true', 35 | help='Forces CPU-only training, even when in CUDA capable environment') 36 | parser.add_argument('--hp_file', metavar='FILE', default='hparams.py', 37 | help='The file to use for the hyper parameters') 38 | parser.add_argument('-ipx', '--intel', type=int, required=False, default=0, 39 | help='use 1 for enabling intel pytorch optimizations, default is 0') 40 | parser.add_argument('--save_path', type=str, default='saved_audio', 41 | help='[string/path] where to store the speech files generated for the input text, ' 42 | 'default saved_audio folder') 43 | parser.add_argument('--voc_weights', type=str, default='pretrained/voc_weights/latest_weights.pyt', 44 | help='[string/path] Load in different WaveRNN weights') 45 | parser.add_argument('--tts_weights', type=str, default='pretrained/tts_weights/latest_weights.pyt', 46 | help='[string/path] Load in different Tacotron weights') 47 | args = parser.parse_args() 48 | 49 | hp.configure(args.hp_file) # Load hparams from file 50 | 51 | parser.set_defaults(input_text=None) 52 | 53 | batched = args.batched 54 | input_text = args.input_text 55 | intel_flag = args.intel 56 | tts_weights = args.tts_weights 57 | voc_weights = args.voc_weights 58 | save_path = args.save_path 59 | 60 | # creating the save path directory to store the output generated audio file if it does not exist 61 | os.makedirs(save_path, exist_ok=True) 62 | 63 | if not (input_text.endswith(".txt") or input_text.endswith(".csv")): 64 | inputs = [text_to_sequence(input_text.strip(), hp.tts_cleaner_names)] 65 | else: 66 | with open(input_text) as f: 67 | 68 | inputs = [] 69 | cnt = 0 70 | for line in f: 71 | split = line.split(',') 72 | sentence = split[-1][:-1].strip() 73 | if cnt > 0: 74 | inputs.append(text_to_sequence(sentence.strip(), hp.tts_cleaner_names)) 75 | cnt += 1 76 | 77 | if not args.force_cpu and torch.cuda.is_available(): 78 | device = torch.device('cuda') 79 | else: 80 | device = torch.device('cpu') 81 | print('Using device:', device) 82 | 83 | voc_model = load_model(voc_weights, 'w', device) 84 | tts_model = load_model(tts_weights, 't', device) 85 | 86 | tts_model, voc_model = ipex_optimization(tts_model, voc_model, intel_flag) 87 | 88 | voc_k = voc_model.get_step() // 1000 89 | tts_k = tts_model.get_step() // 1000 90 | 91 | r = tts_model.r 92 | 93 | simple_table([('WaveRNN', str(voc_k) + 'k'), 94 | (f'Tacotron(r={r})', str(tts_k) + 'k'), 95 | ('Generation Mode', 'Batched' if batched else 'Unbatched'), 96 | ('Target Samples', 11_000 if batched 
else 'N/A'), 97 | ('Overlap Samples', 550 if batched else 'N/A')]) 98 | 99 | print("\nWarming Up the models for inference.....") 100 | for _ in range(0, 10): 101 | wr_up = True 102 | warm_up_ip = "This is an input used for warming up our models" 103 | wr_up_ip = [text_to_sequence(warm_up_ip.strip(), hp.tts_cleaner_names)] 104 | run_inference(wr_up_ip, tts_model, voc_model, batched, wr_up, save_path) 105 | 106 | print('\nFinished warmup.\nGenerating speech for the input passed...\n') 107 | 108 | wr_up = False 109 | run_inference(inputs, tts_model, voc_model, batched, wr_up, save_path) 110 | 111 | print('\n\nDone.\n') 112 | 113 | 114 | if __name__ == "__main__": 115 | main() 116 | -------------------------------------------------------------------------------- /src/optim_eval.py: -------------------------------------------------------------------------------- 1 | # Copyright (C) 2023 Intel Corporation 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """ 4 | below code is adopted from https://github.com/fatchord/WaveRNN 5 | """ 6 | 7 | import time 8 | from typing import List 9 | from pathlib import Path 10 | import torch 11 | from utils.text.symbols import symbols 12 | from models.fatchord_version import WaveRNN 13 | from models.tacotron import Tacotron 14 | from utils import hparams as hp 15 | 16 | 17 | def ipex_optimization(t_model, v_model, i_flag): 18 | """ 19 | 20 | Args: 21 | t_model: Tacotron model 22 | v_model: WaveRNN model 23 | i_flag: Intel flag to enable ipex 24 | 25 | Returns: optimized models if intel ipex enabled 26 | 27 | """ 28 | t_model.eval() 29 | v_model.eval() 30 | if i_flag: 31 | import intel_extension_for_pytorch as ipex 32 | t_model = ipex.optimize(t_model) 33 | v_model = ipex.optimize(v_model) 34 | print('\nINTEL IPEX Optimizations Enabled.\n') 35 | return t_model, v_model 36 | 37 | 38 | def load_model(model_weights, typ, dv, return_model=False): 39 | """ 40 | 41 | Args: 42 | return_model: Returns without weights 43 | model_weights: weights to load the model 44 | typ: model type (Tacotron / WaveRNN) 45 | dv: device 46 | 47 | Returns: loaded model 48 | 49 | """ 50 | 51 | if typ == 'w': 52 | 53 | print('\nInitialising WaveRNN Model...\n') 54 | 55 | # Instantiate WaveRNN Model 56 | model = WaveRNN(rnn_dims=hp.voc_rnn_dims, 57 | fc_dims=hp.voc_fc_dims, 58 | bits=hp.bits, 59 | pad=hp.voc_pad, 60 | upsample_factors=hp.voc_upsample_factors, 61 | feat_dims=hp.num_mels, 62 | compute_dims=hp.voc_compute_dims, 63 | res_out_dims=hp.voc_res_out_dims, 64 | res_blocks=hp.voc_res_blocks, 65 | hop_length=hp.hop_length, 66 | sample_rate=hp.sample_rate, 67 | mode='MOL').to(dv) 68 | 69 | if not return_model: 70 | model.load(model_weights) 71 | 72 | else: 73 | 74 | print('\nInitialising Tacotron Model...\n') 75 | 76 | # Instantiate Tacotron Model 77 | model = Tacotron(embed_dims=hp.tts_embed_dims, 78 | num_chars=len(symbols), 79 | encoder_dims=hp.tts_encoder_dims, 80 | decoder_dims=hp.tts_decoder_dims, 81 | n_mels=hp.num_mels, 82 | fft_bins=hp.num_mels, 83 | postnet_dims=hp.tts_postnet_dims, 84 | encoder_K=hp.tts_encoder_K, 85 | lstm_dims=hp.tts_lstm_dims, 86 | postnet_K=hp.tts_postnet_K, 87 | num_highways=hp.tts_num_highways, 88 | dropout=hp.tts_dropout, 89 | stop_threshold=hp.tts_stop_threshold).to(dv) 90 | 91 | if not return_model: 92 | model.load(model_weights) 93 | 94 | return model 95 | 96 | 97 | def run_inference(text, taco_model, wave_model, batch, wrm_up, sv_path): 98 | """ 99 | Performs the inference by taking input text and then give speech 100 | """ 101 | 102 | sav_path = None 
103 | tts = taco_model.get_step() // 1000 104 | inference_time = time.time() 105 | voc_time = 0 106 | tac_time = 0 107 | for i, x in enumerate(text, 1): 108 | 109 | if not wrm_up: 110 | 111 | tac_time = time.time() 112 | 113 | _, m, _ = taco_model.generate(x) 114 | 115 | if not wrm_up: 116 | print("\nTime taken by Tacotron model for inference is ", (time.time() - tac_time)) 117 | voc_time = time.time() 118 | 119 | if batch: 120 | sav_path = f'{sv_path}/__input_batched{str(batch)}_{tts}k_{len(text)}_{i}.wav' 121 | 122 | else: 123 | sav_path = f'{sv_path}/__input_{"un" + str(batch)}__{tts}k_{len(text)}_{i}.wav' 124 | 125 | m = torch.tensor(m).unsqueeze(0) 126 | m = (m + 4) / 8 127 | # import pdb; pdb.set_trace() 128 | wave_model.generate(m, sav_path, batch, hp.voc_target, hp.voc_overlap, hp.mu_law) 129 | if not wrm_up: 130 | print("\nTime taken by WaveRNN model for inference is ", (time.time() - voc_time)) 131 | 132 | if not wrm_up: 133 | print("\nTotal time for inference is ", (time.time() - inference_time)) 134 | 135 | return sav_path 136 | 137 | 138 | def levenshtein(a: List, b: List) -> int: 139 | """Calculates the Levenshtein distance between a and b. 140 | """ 141 | n, m = len(a), len(b) 142 | if n > m: 143 | # Make sure n <= m, to use O(min(n,m)) space 144 | a, b = b, a 145 | n, m = m, n 146 | 147 | current = list(range(n + 1)) 148 | for i in range(1, m + 1): 149 | previous, current = current, [i] + [0] * n 150 | for j in range(1, n + 1): 151 | add, delete = previous[j] + 1, current[j - 1] + 1 152 | change = previous[j - 1] 153 | if a[j - 1] != b[i - 1]: 154 | change = change + 1 155 | current[j] = min(add, delete, change) 156 | 157 | return current[n] 158 | 159 | 160 | def word_error_rate(hypotheses: List[str], references: List[str]): 161 | """ 162 | Computes Average Word Error rate between two texts represented as 163 | corresponding lists of string. Hypotheses and references must have same length. 164 | Args: 165 | hypotheses: list of hypotheses / predictions 166 | references: list of references / ground truth 167 | """ 168 | scores = 0 169 | words = 0 170 | if len(hypotheses) != len(references): 171 | raise ValueError("In word error rate calculation, hypotheses and reference" 172 | " lists must have the same number of elements. 
174 |                          " {0} and {1} respectively.".format(len(hypotheses), len(references)))
175 |     for h, r in zip(hypotheses, references):
176 |         h_list = h.split()
177 |         r_list = r.split()
178 |         words += len(r_list)
179 |         scores += levenshtein(h_list, r_list)
180 |     if words != 0:
181 |         wer = (1.0 * scores) / words
182 |     else:
183 |         wer = float('inf')
184 |     return wer, scores, words
185 | 
186 | 
187 | def gen_testset(model: WaveRNN, test_set, samples, batched, target, overlap, save_path: Path):
188 |     k = model.get_step() // 1000
189 |     # Generate a few test-set utterances with the vocoder so training progress can be checked by ear
190 |     for i, (m, x) in enumerate(test_set, 1):
191 | 
192 |         if i > samples:
193 |             break
194 | 
195 |         print('\n| Generating: %i/%i' % (i, samples))
196 | 
197 |         x = x[0].numpy()
198 | 
199 |         bits = 16 if hp.voc_mode == 'MOL' else hp.bits
200 | 
201 |         if hp.mu_law and hp.voc_mode != 'MOL':
202 |             x = decode_mu_law(x, 2 ** bits, from_labels=True)
203 |         else:
204 |             x = label_2_float(x, bits)
205 | 
206 |         save_wav(x, save_path / f'{k}k_steps_{i}_target.wav')
207 | 
208 |         batch_str = f'gen_batched_target{target}_overlap{overlap}' if batched else 'gen_NOT_BATCHED'
209 |         save_str = str(save_path / f'{k}k_steps_{i}_{batch_str}.wav')
210 | 
211 |         _ = model.generate(m, save_str, batched, target, overlap, hp.mu_law)
212 | 
--------------------------------------------------------------------------------
/src/training.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2023 Intel Corporation
2 | # SPDX-License-Identifier: BSD-3-Clause
3 | """
4 | Train the Tacotron (TTS) and WaveRNN (vocoder) models
5 | The code below is adapted from https://github.com/fatchord/WaveRNN
6 | """
7 | # pylint: disable=C0103,C0301,C0413,E0401,W0614,C0412,W0401,W0613,R0914,R0913,R0915
8 | 
9 | import time
10 | import argparse
11 | from pathlib import Path
12 | import numpy as np
13 | import torch
14 | from torch import optim
15 | import torch.nn.functional as F
16 | 
17 | from utils import hparams as hp
18 | from utils.display import *
19 | from utils.dataset import get_tts_datasets, get_vocoder_datasets
20 | from utils.text.symbols import symbols
21 | from utils.distribution import discretized_mix_logistic_loss
22 | from utils.paths import Paths
23 | from models.tacotron import Tacotron
24 | from models.fatchord_version import WaveRNN
25 | from utils import data_parallel_workaround
26 | from utils.optim_eval import gen_testset
27 | from utils.checkpoints import save_checkpoint, restore_checkpoint
28 | 
29 | 
30 | def np_now(x: torch.Tensor):
31 |     """
32 |     Convert a torch Tensor to a numpy array
33 |     """
34 |     return x.detach().cpu().numpy()
35 | 
36 | 
37 | def main():
38 |     """
39 |     Main method
40 |     """
41 |     # Parse Arguments
42 |     parser = argparse.ArgumentParser(description='Train Tacotron TTS & WaveRNN Voc')
43 |     parser.add_argument('--force_gta', '-g', action='store_true', help='Force the model to create GTA features')
44 |     parser.add_argument('--force_cpu', '-c', action='store_true', help='Forces CPU-only training, even when in a CUDA-'
45 |                                                                        'capable environment')
46 |     parser.add_argument('--lr', '-l', type=float, help='[float] override hparams.py learning rate')
47 |     parser.add_argument('--batch_size', '-b', type=int, help='[int] override hparams.py batch size')
48 |     parser.add_argument('--hp_file', metavar='FILE', default='src/utils/hparams.py',
49 |                         help='The file to use for the hyper parameters')
50 |     parser.add_argument('--epochs', '-e', type=int, default=None,
51 |                         help='[int] number of epochs for training')
52 | 
53 | 
54 |     args = parser.parse_args()
55 | 
56 |     hp.configure(args.hp_file)  # 
Load hparams from file 57 | paths = Paths(hp.data_path, hp.voc_model_id, hp.tts_model_id) 58 | 59 | if args.lr is None: 60 | args.lr = hp.voc_lr 61 | if args.batch_size is None: 62 | args.batch_size = hp.voc_batch_size 63 | 64 | batch_size = args.batch_size 65 | lr = args.lr 66 | train_gta = args.force_gta 67 | epochs = args.epochs 68 | 69 | if not args.force_cpu and torch.cuda.is_available(): 70 | device = torch.device('cuda') 71 | for session in hp.tts_schedule: 72 | _, _, _, batch_size = session 73 | if batch_size % torch.cuda.device_count() != 0: 74 | raise ValueError('`batch_size` must be evenly divisible by n_gpus!') 75 | else: 76 | device = torch.device('cpu') 77 | print('Using device:', device) 78 | 79 | # Instantiate Tacotron Model 80 | print('\nInitialising Tacotron Model...\n') 81 | model = Tacotron(embed_dims=hp.tts_embed_dims, 82 | num_chars=len(symbols), 83 | encoder_dims=hp.tts_encoder_dims, 84 | decoder_dims=hp.tts_decoder_dims, 85 | n_mels=hp.num_mels, 86 | fft_bins=hp.num_mels, 87 | postnet_dims=hp.tts_postnet_dims, 88 | encoder_K=hp.tts_encoder_K, 89 | lstm_dims=hp.tts_lstm_dims, 90 | postnet_K=hp.tts_postnet_K, 91 | num_highways=hp.tts_num_highways, 92 | dropout=hp.tts_dropout, 93 | stop_threshold=hp.tts_stop_threshold).to(device) 94 | 95 | optimizer = optim.Adam(model.parameters()) 96 | 97 | restore_checkpoint('tts', paths, model, optimizer, create_if_missing=True) 98 | 99 | start = time.time() 100 | for _, session in enumerate(hp.tts_schedule): 101 | current_step = model.get_step() 102 | 103 | r, lr, max_step, batch_size = session 104 | 105 | training_steps = max_step - current_step 106 | model.r = r 107 | simple_table([(f'Steps with r={r}', str(training_steps // 1000) + 'k Steps'), 108 | ('Batch Size', batch_size), 109 | ('Learning Rate', lr), 110 | ('Outputs/Step (r)', model.r)]) 111 | 112 | train_set, attn_example = get_tts_datasets(paths.data, batch_size, r) 113 | 114 | tts_train_loop(paths, model, optimizer, train_set, lr, training_steps, attn_example, epochs) 115 | 116 | print("Total training time is ", (time.time() - start)) 117 | print('Training Tacotron model is Completed.') 118 | 119 | 120 | print('Creating Ground Truth Aligned Dataset...\n') 121 | 122 | train_set, attn_example = get_tts_datasets(paths.data, 32, model.r) 123 | create_gta_features(model, train_set, paths.gta) 124 | 125 | print('\n\nWe can now train WaveRNN on GTA features\n') 126 | 127 | print('\nInitialising WaveRNN Model...\n') 128 | 129 | # Instantiate WaveRNN Model 130 | voc_model = WaveRNN(rnn_dims=hp.voc_rnn_dims, 131 | fc_dims=hp.voc_fc_dims, 132 | bits=hp.bits, 133 | pad=hp.voc_pad, 134 | upsample_factors=hp.voc_upsample_factors, 135 | feat_dims=hp.num_mels, 136 | compute_dims=hp.voc_compute_dims, 137 | res_out_dims=hp.voc_res_out_dims, 138 | res_blocks=hp.voc_res_blocks, 139 | hop_length=hp.hop_length, 140 | sample_rate=hp.sample_rate, 141 | mode=hp.voc_mode).to(device) 142 | 143 | # Check to make sure the hop length is correctly factorised 144 | assert np.cumprod(hp.voc_upsample_factors)[-1] == hp.hop_length 145 | 146 | optimizer = optim.Adam(voc_model.parameters()) 147 | 148 | restore_checkpoint('voc', paths, voc_model, optimizer, create_if_missing=True) 149 | 150 | train_set, test_set = get_vocoder_datasets(paths.data, batch_size, train_gta) 151 | 152 | total_steps = hp.voc_total_steps 153 | 154 | simple_table([('Remaining', str((total_steps - voc_model.get_step()) // 1000) + 'k Steps'), 155 | ('Batch Size', batch_size), 156 | ('LR', lr), 157 | ('Sequence Len', hp.voc_seq_len), 
158 |                   ('GTA Train', train_gta)])
159 | 
160 |     loss_func = F.cross_entropy if voc_model.mode == 'RAW' else discretized_mix_logistic_loss
161 | 
162 |     start = time.time()
163 |     voc_train_loop(paths, voc_model, loss_func, optimizer, train_set, test_set, lr, total_steps, epochs)
164 | 
165 |     print("Total training time to train the WaveRNN model is ", (time.time() - start))
166 |     print('\nTraining completed for both the Tacotron and WaveRNN models.')
167 | 
168 | 
169 | def tts_train_loop(paths: Paths, model: Tacotron, optimizer, train_set, lr, train_steps, attn_example, eps=None):
170 |     """
171 |     Training loop for the Tacotron model
172 |     """
173 |     device = next(model.parameters()).device  # use same device as model parameters
174 | 
175 |     for g in optimizer.param_groups:
176 |         g['lr'] = lr
177 | 
178 |     total_iters = len(train_set)
179 | 
180 |     epochs = eps if eps else train_steps // total_iters + 1
181 | 
182 | 
183 |     msg = None
184 |     start = time.time()
185 |     for e in range(1, epochs + 1):
186 | 
187 |         running_loss = 0
188 |         strt = time.time()
189 |         # One training iteration per batch from the training set
190 |         for i, (x, m, ids, _) in enumerate(train_set, 1):
191 | 
192 |             x, m = x.to(device), m.to(device)
193 | 
194 |             # Parallelize model onto GPUs using workaround due to Python bug
195 |             if device.type == 'cuda' and torch.cuda.device_count() > 1:
196 |                 m1_hat, m2_hat, attention = data_parallel_workaround(model, x, m)
197 |             else:
198 |                 m1_hat, m2_hat, attention = model(x, m)
199 | 
200 |             m1_loss = F.l1_loss(m1_hat, m)
201 |             m2_loss = F.l1_loss(m2_hat, m)
202 | 
203 |             loss = m1_loss + m2_loss
204 | 
205 |             optimizer.zero_grad()
206 |             loss.backward()
207 |             if hp.tts_clip_grad_norm is not None:
208 |                 grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.tts_clip_grad_norm)
209 |                 if np.isnan(grad_norm):
210 |                     print('grad_norm was NaN!')
211 | 
212 |             optimizer.step()
213 | 
214 |             running_loss += loss.item()
215 |             avg_loss = running_loss / i
216 | 
217 |             speed = i / (time.time() - strt)
218 | 
219 |             step = model.get_step()
220 |             k = step // 1000
221 | 
222 |             if step % hp.tts_checkpoint_every == 0:
223 |                 ckpt_name = f'taco_step{k}K'
224 |                 save_checkpoint('tts', paths, model, optimizer,
225 |                                 name=ckpt_name, is_silent=True)
226 | 
227 |             if attn_example in ids:
228 |                 idx = ids.index(attn_example)
229 |                 save_attention(np_now(attention[idx][:, :160]), paths.tts_attention / f'{step}')
230 |                 save_spectrogram(np_now(m2_hat[idx]), paths.tts_mel_plot / f'{step}', 600)
231 | 
232 |             msg = f'| Epoch: {e}/{epochs} ({i}/{total_iters}) | Loss: ' \
233 |                   f'{avg_loss:#.4} | {speed:#.2} steps/s | Step: {k}k | '
234 |             stream(msg)
235 | 
236 |         save_checkpoint('tts', paths, model, optimizer, is_silent=True)
237 |         model.log(paths.tts_log, msg)
238 |         print(' ')
239 |     print("Total training time for the Tacotron model is ", (time.time() - start))
240 | 
241 | 
242 | def create_gta_features(model: Tacotron, train_set, save_path: Path):
243 |     """
244 |     Creating Ground Truth Aligned features in case we use them for training later
245 |     """
246 |     device = next(model.parameters()).device  # use same device as model parameters
247 | 
248 |     iters = len(train_set)
249 | 
250 |     for i, (x, mels, ids, mel_lens) in enumerate(train_set, 1):
251 | 
252 |         x, mels = x.to(device), mels.to(device)
253 | 
254 |         with torch.no_grad():
255 |             _, gta, _ = model(x, mels)
256 | 
257 |         gta = gta.cpu().numpy()
258 | 
259 |         for j, item_id in enumerate(ids):
260 |             mel = gta[j][:, :mel_lens[j]]
261 |             mel = (mel + 4) / 8
262 |             np.save(save_path / f'{item_id}.npy', mel, allow_pickle=False)
263 | 
264 |         bar1 = 
progbar(i, iters) 265 | msg = f'{bar1} {i}/{iters} Batches ' 266 | stream(msg) 267 | 268 | 269 | def voc_train_loop(paths: Paths, model: WaveRNN, loss_func, optimizer, train_set, test_set, lr, total_steps, eps=None): 270 | """ 271 | Training WaveRNN model 272 | """ 273 | # Use same device as model parameters 274 | device = next(model.parameters()).device 275 | 276 | for g in optimizer.param_groups: 277 | g['lr'] = lr 278 | 279 | total_iters = len(train_set) 280 | 281 | epochs = eps if eps else (total_steps - model.get_step()) // total_iters + 1 282 | for e in range(1, epochs + 1): 283 | 284 | start = time.time() 285 | running_loss = 0. 286 | 287 | for i, (x, y, m) in enumerate(train_set, 1): 288 | x, m, y = x.to(device), m.to(device), y.to(device) 289 | 290 | # Parallelize model onto GPUS using workaround due to python bug 291 | if device.type == 'cuda' and torch.cuda.device_count() > 1: 292 | y_hat = data_parallel_workaround(model, x, m) 293 | else: 294 | y_hat = model(x, m) 295 | 296 | if model.mode == 'RAW': 297 | y_hat = y_hat.transpose(1, 2).unsqueeze(-1) 298 | 299 | elif model.mode == 'MOL': 300 | y = y.float() 301 | 302 | y = y.unsqueeze(-1) 303 | 304 | loss = loss_func(y_hat, y) 305 | 306 | optimizer.zero_grad() 307 | loss.backward() 308 | if hp.voc_clip_grad_norm is not None: 309 | grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), hp.voc_clip_grad_norm) 310 | if np.isnan(grad_norm): 311 | print('grad_norm was NaN!') 312 | optimizer.step() 313 | 314 | running_loss += loss.item() 315 | avg_loss = running_loss / i 316 | 317 | speed = i / (time.time() - start) 318 | 319 | step = model.get_step() 320 | k = step // 1000 321 | 322 | if step % hp.voc_checkpoint_every == 0: 323 | gen_testset(model, test_set, hp.voc_gen_at_checkpoint, hp.voc_gen_batched, 324 | hp.voc_target, hp.voc_overlap, paths.voc_output) 325 | ckpt_name = f'wave_step{k}K' 326 | save_checkpoint('voc', paths, model, optimizer, 327 | name=ckpt_name, is_silent=True) 328 | 329 | msg = f'| Epoch: {e}/{epochs} ({i}/{total_iters}) | Loss: ' \ 330 | f'{avg_loss:.4f} | {speed:.1f} steps/s | Step: {k}k | ' 331 | stream(msg) 332 | 333 | # Must save latest optimizer state to ensure that resuming training 334 | # doesn't produce artifacts 335 | save_checkpoint('voc', paths, model, optimizer, is_silent=True) 336 | model.log(paths.voc_log, msg) 337 | print(' ') 338 | 339 | 340 | if __name__ == "__main__": 341 | main() 342 | --------------------------------------------------------------------------------
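Usage note (illustrative addition, not a file from the repository): the minimal sketch below shows one way the helpers in src/optim_eval.py could be wired together for a single CPU synthesis run, assuming the WaveRNN-style utils/ and models/ packages are importable (for example with src/ on PYTHONPATH). The checkpoint file names, the output directory, and the exact import path of text_to_sequence are placeholders, not names taken from the repository.

    import torch
    from utils import hparams as hp
    from utils.text import text_to_sequence          # assumed location, as in the WaveRNN codebase
    from optim_eval import load_model, ipex_optimization, run_inference

    hp.configure('src/utils/hparams.py')             # same default hparams file as training.py

    device = torch.device('cpu')
    tts_model = load_model('tts_weights.pyt', typ='t', dv=device)   # hypothetical Tacotron checkpoint
    voc_model = load_model('voc_weights.pyt', typ='w', dv=device)   # hypothetical WaveRNN checkpoint

    # i_flag=True would apply IPEX optimizations (requires intel_extension_for_pytorch)
    tts_model, voc_model = ipex_optimization(tts_model, voc_model, i_flag=False)

    # Tokenize the sentence the same way inference.py does for its warm-up input
    sentence = "A short sentence to synthesise."
    inputs = [text_to_sequence(sentence.strip(), hp.tts_cleaner_names)]

    wav_path = run_inference(inputs, tts_model, voc_model,
                             batch=hp.voc_gen_batched, wrm_up=False,
                             sv_path='outputs')      # hypothetical output directory
    print('Generated:', wav_path)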