S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations
Abstract: Any-to-any voice conversion (VC) aims to convert the timbre of utterances from and to any speakers, seen or unseen during training. Various any-to-any VC approaches have been proposed, such as AUTOVC, AdaINVC, and FragmentVC. AUTOVC and AdaINVC utilize source and target encoders to disentangle the content and speaker information of the features. FragmentVC utilizes two encoders to encode source and target information and adopts cross attention to align the source and target features with similar phonetic content. Moreover, pre-trained features are adopted: AUTOVC uses d-vector to extract speaker information, and FragmentVC uses self-supervised learning (SSL) features like wav2vec 2.0 to extract phonetic content information. Different from previous works, we propose S2VC, which utilizes self-supervised features as both the source and target features for the VC model. The supervised phoneme posteriorgram (PPG), which is believed to be speaker-independent and is widely used in VC to extract content information, is chosen as a strong baseline against SSL features. Both objective and subjective evaluations show that models taking the SSL feature CPC as both source and target features outperform those taking PPG as the source feature, suggesting that SSL features have great potential for improving VC.
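To make the described pipeline concrete, below is a minimal PyTorch sketch of the cross-attention alignment the abstract outlines: SSL features of the source utterance act as queries over SSL features of the target speaker, so frames with similar phonetic content are matched. All module shapes, the linear placeholder encoders, and the mel-spectrogram output are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class CrossAttentionVC(nn.Module):
    """Sketch of an S2VC-style converter: source SSL features (content)
    attend over target SSL features (timbre) via cross attention."""

    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_mels: int = 80):
        super().__init__()
        self.src_encoder = nn.Linear(feat_dim, feat_dim)  # placeholder source encoder
        self.tgt_encoder = nn.Linear(feat_dim, feat_dim)  # placeholder target encoder
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.decoder = nn.Linear(feat_dim, n_mels)        # predicts mel frames

    def forward(self, src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
        # src_feats: (batch, T_src, feat_dim) SSL features (e.g. CPC) of the source
        # tgt_feats: (batch, T_tgt, feat_dim) SSL features of the target speaker
        q = self.src_encoder(src_feats)           # queries carry phonetic content
        kv = self.tgt_encoder(tgt_feats)          # keys/values carry target timbre
        aligned, _ = self.cross_attn(q, kv, kv)   # align frames with similar content
        return self.decoder(aligned)              # converted mel-spectrogram


# Toy usage with random tensors standing in for real SSL feature extractions.
model = CrossAttentionVC()
src = torch.randn(1, 120, 256)  # 120 source frames
tgt = torch.randn(1, 300, 256)  # 300 target-speaker frames
mel = model(src, tgt)           # -> (1, 120, 80) converted mel frames
```

In the real system, a vocoder would synthesize the waveform from the predicted mel-spectrogram; the point of the sketch is only the attention-based alignment between source and target SSL features.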
arXiv (Preprint)