├── docs ├── tts-arxiv-daily-wechat.json ├── _config.yml └── index.md ├── requirements.txt ├── assets ├── 4-ga-2-1.png ├── 4-ga-3-1.png ├── 4-ga-5-1.png ├── 4-ga-7.png ├── 4-ga-8.png ├── 4-ga-9.png └── 5-pages-1.png ├── .github ├── ISSUE_TEMPLATE │ ├── config.yml │ ├── feature_request.md │ ├── question.md │ └── bug_report.md ├── PULL_REQUEST_TEMPLATE.md └── workflows │ ├── cv-arxiv-daily.yml │ └── update_paper_links.yml ├── .gitignore ├── config.yaml ├── CODE_OF_CONDUCT.md ├── LICENSE └── daily_arxiv.py /docs/tts-arxiv-daily-wechat.json: -------------------------------------------------------------------------------- 1 | {} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | arxiv 3 | pyyaml -------------------------------------------------------------------------------- /assets/4-ga-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-2-1.png -------------------------------------------------------------------------------- /assets/4-ga-3-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-3-1.png -------------------------------------------------------------------------------- /assets/4-ga-5-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-5-1.png -------------------------------------------------------------------------------- /assets/4-ga-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-7.png -------------------------------------------------------------------------------- /assets/4-ga-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-8.png -------------------------------------------------------------------------------- /assets/4-ga-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-9.png -------------------------------------------------------------------------------- /assets/5-pages-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/5-pages-1.png -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/config.yml: -------------------------------------------------------------------------------- 1 | # Configuration: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository 2 | 3 | blank_issues_enabled: false 4 | -------------------------------------------------------------------------------- /.github/PULL_REQUEST_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | 3 | 4 | 5 | ## Related Issue 6 | 7 | 8 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | CMakeLists.txt.user 2 | CMakeLists_modified.txt 3 | 4 | .DS_Store 5 | 6 | build/ 7 | 8 | lib/ 9 | bin/ 10 | 11 | cmake_modules/ 12 | cmake-build-debug/ 13 | .idea/ 14 | .vscode/ 15 | *.pyc 16 | 17 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: 🚀 Feature request 3 | about: Suggest an idea for this project 🏖 4 | title: "" 5 | labels: enhancement 6 | assignees: 7 | --- 8 | 9 | ## 🚀 Feature Request 10 | 11 | 12 | 13 | ## 📎 Additional context 14 | 15 | 16 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/question.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: ❓ Question 3 | about: Ask a question about this project 🎓 4 | title: "" 5 | labels: question 6 | assignees: 7 | --- 8 | 9 | ## Checklist 10 | 11 | 12 | 13 | - [ ] I've searched the project's [`issues`] 14 | 15 | ## ❓ Question 16 | 17 | 18 | 19 | How can I [...]? 20 | 21 | Is it possible to [...]? 22 | 23 | ## 📎 Additional context 24 | 25 | 26 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: 🐛 Bug report 3 | about: If something isn't working 🔧 4 | title: "" 5 | labels: bug 6 | assignees: 7 | --- 8 | 9 | ## 🐛 Bug Report 10 | 11 | 12 | 13 | ## 🔬 How To Reproduce 14 | 15 | Steps to reproduce the behavior: 16 | 17 | 1. ... 18 | 19 | ### Environment 20 | 21 | - OS: [e.g. Linux / Windows / macOS] 22 | - Python version, get it with: 23 | 24 | ```bash 25 | python --version 26 | ``` 27 | 28 | ## 📎 Additional context 29 | 30 | 31 | -------------------------------------------------------------------------------- /docs/_config.yml: -------------------------------------------------------------------------------- 1 | title: CV Arxiv Daily 2 | description: Automatically Update CV Papers Daily using Github Actions (Update Every 8th hours) 3 | show_downloads: true 4 | 5 | github: 6 | zip_url: https://github.com/Vincentqyw/cv-arxiv-daily 7 | another_url: https://github.com/Vincentqyw/cv-arxiv-daily 8 | 9 | ## add remote theme 10 | remote_theme: jekyll/minima@v2.5.1 11 | 12 | # minima: 13 | skin: dark 14 | social_links: 15 | twitter: AlphaRealcat 16 | github: vincentqyw 17 | 18 | plugins: 19 | - jekyll-remote-theme # add this line to the plugins list if you already have one 20 | - jekyll-feed 21 | - jekyll-seo-tag 22 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | # TODO: add papers by configuration file 2 | base_url: "https://arxiv.paperswithcode.com/api/v0/papers/" 3 | user_name: "wetdog" 4 | repo_name: "cv-arxiv-daily" 5 | show_authors: True 6 | show_links: True 7 | show_badge: True 8 | max_results: 10 9 | 10 | publish_readme: True 11 | publish_gitpage: True 12 | publish_wechat: False 13 | 14 | # file paths 15 | json_readme_path: './docs/tts-arxiv-daily.json' 16 | json_gitpage_path: './docs/tts-arxiv-daily-web.json' 17 | json_wechat_path: './docs/tts-arxiv-daily-wechat.json' 18 | 19 | md_readme_path: 'README.md' 20 | md_gitpage_path: './docs/index.md' 21 | md_wechat_path: './docs/wechat.md' 22 | 23 | # keywords to search 24 | keywords: 25 | "TTS": 26 | filters: ["TTS", "Text to speech"] 27 | -------------------------------------------------------------------------------- /.github/workflows/cv-arxiv-daily.yml: -------------------------------------------------------------------------------- 1 | # This is a basic workflow to help you get started with Actions 2 | 3 | name: Run Arxiv Papers Daily 4 | 5 | # Controls when the workflow will run 6 | on: 7 | # Allows you to run this workflow manually from the Actions tab 8 | workflow_dispatch: 9 | schedule: 10 | - cron: "0 0/12 * * *" #'*/60 * * * *' 11 | # Triggers the workflow on push or pull request events but only for the main branch 12 | # push: 13 | # branches: 14 | # - main 15 | 16 | env: 17 | 18 | GITHUB_USER_NAME: wetdog 19 | GITHUB_USER_EMAIL: jose091@gmail.com 20 | 21 | 22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel 23 | jobs: 24 | # This workflow contains a single job called "build" 25 | build: 26 | name: update 27 | # The type of runner that the job will run on 28 | runs-on: ubuntu-latest 29 | 30 | # Steps represent a sequence of tasks that will be executed as part of the job 31 | steps: 32 | - name: Checkout 33 | uses: actions/checkout@v3 34 | 35 | - name: Set up Python Env 36 | uses: actions/setup-python@v4 37 | with: 38 | python-version: '3.10' 39 | #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified 40 | - name: Install dependencies 41 | run: | 42 | python -m pip install --upgrade pip 43 | pip install arxiv 44 | pip install requests 45 | pip install pyyaml 46 | 47 | - name: Run daily arxiv 48 | run: | 49 | python daily_arxiv.py 50 | 51 | - name: Push new cv-arxiv-daily.md 52 | uses: github-actions-x/commit@v2.9 53 | with: 54 | github-token: ${{ secrets.GITHUB_TOKEN }} 55 | commit-message: "Github Action Automatic Update TTS Arxiv Papers" 56 | files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md 57 | rebase: 'true' 58 | name: ${{ env.GITHUB_USER_NAME }} 59 | email: ${{ env.GITHUB_USER_EMAIL }} 60 | -------------------------------------------------------------------------------- /.github/workflows/update_paper_links.yml: -------------------------------------------------------------------------------- 1 | # This is a basic workflow to help you get started with Actions 2 | 3 | name: Run Update Paper Links Weekly 4 | 5 | # Controls when the workflow will run 6 | on: 7 | # Allows you to run this workflow manually from the Actions tab 8 | workflow_dispatch: 9 | schedule: 10 | - cron: "0 8 * * 1" #Run At 08:00 on Monday 11 | # Triggers the workflow on push or pull request events but only for the main branch 12 | # push: 13 | # branches: 14 | # - main 15 | 16 | env: 17 | 18 | GITHUB_USER_NAME: wetdog 19 | GITHUB_USER_EMAIL: jose091@gmail.com 20 | 21 | 22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel 23 | jobs: 24 | # This workflow contains a single job called "build" 25 | build: 26 | name: update 27 | # The type of runner that the job will run on 28 | runs-on: ubuntu-latest 29 | 30 | # Steps represent a sequence of tasks that will be executed as part of the job 31 | steps: 32 | - name: Checkout 33 | uses: actions/checkout@v3 34 | 35 | - name: Set up Python Env 36 | uses: actions/setup-python@v4 37 | with: 38 | python-version: '3.10' 39 | #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified 40 | - name: Install dependencies 41 | run: | 42 | python -m pip install --upgrade pip 43 | pip install arxiv 44 | pip install requests 45 | pip install pyyaml 46 | 47 | - name: Run daily arxiv 48 | run: | 49 | python daily_arxiv.py --update_paper_links 50 | 51 | - name: Push new tts-arxiv-daily.md 52 | uses: github-actions-x/commit@v2.9 53 | with: 54 | github-token: ${{ secrets.GITHUB_TOKEN }} 55 | commit-message: "Github Action Automatic Update CV Arxiv Papers" 56 | files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md 57 | rebase: 'true' 58 | name: ${{ env.GITHUB_USER_NAME }} 59 | email: ${{ env.GITHUB_USER_EMAIL }} 60 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | alpharealcat@gmail.com. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 129 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /daily_arxiv.py: -------------------------------------------------------------------------------- 1 | import os 2 | import re 3 | import json 4 | import arxiv 5 | import yaml 6 | import logging 7 | import argparse 8 | import datetime 9 | import requests 10 | 11 | logging.basicConfig(format='[%(asctime)s %(levelname)s] %(message)s', 12 | datefmt='%m/%d/%Y %H:%M:%S', 13 | level=logging.INFO) 14 | 15 | base_url = "https://arxiv.paperswithcode.com/api/v0/papers/" 16 | github_url = "https://api.github.com/search/repositories" 17 | arxiv_url = "http://arxiv.org/" 18 | 19 | def load_config(config_file:str) -> dict: 20 | ''' 21 | config_file: input config file path 22 | return: a dict of configuration 23 | ''' 24 | # make filters pretty 25 | def pretty_filters(**config) -> dict: 26 | keywords = dict() 27 | EXCAPE = '\"' 28 | QUOTA = '' # NO-USE 29 | OR = 'OR' # TODO 30 | def parse_filters(filters:list): 31 | ret = '' 32 | for idx in range(0,len(filters)): 33 | filter = filters[idx] 34 | if len(filter.split()) > 1: 35 | ret += (EXCAPE + filter + EXCAPE) 36 | else: 37 | ret += (QUOTA + filter + QUOTA) 38 | if idx != len(filters) - 1: 39 | ret += OR 40 | return ret 41 | for k,v in config['keywords'].items(): 42 | keywords[k] = parse_filters(v['filters']) 43 | return keywords 44 | with open(config_file,'r') as f: 45 | config = yaml.load(f,Loader=yaml.FullLoader) 46 | config['kv'] = pretty_filters(**config) 47 | logging.info(f'config = {config}') 48 | return config 49 | 50 | def get_authors(authors, first_author = False): 51 | output = str() 52 | if first_author == False: 53 | output = ", ".join(str(author) for author in authors) 54 | else: 55 | output = authors[0] 56 | return output 57 | def sort_papers(papers): 58 | output = dict() 59 | keys = list(papers.keys()) 60 | keys.sort(reverse=True) 61 | for key in keys: 62 | output[key] = papers[key] 63 | return output 64 | import requests 65 | 66 | def get_code_link(qword:str) -> str: 67 | """ 68 | This short function was auto-generated by ChatGPT. 69 | I only renamed some params and added some comments. 70 | @param qword: query string, eg. arxiv ids and paper titles 71 | @return paper_code in github: string, if not found, return None 72 | """ 73 | # query = f"arxiv:{arxiv_id}" 74 | query = f"{qword}" 75 | params = { 76 | "q": query, 77 | "sort": "stars", 78 | "order": "desc" 79 | } 80 | r = requests.get(github_url, params=params) 81 | results = r.json() 82 | code_link = None 83 | if results["total_count"] > 0: 84 | code_link = results["items"][0]["html_url"] 85 | return code_link 86 | 87 | def get_daily_papers(topic,query="slam", max_results=2): 88 | """ 89 | @param topic: str 90 | @param query: str 91 | @return paper_with_code: dict 92 | """ 93 | # output 94 | content = dict() 95 | content_to_web = dict() 96 | search_engine = arxiv.Search( 97 | query = query, 98 | max_results = max_results, 99 | sort_by = arxiv.SortCriterion.SubmittedDate 100 | ) 101 | 102 | for result in search_engine.results(): 103 | 104 | paper_id = result.get_short_id() 105 | paper_title = result.title 106 | paper_url = result.entry_id 107 | code_url = base_url + paper_id #TODO 108 | paper_abstract = result.summary.replace("\n"," ") 109 | paper_authors = get_authors(result.authors) 110 | paper_first_author = get_authors(result.authors,first_author = True) 111 | primary_category = result.primary_category 112 | publish_time = result.published.date() 113 | update_time = result.updated.date() 114 | comments = result.comment 115 | 116 | logging.info(f"Time = {update_time} title = {paper_title} author = {paper_first_author}") 117 | 118 | # eg: 2108.09112v1 -> 2108.09112 119 | ver_pos = paper_id.find('v') 120 | if ver_pos == -1: 121 | paper_key = paper_id 122 | else: 123 | paper_key = paper_id[0:ver_pos] 124 | paper_url = arxiv_url + 'abs/' + paper_key 125 | 126 | try: 127 | # source code link 128 | r = requests.get(code_url).json() 129 | repo_url = None 130 | if "official" in r and r["official"]: 131 | repo_url = r["official"]["url"] 132 | # TODO: not found, two more chances 133 | # else: 134 | # repo_url = get_code_link(paper_title) 135 | # if repo_url is None: 136 | # repo_url = get_code_link(paper_key) 137 | if repo_url is not None: 138 | content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|**[link]({})**|\n".format( 139 | update_time,paper_title,paper_first_author,paper_key,paper_url,repo_url) 140 | content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({}), Code: **[{}]({})**".format( 141 | update_time,paper_title,paper_first_author,paper_url,paper_url,repo_url,repo_url) 142 | 143 | else: 144 | content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|null|\n".format( 145 | update_time,paper_title,paper_first_author,paper_key,paper_url) 146 | content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({})".format( 147 | update_time,paper_title,paper_first_author,paper_url,paper_url) 148 | 149 | # TODO: select useful comments 150 | comments = None 151 | if comments != None: 152 | content_to_web[paper_key] += f", {comments}\n" 153 | else: 154 | content_to_web[paper_key] += f"\n" 155 | 156 | except Exception as e: 157 | logging.error(f"exception: {e} with id: {paper_key}") 158 | 159 | data = {topic:content} 160 | data_web = {topic:content_to_web} 161 | return data,data_web 162 | 163 | def update_paper_links(filename): 164 | ''' 165 | weekly update paper links in json file 166 | ''' 167 | def parse_arxiv_string(s): 168 | parts = s.split("|") 169 | date = parts[1].strip() 170 | title = parts[2].strip() 171 | authors = parts[3].strip() 172 | arxiv_id = parts[4].strip() 173 | code = parts[5].strip() 174 | arxiv_id = re.sub(r'v\d+', '', arxiv_id) 175 | return date,title,authors,arxiv_id,code 176 | 177 | with open(filename,"r") as f: 178 | content = f.read() 179 | if not content: 180 | m = {} 181 | else: 182 | m = json.loads(content) 183 | 184 | json_data = m.copy() 185 | 186 | for keywords,v in json_data.items(): 187 | logging.info(f'keywords = {keywords}') 188 | for paper_id,contents in v.items(): 189 | contents = str(contents) 190 | 191 | update_time, paper_title, paper_first_author, paper_url, code_url = parse_arxiv_string(contents) 192 | 193 | contents = "|{}|{}|{}|{}|{}|\n".format(update_time,paper_title,paper_first_author,paper_url,code_url) 194 | json_data[keywords][paper_id] = str(contents) 195 | logging.info(f'paper_id = {paper_id}, contents = {contents}') 196 | 197 | valid_link = False if '|null|' in contents else True 198 | if valid_link: 199 | continue 200 | try: 201 | code_url = base_url + paper_id #TODO 202 | r = requests.get(code_url).json() 203 | repo_url = None 204 | if "official" in r and r["official"]: 205 | repo_url = r["official"]["url"] 206 | if repo_url is not None: 207 | new_cont = contents.replace('|null|',f'|**[link]({repo_url})**|') 208 | logging.info(f'ID = {paper_id}, contents = {new_cont}') 209 | json_data[keywords][paper_id] = str(new_cont) 210 | 211 | except Exception as e: 212 | logging.error(f"exception: {e} with id: {paper_id}") 213 | # dump to json file 214 | with open(filename,"w") as f: 215 | json.dump(json_data,f) 216 | 217 | def update_json_file(filename,data_dict): 218 | ''' 219 | daily update json file using data_dict 220 | ''' 221 | with open(filename,"r") as f: 222 | content = f.read() 223 | if not content: 224 | m = {} 225 | else: 226 | m = json.loads(content) 227 | 228 | json_data = m.copy() 229 | 230 | # update papers in each keywords 231 | for data in data_dict: 232 | for keyword in data.keys(): 233 | papers = data[keyword] 234 | 235 | if keyword in json_data.keys(): 236 | json_data[keyword].update(papers) 237 | else: 238 | json_data[keyword] = papers 239 | 240 | with open(filename,"w") as f: 241 | json.dump(json_data,f) 242 | 243 | def json_to_md(filename,md_filename, 244 | task = '', 245 | to_web = False, 246 | use_title = True, 247 | use_tc = True, 248 | show_badge = True, 249 | use_b2t = True): 250 | """ 251 | @param filename: str 252 | @param md_filename: str 253 | @return None 254 | """ 255 | def pretty_math(s:str) -> str: 256 | ret = '' 257 | match = re.search(r"\$.*\$", s) 258 | if match == None: 259 | return s 260 | math_start,math_end = match.span() 261 | space_trail = space_leading = '' 262 | if s[:math_start][-1] != ' ' and '*' != s[:math_start][-1]: space_trail = ' ' 263 | if s[math_end:][0] != ' ' and '*' != s[math_end:][0]: space_leading = ' ' 264 | ret += s[:math_start] 265 | ret += f'{space_trail}${match.group()[1:-1].strip()}${space_leading}' 266 | ret += s[math_end:] 267 | return ret 268 | 269 | DateNow = datetime.date.today() 270 | DateNow = str(DateNow) 271 | DateNow = DateNow.replace('-','.') 272 | 273 | with open(filename,"r") as f: 274 | content = f.read() 275 | if not content: 276 | data = {} 277 | else: 278 | data = json.loads(content) 279 | 280 | # clean README.md if daily already exist else create it 281 | with open(md_filename,"w+") as f: 282 | pass 283 | 284 | # write data into README.md 285 | with open(md_filename,"a+") as f: 286 | 287 | if (use_title == True) and (to_web == True): 288 | f.write("---\n" + "layout: default\n" + "---\n\n") 289 | 290 | if show_badge == True: 291 | f.write(f"[![Contributors][contributors-shield]][contributors-url]\n") 292 | f.write(f"[![Forks][forks-shield]][forks-url]\n") 293 | f.write(f"[![Stargazers][stars-shield]][stars-url]\n") 294 | f.write(f"[![Issues][issues-shield]][issues-url]\n\n") 295 | 296 | if use_title == True: 297 | #f.write(("


CV-ARXIV-DAILY" 298 | # "
Automatically Update CV Papers Daily

\n")) 299 | f.write("## Updated on " + DateNow + "\n") 300 | else: 301 | f.write("> Updated on " + DateNow + "\n") 302 | 303 | # TODO: add usage 304 | f.write("> Usage instructions: [here](./docs/README.md#usage)\n\n") 305 | f.write("> This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily)\n\n") 306 | 307 | #Add: table of contents 308 | if use_tc == True: 309 | f.write("
\n") 310 | f.write(" Table of Contents\n") 311 | f.write("
    \n") 312 | for keyword in data.keys(): 313 | day_content = data[keyword] 314 | if not day_content: 315 | continue 316 | kw = keyword.replace(' ','-') 317 | f.write(f"
  1. {keyword}
  2. \n") 318 | f.write("
\n") 319 | f.write("
\n\n") 320 | 321 | for keyword in data.keys(): 322 | day_content = data[keyword] 323 | if not day_content: 324 | continue 325 | # the head of each part 326 | f.write(f"## {keyword}\n\n") 327 | 328 | if use_title == True : 329 | if to_web == False: 330 | f.write("|Publish Date|Title|Authors|PDF|Code|\n" + "|---|---|---|---|---|\n") 331 | else: 332 | f.write("| Publish Date | Title | Authors | PDF | Code |\n") 333 | f.write("|:---------|:-----------------------|:---------|:------|:------|\n") 334 | 335 | # sort papers by date 336 | day_content = sort_papers(day_content) 337 | 338 | for _,v in day_content.items(): 339 | if v is not None: 340 | f.write(pretty_math(v)) # make latex pretty 341 | 342 | f.write(f"\n") 343 | 344 | #Add: back to top 345 | if use_b2t: 346 | top_info = f"#Updated on {DateNow}" 347 | top_info = top_info.replace(' ','-').replace('.','') 348 | f.write(f"

(back to top)

\n\n") 349 | 350 | if show_badge == True: 351 | # we don't like long string, break it! 352 | f.write((f"[contributors-shield]: https://img.shields.io/github/" 353 | f"contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge\n")) 354 | f.write((f"[contributors-url]: https://github.com/Vincentqyw/" 355 | f"cv-arxiv-daily/graphs/contributors\n")) 356 | f.write((f"[forks-shield]: https://img.shields.io/github/forks/Vincentqyw/" 357 | f"cv-arxiv-daily.svg?style=for-the-badge\n")) 358 | f.write((f"[forks-url]: https://github.com/Vincentqyw/" 359 | f"cv-arxiv-daily/network/members\n")) 360 | f.write((f"[stars-shield]: https://img.shields.io/github/stars/Vincentqyw/" 361 | f"cv-arxiv-daily.svg?style=for-the-badge\n")) 362 | f.write((f"[stars-url]: https://github.com/Vincentqyw/" 363 | f"cv-arxiv-daily/stargazers\n")) 364 | f.write((f"[issues-shield]: https://img.shields.io/github/issues/Vincentqyw/" 365 | f"cv-arxiv-daily.svg?style=for-the-badge\n")) 366 | f.write((f"[issues-url]: https://github.com/Vincentqyw/" 367 | f"cv-arxiv-daily/issues\n\n")) 368 | 369 | logging.info(f"{task} finished") 370 | 371 | def demo(**config): 372 | # TODO: use config 373 | data_collector = [] 374 | data_collector_web= [] 375 | 376 | keywords = config['kv'] 377 | max_results = config['max_results'] 378 | publish_readme = config['publish_readme'] 379 | publish_gitpage = config['publish_gitpage'] 380 | publish_wechat = config['publish_wechat'] 381 | show_badge = config['show_badge'] 382 | 383 | b_update = config['update_paper_links'] 384 | logging.info(f'Update Paper Link = {b_update}') 385 | if config['update_paper_links'] == False: 386 | logging.info(f"GET daily papers begin") 387 | for topic, keyword in keywords.items(): 388 | logging.info(f"Keyword: {topic}") 389 | data, data_web = get_daily_papers(topic, query = keyword, 390 | max_results = max_results) 391 | data_collector.append(data) 392 | data_collector_web.append(data_web) 393 | print("\n") 394 | logging.info(f"GET daily papers end") 395 | 396 | # 1. update README.md file 397 | if publish_readme: 398 | json_file = config['json_readme_path'] 399 | md_file = config['md_readme_path'] 400 | # update paper links 401 | if config['update_paper_links']: 402 | update_paper_links(json_file) 403 | else: 404 | # update json data 405 | update_json_file(json_file,data_collector) 406 | # json data to markdown 407 | json_to_md(json_file,md_file, task ='Update Readme', \ 408 | show_badge = show_badge) 409 | 410 | # 2. update docs/index.md file (to gitpage) 411 | if publish_gitpage: 412 | json_file = config['json_gitpage_path'] 413 | md_file = config['md_gitpage_path'] 414 | # TODO: duplicated update paper links!!! 415 | if config['update_paper_links']: 416 | update_paper_links(json_file) 417 | else: 418 | update_json_file(json_file,data_collector) 419 | json_to_md(json_file, md_file, task ='Update GitPage', \ 420 | to_web = True, show_badge = show_badge, \ 421 | use_tc=False, use_b2t=False) 422 | 423 | # 3. Update docs/wechat.md file 424 | if publish_wechat: 425 | json_file = config['json_wechat_path'] 426 | md_file = config['md_wechat_path'] 427 | # TODO: duplicated update paper links!!! 428 | if config['update_paper_links']: 429 | update_paper_links(json_file) 430 | else: 431 | update_json_file(json_file, data_collector_web) 432 | json_to_md(json_file, md_file, task ='Update Wechat', \ 433 | to_web=False, use_title= False, show_badge = show_badge) 434 | 435 | if __name__ == "__main__": 436 | parser = argparse.ArgumentParser() 437 | parser.add_argument('--config_path',type=str, default='config.yaml', 438 | help='configuration file path') 439 | parser.add_argument('--update_paper_links', default=False, 440 | action="store_true",help='whether to update paper links etc.') 441 | args = parser.parse_args() 442 | config = load_config(args.config_path) 443 | config = {**config, 'update_paper_links':args.update_paper_links} 444 | demo(**config) 445 | -------------------------------------------------------------------------------- /docs/index.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: default 3 | --- 4 | 5 | [![Contributors][contributors-shield]][contributors-url] 6 | [![Forks][forks-shield]][forks-url] 7 | [![Stargazers][stars-shield]][stars-url] 8 | [![Issues][issues-shield]][issues-url] 9 | 10 | ## Updated on 2025.12.22 11 | > Usage instructions: [here](./docs/README.md#usage) 12 | 13 | > This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily) 14 | 15 | ## TTS 16 | 17 | | Publish Date | Title | Authors | PDF | Code | 18 | |:---------|:-----------------------|:---------|:------|:------| 19 | |**2025-07-23**|**BoSS: Beyond-Semantic Speech**|Qing Wang et.al.|[2507.17563](http://arxiv.org/abs/2507.17563)|null| 20 | |**2025-07-21**|**A2TTS: TTS for Low Resource Indian Languages**|Ayush Singh Bhadoriya et.al.|[2507.15272](http://arxiv.org/abs/2507.15272)|null| 21 | |**2025-07-22**|**Hear Your Code Fail, Voice-Assisted Debugging for Python**|Sayed Mahbub Hasan Amiri et.al.|[2507.15007](http://arxiv.org/abs/2507.15007)|null| 22 | |**2025-07-20**|**DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis**|Yinghao Aaron Li et.al.|[2507.14988](http://arxiv.org/abs/2507.14988)|null| 23 | |**2025-07-17**|**NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech**|Maksim Borisov et.al.|[2507.13155](http://arxiv.org/abs/2507.13155)|null| 24 | |**2025-07-17**|**Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication**|Tianyu Song et.al.|[2507.13052](http://arxiv.org/abs/2507.13052)|null| 25 | |**2025-07-17**|**Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes**|Zhou Feng et.al.|[2507.12932](http://arxiv.org/abs/2507.12932)|null| 26 | |**2025-07-16**|**Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations**|Yichen Han et.al.|[2507.12197](http://arxiv.org/abs/2507.12197)|null| 27 | |**2025-07-16**|**EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis**|Haoxun Li et.al.|[2507.12015](http://arxiv.org/abs/2507.12015)|null| 28 | |**2025-07-15**|**Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection**|Ivan Viakhirev et.al.|[2507.11777](http://arxiv.org/abs/2507.11777)|null| 29 | |**2025-07-15**|**P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge**|Marvin Sach et.al.|[2507.11306](http://arxiv.org/abs/2507.11306)|null| 30 | |**2025-07-14**|**Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition**|Mengzhe Geng et.al.|[2507.10827](http://arxiv.org/abs/2507.10827)|null| 31 | |**2025-07-14**|**An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments**|Mikko Korkiakoski et.al.|[2507.10469](http://arxiv.org/abs/2507.10469)|null| 32 | |**2025-07-12**|**ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching**|Han Zhu et.al.|[2507.09318](http://arxiv.org/abs/2507.09318)|null| 33 | |**2025-07-12**|**Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning**|Dominika Woszczyk et.al.|[2507.09310](http://arxiv.org/abs/2507.09310)|null| 34 | |**2025-07-12**|**ClaritySpeech: Dementia Obfuscation in Speech**|Dominika Woszczyk et.al.|[2507.09282](http://arxiv.org/abs/2507.09282)|null| 35 | |**2025-07-11**|**Exploiting Leaderboards for Large-Scale Distribution of Malicious Models**|Anshuman Suri et.al.|[2507.08983](http://arxiv.org/abs/2507.08983)|null| 36 | |**2025-07-11**|**Unlocking Speech Instruction Data Potential with Query Rewriting**|Yonghua Hei et.al.|[2507.08603](http://arxiv.org/abs/2507.08603)|null| 37 | |**2025-07-11**|**MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling**|Jingjing Tang et.al.|[2507.08530](http://arxiv.org/abs/2507.08530)|null| 38 | |**2025-07-11**|**Active Learning for Text-to-Speech Synthesis with Informative Sample Collection**|Kentaro Seki et.al.|[2507.08319](http://arxiv.org/abs/2507.08319)|null| 39 | |**2025-07-10**|**SecureSpeech: Prompt-based Speaker and Content Protection**|Belinda Soh Hui Hui et.al.|[2507.07799](http://arxiv.org/abs/2507.07799)|null| 40 | |**2025-07-09**|**Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents**|Zackary Rackauckas et.al.|[2507.06483](http://arxiv.org/abs/2507.06483)|null| 41 | |**2025-07-08**|**Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis**|Xintong Hu et.al.|[2507.06116](http://arxiv.org/abs/2507.06116)|null| 42 | |**2025-07-08**|**Differentiable Reward Optimization for LLM based TTS system**|Changfeng Gao et.al.|[2507.05911](http://arxiv.org/abs/2507.05911)|null| 43 | |**2025-07-08**|**OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model**|Chen Wang et.al.|[2507.05177](http://arxiv.org/abs/2507.05177)|null| 44 | |**2025-07-07**|**Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2507.04598](http://arxiv.org/abs/2507.04598)|null| 45 | |**2025-07-06**|**TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet**|Jaeseok Jeong et.al.|[2507.04349](http://arxiv.org/abs/2507.04349)|null| 46 | |**2025-07-05**|**PresentAgent: Multimodal Agent for Presentation Video Generation**|Jingwei Shi et.al.|[2507.04036](http://arxiv.org/abs/2507.04036)|null| 47 | |**2025-07-05**|**Prosody Labeling with Phoneme-BERT and Speech Foundation Models**|Tomoki Koriyama et.al.|[2507.03912](http://arxiv.org/abs/2507.03912)|null| 48 | |**2025-07-05**|**Traceable TTS: Toward Watermark-Free TTS with Strong Traceability**|Yuxiang Zhao et.al.|[2507.03887](http://arxiv.org/abs/2507.03887)|null| 49 | |**2025-07-03**|**Open-Source System for Multilingual Translation and Cloned Speech Synthesis**|Mateo Cámara et.al.|[2507.02530](http://arxiv.org/abs/2507.02530)|null| 50 | |**2025-07-03**|**JoyTTS: LLM-based Spoken Chatbot With Voice Cloning**|Fangru Zhou et.al.|[2507.02380](http://arxiv.org/abs/2507.02380)|null| 51 | |**2025-07-02**|**A Dataset for Automatic Assessment of TTS Quality in Spanish**|Alejandro Sosa Welford et.al.|[2507.01805](http://arxiv.org/abs/2507.01805)|null| 52 | |**2025-07-02**|**SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech**|Cheng Zhuangfei et.al.|[2507.01348](http://arxiv.org/abs/2507.01348)|null| 53 | |**2025-07-02**|**Multi-interaction TTS toward professional recording reproduction**|Hiroki Kanagawa et.al.|[2507.00808](http://arxiv.org/abs/2507.00808)|null| 54 | |**2025-06-30**|**Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis**|Paul Mayer et.al.|[2507.00227](http://arxiv.org/abs/2507.00227)|null| 55 | |**2025-06-30**|**JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching**|Mingi Kwon et.al.|[2506.23552](http://arxiv.org/abs/2506.23552)|null| 56 | |**2025-06-29**|**You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties**|Paige Tuttösí et.al.|[2506.23367](http://arxiv.org/abs/2506.23367)|null| 57 | |**2025-06-23**|**IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech**|Siyi Zhou et.al.|[2506.21619](http://arxiv.org/abs/2506.21619)|null| 58 | |**2025-06-25**|**An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS**|Marie Kunešová et.al.|[2506.20190](http://arxiv.org/abs/2506.20190)|null| 59 | |**2025-06-24**|**TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems**|Christoph Minixhofer et.al.|[2506.19441](http://arxiv.org/abs/2506.19441)|null| 60 | |**2025-06-23**|**Selecting N-lowest scores for training MOS prediction models**|Yuto Kondo et.al.|[2506.18326](http://arxiv.org/abs/2506.18326)|null| 61 | |**2025-06-23**|**Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting**|Yuto Kondo et.al.|[2506.18307](http://arxiv.org/abs/2506.18307)|null| 62 | |**2025-06-23**|**JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles**|Yuto Kondo et.al.|[2506.18296](http://arxiv.org/abs/2506.18296)|null| 63 | |**2025-06-20**|**RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching**|Hyun Joon Park et.al.|[2506.16741](http://arxiv.org/abs/2506.16741)|null| 64 | |**2025-06-20**|**LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization**|Daejin Jo et.al.|[2506.16738](http://arxiv.org/abs/2506.16738)|null| 65 | |**2025-06-19**|**Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement**|Tuan-Nam Nguyen et.al.|[2506.16580](http://arxiv.org/abs/2506.16580)|null| 66 | |**2025-06-19**|**InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems**|Kexin Huang et.al.|[2506.16381](http://arxiv.org/abs/2506.16381)|**[link](https://github.com/kexinhuang19/instructttseval)**| 67 | |**2025-06-19**|**Optimizing Multilingual Text-To-Speech with Accents & Emotions**|Pranav Pawar et.al.|[2506.16310](http://arxiv.org/abs/2506.16310)|null| 68 | |**2025-06-18**|**TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data**|Kentaro Seki et.al.|[2506.15614](http://arxiv.org/abs/2506.15614)|null| 69 | |**2025-06-18**|**PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction**|Shufan Li et.al.|[2506.15556](http://arxiv.org/abs/2506.15556)|null| 70 | |**2025-06-18**|**EmojiVoice: Towards long-term controllable expressivity in robot speech**|Paige Tuttösí et.al.|[2506.15085](http://arxiv.org/abs/2506.15085)|null| 71 | |**2025-06-17**|**Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification**|Yiyang Zhao et.al.|[2506.14226](http://arxiv.org/abs/2506.14226)|null| 72 | |**2025-06-16**|**EmoNews: A Spoken Dialogue System for Expressive News Conversations**|Ryuki Matsuura et.al.|[2506.13894](http://arxiv.org/abs/2506.13894)|null| 73 | |**2025-06-16**|**ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching**|Han Zhu et.al.|[2506.13053](http://arxiv.org/abs/2506.13053)|null| 74 | |**2025-06-14**|**StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling**|Hui Wang et.al.|[2506.12570](http://arxiv.org/abs/2506.12570)|null| 75 | |**2025-06-14**|**Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech**|Yakov Kolani et.al.|[2506.12311](http://arxiv.org/abs/2506.12311)|null| 76 | |**2025-06-11**|**S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder**|Yu Pan et.al.|[2506.11160](http://arxiv.org/abs/2506.11160)|null| 77 | |**2025-06-16**|**A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data**|Cheng-Kang Chou et.al.|[2506.11130](http://arxiv.org/abs/2506.11130)|null| 78 | |**2025-06-10**|**GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions**|Wenkang Han et.al.|[2506.11127](http://arxiv.org/abs/2506.11127)|null| 79 | |**2025-06-10**|**ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams**|Freddie Grabovski et.al.|[2506.11125](http://arxiv.org/abs/2506.11125)|null| 80 | |**2025-06-12**|**Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs**|Hayato Futami et.al.|[2506.10299](http://arxiv.org/abs/2506.10299)|null| 81 | |**2025-06-11**|**UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching**|Neta Glazer et.al.|[2506.09874](http://arxiv.org/abs/2506.09874)|null| 82 | |**2025-06-15**|**EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection**|Christoph Schuhmann et.al.|[2506.09827](http://arxiv.org/abs/2506.09827)|null| 83 | |**2025-06-11**|**Ming-Omni: A Unified Multimodal Model for Perception and Generation**|Inclusion AI et.al.|[2506.09344](http://arxiv.org/abs/2506.09344)|**[link](https://github.com/inclusionai/ming)**| 84 | |**2025-06-10**|**A Review on Score-based Generative Models for Audio Applications**|Ge Zhu et.al.|[2506.08457](http://arxiv.org/abs/2506.08457)|null| 85 | |**2025-06-09**|**Seeing Voices: Generating A-Roll Video from Audio with Mirage**|Aditi Sundararaman et.al.|[2506.08279](http://arxiv.org/abs/2506.08279)|null| 86 | |**2025-06-09**|**Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation**|Rui Hu et.al.|[2506.07646](http://arxiv.org/abs/2506.07646)|null| 87 | |**2025-06-07**|**SynHate: Detecting Hate Speech in Synthetic Deepfake Audio**|Rishabh Ranjan et.al.|[2506.06772](http://arxiv.org/abs/2506.06772)|null| 88 | |**2025-06-09**|**Voice Impression Control in Zero-Shot TTS**|Keinichi Fujita et.al.|[2506.05688](http://arxiv.org/abs/2506.05688)|null| 89 | |**2025-06-05**|**Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning**|Hien Ohnaka et.al.|[2506.04527](http://arxiv.org/abs/2506.04527)|null| 90 | |**2025-06-04**|**Can we reconstruct a dysarthric voice with the large speech model Parler TTS?**|Ariadna Sanchez et.al.|[2506.04397](http://arxiv.org/abs/2506.04397)|null| 91 | |**2025-06-04**|**HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset**|Ryan Langman et.al.|[2506.04152](http://arxiv.org/abs/2506.04152)|null| 92 | |**2025-06-04**|**UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation**|Jinting Wang et.al.|[2506.04134](http://arxiv.org/abs/2506.04134)|null| 93 | |**2025-06-04**|**A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions**|Chung-Chun Wang et.al.|[2506.04077](http://arxiv.org/abs/2506.04077)|null| 94 | |**2025-06-04**|**Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages**|Utkarsh Pathak et.al.|[2506.03884](http://arxiv.org/abs/2506.03884)|null| 95 | |**2025-06-04**|**Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts**|Sidharth Pulipaka et.al.|[2506.03793](http://arxiv.org/abs/2506.03793)|null| 96 | |**2025-06-04**|**BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing**|Masaya Kawamura et.al.|[2506.03515](http://arxiv.org/abs/2506.03515)|null| 97 | |**2025-06-03**|**Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation**|Yongqi Wang et.al.|[2506.02997](http://arxiv.org/abs/2506.02997)|null| 98 | |**2025-06-03**|**Towards a Japanese Full-duplex Spoken Dialogue System**|Atsumoto Ohashi et.al.|[2506.02979](http://arxiv.org/abs/2506.02979)|null| 99 | |**2025-06-03**|**CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech**|Helin Wang et.al.|[2506.02863](http://arxiv.org/abs/2506.02863)|null| 100 | |**2025-06-03**|**Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions**|Xiaoxue Gao et.al.|[2506.02742](http://arxiv.org/abs/2506.02742)|null| 101 | |**2025-06-02**|**SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction**|Saurabh Agrawal et.al.|[2506.02082](http://arxiv.org/abs/2506.02082)|null| 102 | |**2025-06-02**|**Zero-Shot Text-to-Speech for Vietnamese**|Thi Vu et.al.|[2506.01322](http://arxiv.org/abs/2506.01322)|null| 103 | |**2025-06-02**|**CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction**|Yudong Lu et.al.|[2506.01268](http://arxiv.org/abs/2506.01268)|null| 104 | |**2025-06-02**|**WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing**|Yu Nakagome et.al.|[2506.01263](http://arxiv.org/abs/2506.01263)|null| 105 | |**2025-06-01**|**DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation**|Ming Meng et.al.|[2506.01020](http://arxiv.org/abs/2506.01020)|null| 106 | |**2025-06-01**|**Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models**|Kyowoon Lee et.al.|[2506.00832](http://arxiv.org/abs/2506.00832)|null| 107 | |**2025-05-30**|**Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation**|Wenrui Liu et.al.|[2505.24496](http://arxiv.org/abs/2505.24496)|null| 108 | |**2025-05-30**|**DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec**|Peijie Chen et.al.|[2505.24314](http://arxiv.org/abs/2505.24314)|null| 109 | |**2025-05-29**|**Can Emotion Fool Anti-spoofing?**|Aurosweta Mahapatra et.al.|[2505.23962](http://arxiv.org/abs/2505.23962)|null| 110 | |**2025-05-29**|**Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes**|Neta Glazer et.al.|[2505.23619](http://arxiv.org/abs/2505.23619)|null| 111 | |**2025-05-29**|**EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge**|Ruskin Raj Manku et.al.|[2505.23009](http://arxiv.org/abs/2505.23009)|**[link](https://github.com/boson-ai/emergenttts-eval-public)**| 112 | |**2025-05-29**|**LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting**|Pai Zhu et.al.|[2505.22995](http://arxiv.org/abs/2505.22995)|null| 113 | |**2025-05-28**|**Tell me Habibi, is it Real or Fake?**|Kartik Kuckreja et.al.|[2505.22581](http://arxiv.org/abs/2505.22581)|null| 114 | |**2025-05-28**|**A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity**|Charlotte Pouw et.al.|[2505.22236](http://arxiv.org/abs/2505.22236)|null| 115 | |**2025-05-27**|**Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech**|Nam-Gyu Kim et.al.|[2505.20868](http://arxiv.org/abs/2505.20868)|null| 116 | |**2025-05-26**|**Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling**|Qixi Zheng et.al.|[2505.19931](http://arxiv.org/abs/2505.19931)|null| 117 | |**2025-05-26**|**DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech**|Deok-Hyeon Cho et.al.|[2505.19687](http://arxiv.org/abs/2505.19687)|null| 118 | |**2025-05-26**|**KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization**|Zhaolin Li et.al.|[2505.19679](http://arxiv.org/abs/2505.19679)|null| 119 | |**2025-05-26**|**Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling**|Haiyang Sun et.al.|[2505.19669](http://arxiv.org/abs/2505.19669)|null| 120 | |**2025-05-26**|**Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment**|Jeongsoo Choi et.al.|[2505.19595](http://arxiv.org/abs/2505.19595)|**[link](https://github.com/zhikangniu/a-dma)**| 121 | |**2025-05-25**|**SpeakStream: Streaming Text-to-Speech with Interleaved Data**|Richard He Bai et.al.|[2505.19206](http://arxiv.org/abs/2505.19206)|null| 122 | |**2025-05-25**|**CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning**|Renyuan Li et.al.|[2505.19119](http://arxiv.org/abs/2505.19119)|null| 123 | |**2025-05-25**|**Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis**|Minsu Kim et.al.|[2505.18972](http://arxiv.org/abs/2505.18972)|null| 124 | |**2025-05-27**|**RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations**|Ashwin Sankar et.al.|[2505.18609](http://arxiv.org/abs/2505.18609)|null| 125 | |**2025-05-24**|**MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt**|Zhichao Wu et.al.|[2505.18453](http://arxiv.org/abs/2505.18453)|null| 126 | |**2025-05-23**|**What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection**|Binh Nguyen et.al.|[2505.17513](http://arxiv.org/abs/2505.17513)|null| 127 | |**2025-05-23**|**UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information**|Rui Wang et.al.|[2505.17426](http://arxiv.org/abs/2505.17426)|null| 128 | |**2025-05-23**|**Speechless: Speech Instruction Training Without Speech for Low Resource Languages**|Alan Dao et.al.|[2505.17417](http://arxiv.org/abs/2505.17417)|null| 129 | |**2025-05-22**|**Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2**|Zackary Rackauckas et.al.|[2505.17320](http://arxiv.org/abs/2505.17320)|null| 130 | |**2025-05-21**|**Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech**|Yejin Lee et.al.|[2505.17093](http://arxiv.org/abs/2505.17093)|null| 131 | |**2025-05-22**|**From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition**|Tianduo Wang et.al.|[2505.16972](http://arxiv.org/abs/2505.16972)|**[link](https://github.com/tianduowang/speech-bt)**| 132 | |**2025-05-21**|**MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling**|Yifan Cheng et.al.|[2505.15772](http://arxiv.org/abs/2505.15772)|null| 133 | |**2025-05-21**|**Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information**|Nicholas Sanders et.al.|[2505.15667](http://arxiv.org/abs/2505.15667)|null| 134 | |**2025-05-21**|**Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models**|Zirui Song et.al.|[2505.15406](http://arxiv.org/abs/2505.15406)|**[link](https://github.com/mbzuai-nlp/audiojailbreak)**| 135 | |**2025-05-20**|**FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation**|Yutong Liu et.al.|[2505.14351](http://arxiv.org/abs/2505.14351)|null| 136 | |**2025-05-21**|**AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models**|Guangke Chen et.al.|[2505.14103](http://arxiv.org/abs/2505.14103)|null| 137 | |**2025-05-20**|**SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement**|Kuan-Yu Chen et.al.|[2505.14066](http://arxiv.org/abs/2505.14066)|null| 138 | |**2025-05-22**|**Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising**|Ye-Xin Lu et.al.|[2505.13830](http://arxiv.org/abs/2505.13830)|null| 139 | |**2025-05-19**|**OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching**|Hieu-Nghia Huynh-Nguyen et.al.|[2505.12800](http://arxiv.org/abs/2505.12800)|null| 140 | |**2025-05-18**|**Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis**|Dong Yang et.al.|[2505.12226](http://arxiv.org/abs/2505.12226)|null| 141 | |**2025-05-16**|**Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese**|Xihuai Wang et.al.|[2505.11200](http://arxiv.org/abs/2505.11200)|null| 142 | |**2025-05-16**|**BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset**|Istiaq Ahmed Fahad et.al.|[2505.10885](http://arxiv.org/abs/2505.10885)|**[link](https://github.com/KamruzzamanAsif/BanglaFake)**| 143 | |**2025-05-15**|**UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech**|Jiaxuan Liu et.al.|[2505.10599](http://arxiv.org/abs/2505.10599)|null| 144 | |**2025-05-12**|**MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder**|Bowen Zhang et.al.|[2505.07916](http://arxiv.org/abs/2505.07916)|null| 145 | |**2025-05-12**|**Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications**|Biel Tura Vecino et.al.|[2505.07701](http://arxiv.org/abs/2505.07701)|null| 146 | |**2025-05-10**|**VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback**|Eason Chen et.al.|[2505.06676](http://arxiv.org/abs/2505.06676)|null| 147 | |**2025-05-10**|**Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation**|Abbas Bertina et.al.|[2505.06599](http://arxiv.org/abs/2505.06599)|null| 148 | |**2025-05-15**|**FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech**|Linhan Ma et.al.|[2505.05159](http://arxiv.org/abs/2505.05159)|null| 149 | |**2025-05-08**|**Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations**|Linrong Pan et.al.|[2505.05056](http://arxiv.org/abs/2505.05056)|null| 150 | |**2025-05-08**|**A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration**|Shaja Arul Selvamani et.al.|[2505.04885](http://arxiv.org/abs/2505.04885)|null| 151 | |**2025-05-07**|**Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment**|Xueyao Zhang et.al.|[2505.04113](http://arxiv.org/abs/2505.04113)|null| 152 | |**2025-05-06**|**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**|Zuwei Long et.al.|[2505.03739](http://arxiv.org/abs/2505.03739)|**[link](https://github.com/vita-mllm/vita-audio)**| 153 | |**2025-05-05**|**Generating Narrated Lecture Videos from Slides with Synchronized Highlights**|Alexander Holmberg et.al.|[2505.02966](http://arxiv.org/abs/2505.02966)|null| 154 | |**2025-05-05**|**Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play**|Yemin Shi et.al.|[2505.02707](http://arxiv.org/abs/2505.02707)|**[link](https://github.com/maitrix-org/voila)**| 155 | |**2025-04-30**|**Sadeed: Advancing Arabic Diacritization Through Small Language Model**|Zeina Aldallal et.al.|[2504.21635](http://arxiv.org/abs/2504.21635)|null| 156 | |**2025-04-29**|**ClonEval: An Open Voice Cloning Benchmark**|Iwona Christop et.al.|[2504.20581](http://arxiv.org/abs/2504.20581)|null| 157 | |**2025-05-02**|**Towards Flow-Matching-based TTS without Classifier-Free Guidance**|Yuzhe Liang et.al.|[2504.20334](http://arxiv.org/abs/2504.20334)|null| 158 | |**2025-04-27**|**Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget**|Xin Li et.al.|[2504.19146](http://arxiv.org/abs/2504.19146)|**[link](https://github.com/MYZY-AI/Muyan-TTS)**| 159 | |**2025-04-22**|**A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models**|Gengxian Cao et.al.|[2504.15552](http://arxiv.org/abs/2504.15552)|null| 160 | |**2025-04-18**|**ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents**|Takuya Sera et.al.|[2504.13793](http://arxiv.org/abs/2504.13793)|null| 161 | |**2025-04-22**|**EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting**|Guanrou Yang et.al.|[2504.12867](http://arxiv.org/abs/2504.12867)|null| 162 | |**2025-04-15**|**GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture**|Yaodong Song et.al.|[2504.12339](http://arxiv.org/abs/2504.12339)|null| 163 | |**2025-04-15**|**Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation**|Yan Rong et.al.|[2504.11002](http://arxiv.org/abs/2504.11002)|null| 164 | |**2025-04-15**|**Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy**|Botao Zhao et.al.|[2504.10819](http://arxiv.org/abs/2504.10819)|null| 165 | |**2025-04-14**|**Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis**|Yifan Yang et.al.|[2504.10352](http://arxiv.org/abs/2504.10352)|null| 166 | |**2025-04-14**|**AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis**|Dan Luo et.al.|[2504.10309](http://arxiv.org/abs/2504.10309)|null| 167 | |**2025-04-11**|**Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation**|Haowei Lou et.al.|[2504.08274](http://arxiv.org/abs/2504.08274)|null| 168 | |**2025-04-10**|**Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis**|Yizhong Geng et.al.|[2504.07858](http://arxiv.org/abs/2504.07858)|null| 169 | |**2025-04-10**|**SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow**|Kaidi Wang et.al.|[2504.07776](http://arxiv.org/abs/2504.07776)|null| 170 | |**2025-04-08**|**AVENet: Disentangling Features by Approximating Average Features for Voice Conversion**|Wenyu Wang et.al.|[2504.05833](http://arxiv.org/abs/2504.05833)|null| 171 | |**2025-04-07**|**SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation**|Stephen Brade et.al.|[2504.05106](http://arxiv.org/abs/2504.05106)|null| 172 | |**2025-04-04**|**RWKVTTS: Yet another TTS based on RWKV-7**|Lin yueyu et.al.|[2504.03289](http://arxiv.org/abs/2504.03289)|**[link](https://github.com/yynil/rwkvtts)**| 173 | |**2025-04-09**|**F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization**|Xiaohui Sun et.al.|[2504.02407](http://arxiv.org/abs/2504.02407)|null| 174 | |**2025-04-02**|**TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection**|Zhiming Ma et.al.|[2503.24115](http://arxiv.org/abs/2503.24115)|**[link](https://github.com/jimmyma99/teleantifraud)**| 175 | |**2025-03-30**|**Speculative End-Turn Detector for Efficient Speech Chatbot Assistant**|Hyunjong Ok et.al.|[2503.23439](http://arxiv.org/abs/2503.23439)|null| 176 | |**2025-03-29**|**SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System**|Hyeongju Kim et.al.|[2503.23108](http://arxiv.org/abs/2503.23108)|null| 177 | |**2025-03-26**|**Dual Audio-Centric Modality Coupling for Talking Head Generation**|Ao Fu et.al.|[2503.22728](http://arxiv.org/abs/2503.22728)|null| 178 | |**2025-03-28**|**DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation**|Haomin Zhang et.al.|[2503.22265](http://arxiv.org/abs/2503.22265)|null| 179 | |**2025-03-26**|**Text-Driven Voice Conversion via Latent State-Space Modeling**|Wen Li et.al.|[2503.20999](http://arxiv.org/abs/2503.20999)|null| 180 | |**2025-03-28**|**FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System**|Hao-Han Guo et.al.|[2503.20499](http://arxiv.org/abs/2503.20499)|null| 181 | |**2025-03-21**|**Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication**|Yiwen Xu et.al.|[2503.17479](http://arxiv.org/abs/2503.17479)|null| 182 | |**2025-03-10**|**VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection**|Kunal Chavan et.al.|[2503.16488](http://arxiv.org/abs/2503.16488)|null| 183 | |**2025-03-19**|**MoonCast: High-Quality Zero-Shot Podcast Generation**|Zeqian Ju et.al.|[2503.14345](http://arxiv.org/abs/2503.14345)|**[link](https://github.com/jzq2000/mooncast)**| 184 | |**2025-03-14**|**MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation**|Sungwoo Cho et.al.|[2503.11026](http://arxiv.org/abs/2503.11026)|null| 185 | |**2025-03-11**|**An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR**|Sewade Ogun et.al.|[2503.08954](http://arxiv.org/abs/2503.08954)|null| 186 | |**2025-03-03**|**Direct Speech to Speech Translation: A Review**|Mohammad Sarim et.al.|[2503.04799](http://arxiv.org/abs/2503.04799)|null| 187 | |**2025-03-06**|**LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM**|Sambal Shikhar et.al.|[2503.04724](http://arxiv.org/abs/2503.04724)|null| 188 | |**2025-03-06**|**Scaling Rich Style-Prompted Text-to-Speech Datasets**|Anuj Diwan et.al.|[2503.04713](http://arxiv.org/abs/2503.04713)|**[link](https://github.com/ajd12342/paraspeechcaps)**| 189 | |**2025-03-04**|**InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training**|Dingdong Wang et.al.|[2503.02769](http://arxiv.org/abs/2503.02769)|null| 190 | |**2025-03-03**|**Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens**|Xinsheng Wang et.al.|[2503.01710](http://arxiv.org/abs/2503.01710)|**[link](https://github.com/sparkaudio/spark-tts)**| 191 | |**2025-03-02**|**UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation**|Alexander H. Liu et.al.|[2503.00733](http://arxiv.org/abs/2503.00733)|null| 192 | |**2025-03-12**|**Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale**|Max M. Lang et.al.|[2502.20140](http://arxiv.org/abs/2502.20140)|null| 193 | |**2025-02-26**|**Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis**|Ziyue Jiang et.al.|[2502.18924](http://arxiv.org/abs/2502.18924)|null| 194 | |**2025-03-08**|**Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding**|Tianyun Liu et.al.|[2502.18889](http://arxiv.org/abs/2502.18889)|null| 195 | |**2025-02-24**|**Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM**|Jiatong Shi et.al.|[2502.16897](http://arxiv.org/abs/2502.16897)|null| 196 | |**2025-02-17**|**NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing**|Yifan Liang et.al.|[2502.12002](http://arxiv.org/abs/2502.12002)|null| 197 | |**2025-02-16**|**SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer**|Zhengyan Sheng et.al.|[2502.11094](http://arxiv.org/abs/2502.11094)|null| 198 | |**2025-02-14**|**VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect**|Qingyuan Fei et.al.|[2502.10329](http://arxiv.org/abs/2502.10329)|null| 199 | |**2025-02-13**|**TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument**|Kyungsu Kim et.al.|[2502.08939](http://arxiv.org/abs/2502.08939)|**[link](https://github.com/kyungsukim42/tokensynth)**| 200 | |**2025-03-02**|**ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech**|Xin Wang et.al.|[2502.08857](http://arxiv.org/abs/2502.08857)|null| 201 | |**2025-02-11**|**LoRP-TTS: Low-Rank Personalized Text-To-Speech**|Łukasz Bondaruk et.al.|[2502.07562](http://arxiv.org/abs/2502.07562)|null| 202 | |**2025-02-11**|**Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction**|Leying Zhang et.al.|[2502.07345](http://arxiv.org/abs/2502.07345)|null| 203 | |**2025-02-11**|**Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement**|Xueyao Zhang et.al.|[2502.07243](http://arxiv.org/abs/2502.07243)|null| 204 | |**2025-02-10**|**Synthetic Audio Helps for Cognitive State Tasks**|Adil Soubki et.al.|[2502.06922](http://arxiv.org/abs/2502.06922)|**[link](https://github.com/adil-soubki/sad-training)**| 205 | |**2025-02-19**|**Speech to Speech Translation with Translatotron: A State of the Art Review**|Jules R. Kala et.al.|[2502.05980](http://arxiv.org/abs/2502.05980)|null| 206 | |**2025-02-09**|**BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting**|Mohammad Jahid Ibna Basher et.al.|[2502.05729](http://arxiv.org/abs/2502.05729)|null| 207 | |**2025-02-08**|**Gender Bias in Instruction-Guided Speech Synthesis Models**|Chun-Yi Kuan et.al.|[2502.05649](http://arxiv.org/abs/2502.05649)|null| 208 | |**2025-02-08**|**IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System**|Wei Deng et.al.|[2502.05512](http://arxiv.org/abs/2502.05512)|null| 209 | |**2025-02-05**|**Metis: A Foundation Speech Generation Model with Masked Generative Pre-training**|Yuancheng Wang et.al.|[2502.03128](http://arxiv.org/abs/2502.03128)|**[link](https://github.com/open-mmlab/amphion)**| 210 | |**2025-02-05**|**Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech**|Jixun Yao et.al.|[2502.02950](http://arxiv.org/abs/2502.02950)|null| 211 | |**2025-02-04**|**Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet**|Shenran Wang et.al.|[2502.02703](http://arxiv.org/abs/2502.02703)|null| 212 | |**2025-02-04**|**Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation**|Peidong Wang et.al.|[2502.02683](http://arxiv.org/abs/2502.02683)|null| 213 | |**2025-02-02**|**EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis**|Junuk Cha et.al.|[2502.00654](http://arxiv.org/abs/2502.00654)|null| 214 | |**2025-01-31**|**VisualSpeech: Enhance Prosody with Visual Context in TTS**|Shumin Que et.al.|[2501.19258](http://arxiv.org/abs/2501.19258)|null| 215 | |**2025-01-29**|**BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights**|Chan-Jan Hsu et.al.|[2501.17790](http://arxiv.org/abs/2501.17790)|null| 216 | |**2025-01-28**|**Compact Neural TTS Voices for Accessibility**|Kunal Jain et.al.|[2501.17332](http://arxiv.org/abs/2501.17332)|null| 217 | |**2025-01-26**|**Overview of the Amphion Toolkit (v0.2)**|Jiaqi Li et.al.|[2501.15442](http://arxiv.org/abs/2501.15442)|**[link](https://github.com/open-mmlab/amphion)**| 218 | |**2025-01-24**|**Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models**|Tianrui Wang et.al.|[2501.14273](http://arxiv.org/abs/2501.14273)|null| 219 | |**2025-01-24**|**Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation**|Wen Huang et.al.|[2501.14240](http://arxiv.org/abs/2501.14240)|null| 220 | |**2025-01-24**|**LoCoML: A Framework for Real-World ML Inference Pipelines**|Kritin Maddireddy et.al.|[2501.14165](http://arxiv.org/abs/2501.14165)|null| 221 | |**2025-01-23**|**Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement**|Jae-Sung Bae et.al.|[2501.13372](http://arxiv.org/abs/2501.13372)|null| 222 | |**2025-01-21**|**A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data**|Minh Tran et.al.|[2501.12501](http://arxiv.org/abs/2501.12501)|null| 223 | |**2025-01-15**|**Speech Synthesis along Perceptual Voice Quality Dimensions**|Frederik Rautenberg et.al.|[2501.08791](http://arxiv.org/abs/2501.08791)|null| 224 | |**2025-01-15**|**Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification**|Li Zhang et.al.|[2501.08691](http://arxiv.org/abs/2501.08691)|null| 225 | |**2025-01-15**|**Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement**|Qianniu Chen et.al.|[2501.08566](http://arxiv.org/abs/2501.08566)|null| 226 | |**2025-01-19**|**MathReader : Text-to-Speech for Mathematical Documents**|Sieun Hyeon et.al.|[2501.07088](http://arxiv.org/abs/2501.07088)|**[link](https://github.com/hyeonsieun/mathreader)**| 227 | |**2025-01-10**|**TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer**|Vladimir Bataev et.al.|[2501.06320](http://arxiv.org/abs/2501.06320)|null| 228 | |**2025-01-10**|**MinMo: A Multimodal Large Language Model for Seamless Voice Interaction**|Qian Chen et.al.|[2501.06282](http://arxiv.org/abs/2501.06282)|null| 229 | |**2025-01-10**|**PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control**|Shaozuo Zhang et.al.|[2501.06276](http://arxiv.org/abs/2501.06276)|null| 230 | |**2025-01-10**|**Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron**|Kishor Kayyar Lakshminarayana et.al.|[2501.05976](http://arxiv.org/abs/2501.05976)|null| 231 | |**2025-01-10**|**MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model**|Matthew Baas et.al.|[2501.05787](http://arxiv.org/abs/2501.05787)|null| 232 | |**2025-01-09**|**Probing Speaker-specific Features in Speaker Representations**|Aemon Yat Fei Chiu et.al.|[2501.05310](http://arxiv.org/abs/2501.05310)|null| 233 | |**2025-01-08**|**Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model**|Sanjana Sankar et.al.|[2501.04799](http://arxiv.org/abs/2501.04799)|null| 234 | |**2025-01-08**|**DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions**|Weidong Chen et.al.|[2501.04256](http://arxiv.org/abs/2501.04256)|null| 235 | |**2025-01-02**|**FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles**|Tian-Hao Zhang et.al.|[2501.03181](http://arxiv.org/abs/2501.03181)|null| 236 | |**2025-01-02**|**RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer**|Seongho Hong et.al.|[2501.01182](http://arxiv.org/abs/2501.01182)|**[link](https://github.com/seongho608/ringformer)**| 237 | |**2025-01-02**|**Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT**|Dongyang Dai et.al.|[2501.01102](http://arxiv.org/abs/2501.01102)|null| 238 | |**2025-01-06**|**Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study**|Mykola Maslych et.al.|[2501.00168](http://arxiv.org/abs/2501.00168)|null| 239 | |**2024-12-28**|**Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting**|Wooseok Han et.al.|[2412.20155](http://arxiv.org/abs/2412.20155)|null| 240 | |**2024-12-26**|**"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities**|Jiawei Yu et.al.|[2412.19102](http://arxiv.org/abs/2412.19102)|null| 241 | |**2024-12-26**|**Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID**|Ahmad Alfani Handoyo et.al.|[2412.19043](http://arxiv.org/abs/2412.19043)|null| 242 | |**2024-12-25**|**Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset**|Neil Shah et.al.|[2412.18839](http://arxiv.org/abs/2412.18839)|null| 243 | |**2024-12-24**|**GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing**|Wen Ku et.al.|[2412.18300](http://arxiv.org/abs/2412.18300)|null| 244 | |**2024-12-22**|**Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective**|Hankun Wang et.al.|[2412.17048](http://arxiv.org/abs/2412.17048)|null| 245 | |**2024-12-22**|**Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis**|Ye-Xin Lu et.al.|[2412.16977](http://arxiv.org/abs/2412.16977)|null| 246 | |**2024-12-22**|**Autoregressive Speech Synthesis with Next-Distribution Prediction**|Xinfa Zhu et.al.|[2412.16846](http://arxiv.org/abs/2412.16846)|null| 247 | |**2024-12-23**|**Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers**|Yifan Yang et.al.|[2412.16102](http://arxiv.org/abs/2412.16102)|null| 248 | |**2024-12-19**|**Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling**|Leying Zhang et.al.|[2412.14890](http://arxiv.org/abs/2412.14890)|null| 249 | |**2024-12-17**|**Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge**|Mahieyin Rahmun et.al.|[2412.13279](http://arxiv.org/abs/2412.13279)|**[link](https://github.com/AGenCyLab/SPCUP2022)**| 250 | |**2024-12-17**|**Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion**|Syed Zohaib Hassan et.al.|[2412.12710](http://arxiv.org/abs/2412.12710)|null| 251 | |**2024-12-17**|**Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes**|Kuiyuan Zhang et.al.|[2412.12619](http://arxiv.org/abs/2412.12619)|null| 252 | |**2024-12-17**|**Hierarchical Control of Emotion Rendering in Speech Synthesis**|Sho Inoue et.al.|[2412.12498](http://arxiv.org/abs/2412.12498)|**[link](https://github.com/shinshoji01/hed-project-page)**| 253 | |**2024-12-19**|**ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis**|Xiangheng He et.al.|[2412.11795](http://arxiv.org/abs/2412.11795)|null| 254 | |**2024-12-17**|**Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech**|Rui Liu et.al.|[2412.11409](http://arxiv.org/abs/2412.11409)|**[link](https://github.com/ai-s2-lab/m2se-vtts)**| 255 | |**2024-12-16**|**Efficient Generative Modeling with Residual Vector Quantization-Based Tokens**|Jaehyeon Kim et.al.|[2412.10208](http://arxiv.org/abs/2412.10208)|null| 256 | |**2024-12-13**|**AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation**|Xiyuan Gao et.al.|[2412.10103](http://arxiv.org/abs/2412.10103)|null| 257 | |**2024-12-13**|**CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder**|Jianwei Cui et.al.|[2412.08918](http://arxiv.org/abs/2412.08918)|null| 258 | |**2024-12-11**|**Multimodal Latent Language Modeling with Next-Token Diffusion**|Yutao Sun et.al.|[2412.08635](http://arxiv.org/abs/2412.08635)|**[link](https://github.com/microsoft/unilm/tree/master/LatentLM)**| 259 | |**2024-12-11**|**A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction**|Sowmya Cheripally et.al.|[2412.08312](http://arxiv.org/abs/2412.08312)|null| 260 | |**2024-12-11**|**A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings**|Anindita Mondal et.al.|[2412.08283](http://arxiv.org/abs/2412.08283)|null| 261 | |**2024-12-11**|**LatentSpeech: Latent Diffusion for Text-To-Speech Generation**|Haowei Lou et.al.|[2412.08117](http://arxiv.org/abs/2412.08117)|null| 262 | |**2024-12-11**|**Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration**|Haowei Lou et.al.|[2412.08112](http://arxiv.org/abs/2412.08112)|null| 263 | |**2024-12-09**|**Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey**|Tianxin Xie et.al.|[2412.06602](http://arxiv.org/abs/2412.06602)|**[link](https://github.com/imxtx/awesome-controllabe-speech-synthesis)**| 264 | |**2024-12-12**|**EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations**|Weizhen Bian et.al.|[2412.06581](http://arxiv.org/abs/2412.06581)|null| 265 | |**2024-12-01**|**Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor**|Ashwin Baluja et.al.|[2412.05315](http://arxiv.org/abs/2412.05315)|null| 266 | |**2024-12-04**|**DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles**|Jiaxuan Liu et.al.|[2412.03388](http://arxiv.org/abs/2412.03388)|null| 267 | |**2024-12-03**|**GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot**|Aohan Zeng et.al.|[2412.02612](http://arxiv.org/abs/2412.02612)|**[link](https://github.com/thudm/glm-4-voice)**| 268 | |**2024-11-19**|**A Context-Based Numerical Format Prediction for a Text-To-Speech System**|Yaser Darwesh et.al.|[2412.00028](http://arxiv.org/abs/2412.00028)|null| 269 | |**2024-11-27**|**Continual Learning in Machine Speech Chain Using Gradient Episodic Memory**|Geoffrey Tyndall et.al.|[2411.18320](http://arxiv.org/abs/2411.18320)|null| 270 | |**2024-11-27**|**SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation**|Wenyi Yu et.al.|[2411.18138](http://arxiv.org/abs/2411.18138)|null| 271 | |**2024-11-26**|**WavChat: A Survey of Spoken Dialogue Models**|Shengpeng Ji et.al.|[2411.13577](http://arxiv.org/abs/2411.13577)|**[link](https://github.com/jishengpeng/wavchat)**| 272 | |**2024-12-02**|**I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception**|Jiawei Zhang et.al.|[2411.13314](http://arxiv.org/abs/2411.13314)|null| 273 | |**2024-11-20**|**Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM**|Jiawei Yu et.al.|[2411.13159](http://arxiv.org/abs/2411.13159)|null| 274 | |**2024-11-19**|**Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation**|Praveen Srinivasa Varadhan et.al.|[2411.12719](http://arxiv.org/abs/2411.12719)|null| 275 | |**2024-11-19**|**Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D**|Adithya TG et.al.|[2411.12619](http://arxiv.org/abs/2411.12619)|null| 276 | |**2024-11-18**|**ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram**|Xiao-Hang Jiang et.al.|[2411.11258](http://arxiv.org/abs/2411.11258)|null| 277 | |**2024-11-12**|**Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models**|Dongrui Han et.al.|[2411.07563](http://arxiv.org/abs/2411.07563)|null| 278 | |**2024-11-11**|**Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities**|Snehasish Paul Shivali Chauhan et.al.|[2411.06970](http://arxiv.org/abs/2411.06970)|null| 279 | |**2024-11-10**|**Debatts: Zero-Shot Debating Text-to-Speech Synthesis**|Yiqiao Huang et.al.|[2411.06540](http://arxiv.org/abs/2411.06540)|null| 280 | |**2024-11-07**|**CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR**|Kadir Burak Buldu et.al.|[2411.04671](http://arxiv.org/abs/2411.04671)|null| 281 | |**2024-11-04**|**EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector**|Deok-Hyeon Cho et.al.|[2411.02625](http://arxiv.org/abs/2411.02625)|**[link](https://github.com/Choddeok/EmoSpherepp)**| 282 | |**2024-11-09**|**Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis**|Shijia Liao et.al.|[2411.01156](http://arxiv.org/abs/2411.01156)|**[link](https://github.com/fishaudio/fish-speech)**| 283 | |**2024-10-31**|**Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?**|Ioannis Tsiamas et.al.|[2410.24019](http://arxiv.org/abs/2410.24019)|null| 284 | |**2024-10-30**|**Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis**|Théodor Lemerle et.al.|[2410.23320](http://arxiv.org/abs/2410.23320)|**[link](https://github.com/theodorblackbird/lina-speech)**| 285 | |**2024-10-29**|**Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech**|Eric Battenberg et.al.|[2410.22179](http://arxiv.org/abs/2410.22179)|null| 286 | |**2024-10-29**|**Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding**|Bohan Li et.al.|[2410.21951](http://arxiv.org/abs/2410.21951)|null| 287 | |**2024-10-29**|**RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis**|Kehan Sui et.al.|[2410.21641](http://arxiv.org/abs/2410.21641)|null| 288 | |**2024-10-28**|**Asynchronous Tool Usage for Real-Time Agents**|Antonio A. Ginart et.al.|[2410.21620](http://arxiv.org/abs/2410.21620)|null| 289 | |**2024-10-28**|**Enhancing TTS Stability in Hebrew using Discrete Semantic Units**|Ella Zeldes et.al.|[2410.21502](http://arxiv.org/abs/2410.21502)|null| 290 | |**2024-10-28**|**Mitigating Unauthorized Speech Synthesis for Voice Protection**|Zhisheng Zhang et.al.|[2410.20742](http://arxiv.org/abs/2410.20742)|**[link](https://github.com/wxzyd123/pivotal_objective_perturbation)**| 291 | |**2024-10-27**|**Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation**|Maohao Shen et.al.|[2410.20336](http://arxiv.org/abs/2410.20336)|null| 292 | |**2024-10-24**|**Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis**|Suparna De et.al.|[2410.19199](http://arxiv.org/abs/2410.19199)|null| 293 | |**2024-10-24**|**STTATTS: Unified Speech-To-Text And Text-To-Speech Model**|Hawau Olamide Toyin et.al.|[2410.18607](http://arxiv.org/abs/2410.18607)|**[link](https://github.com/mbzuai-nlp/sttatts)**| 294 | |**2024-10-24**|**Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts**|ChaeHun Park et.al.|[2410.18444](http://arxiv.org/abs/2410.18444)|null| 295 | |**2024-10-23**|**ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams**|Srija Anand et.al.|[2410.17901](http://arxiv.org/abs/2410.17901)|null| 296 | |**2024-10-22**|**Continuous Speech Tokenizer in Text To Speech**|Yixing Li et.al.|[2410.17081](http://arxiv.org/abs/2410.17081)|null| 297 | |**2024-10-22**|**Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap**|Guanrou Yang et.al.|[2410.16726](http://arxiv.org/abs/2410.16726)|null| 298 | |**2024-10-21**|**Continuous Speech Synthesis using per-token Latent Diffusion**|Arnon Turetzky et.al.|[2410.16048](http://arxiv.org/abs/2410.16048)|null| 299 | |**2024-10-18**|**A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages**|Sujitha Sathiyamoorthy et.al.|[2410.14197](http://arxiv.org/abs/2410.14197)|null| 300 | |**2024-10-18**|**Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech**|Shuwei He et.al.|[2410.14101](http://arxiv.org/abs/2410.14101)|**[link](https://github.com/ms2ku-vtts/ms2ku-vtts)**| 301 | |**2024-10-17**|**Enhancing Crowdsourced Audio for Text-to-Speech Models**|José Giraldo et.al.|[2410.13357](http://arxiv.org/abs/2410.13357)|null| 302 | |**2024-10-17**|**DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech**|Jan Melechovsky et.al.|[2410.13342](http://arxiv.org/abs/2410.13342)|null| 303 | |**2024-10-17**|**DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2410.13288](http://arxiv.org/abs/2410.13288)|null| 304 | |**2024-10-17**|**Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation**|Sreyan Ghosh et.al.|[2410.13198](http://arxiv.org/abs/2410.13198)|null| 305 | |**2024-10-16**|**ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs**|Rui-Chen Zheng et.al.|[2410.12359](http://arxiv.org/abs/2410.12359)|null| 306 | |**2024-10-14**|**IsoChronoMeter: A simple and effective isochronic translation evaluation metric**|Nikolai Rozanov et.al.|[2410.11127](http://arxiv.org/abs/2410.11127)|null| 307 | |**2024-10-14**|**DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization**|Yingahao Aaron Li et.al.|[2410.11097](http://arxiv.org/abs/2410.11097)|null| 308 | |**2024-10-12**|**Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling**|Rui Liu et.al.|[2410.09524](http://arxiv.org/abs/2410.09524)|null| 309 | |**2024-10-10**|**Unsupervised Data Validation Methods for Efficient Model Training**|Yurii Paniv et.al.|[2410.07880](http://arxiv.org/abs/2410.07880)|null| 310 | |**2024-10-15**|**F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching**|Yushen Chen et.al.|[2410.06885](http://arxiv.org/abs/2410.06885)|**[link](https://github.com/SWivid/F5-TTS)**| 311 | |**2024-10-09**|**Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch**|Teodora Răgman et.al.|[2410.06787](http://arxiv.org/abs/2410.06787)|null| 312 | |**2024-10-09**|**Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS**|Onkar Kishor Susladkar et.al.|[2410.06608](http://arxiv.org/abs/2410.06608)|null| 313 | |**2024-10-09**|**Can DeepFake Speech be Reliably Detected?**|Hongbin Liu et.al.|[2410.06572](http://arxiv.org/abs/2410.06572)|null| 314 | |**2024-10-07**|**SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech**|Minchan Kim et.al.|[2410.04690](http://arxiv.org/abs/2410.04690)|null| 315 | |**2024-10-06**|**HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis**|Yuto Nishimura et.al.|[2410.04380](http://arxiv.org/abs/2410.04380)|null| 316 | |**2024-10-10**|**SONAR: A Synthetic AI-Audio Detection Framework and Benchmark**|Xiang Li et.al.|[2410.04324](http://arxiv.org/abs/2410.04324)|**[link](https://github.com/jessegator/sonar)**| 317 | |**2024-10-05**|**Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System**|Ze Li et.al.|[2410.04017](http://arxiv.org/abs/2410.04017)|null| 318 | |**2024-10-01**|**Recent Advances in Speech Language Models: A Survey**|Wenqian Cui et.al.|[2410.03751](http://arxiv.org/abs/2410.03751)|null| 319 | |**2024-10-04**|**Generative Semantic Communication for Text-to-Speech Synthesis**|Jiahao Zheng et.al.|[2410.03459](http://arxiv.org/abs/2410.03459)|null| 320 | |**2024-10-04**|**Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens**|Jinzheng Zhao et.al.|[2410.03298](http://arxiv.org/abs/2410.03298)|null| 321 | |**2024-10-04**|**Narrative Player: Reviving Data Narratives with Visuals**|Zekai Shao et.al.|[2410.03268](http://arxiv.org/abs/2410.03268)|null| 322 | |**2024-10-04**|**MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech**|Taejun Bak et.al.|[2410.03192](http://arxiv.org/abs/2410.03192)|null| 323 | |**2024-10-01**|**Augmentation through Laundering Attacks for Audio Spoof Detection**|Hashim Ali et.al.|[2410.01108](http://arxiv.org/abs/2410.01108)|null| 324 | |**2024-10-01**|**Zero-Shot Text-to-Speech from Continuous Text Streams**|Trung Dang et.al.|[2410.00767](http://arxiv.org/abs/2410.00767)|null| 325 | |**2024-10-01**|**EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control**|Haozhe Chen et.al.|[2410.00316](http://arxiv.org/abs/2410.00316)|**[link](https://github.com/tonychenxyz/emoknob)**| 326 | |**2024-09-30**|**Word-wise intonation model for cross-language TTS systems**|Tomilov A. A. et.al.|[2409.20374](http://arxiv.org/abs/2409.20374)|null| 327 | |**2024-09-27**|**Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech**|Youngjae Kim et.al.|[2409.18622](http://arxiv.org/abs/2409.18622)|null| 328 | |**2024-09-26**|**Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control**|Ryuichi Yamamoto et.al.|[2409.17452](http://arxiv.org/abs/2409.17452)|null| 329 | |**2024-09-25**|**Exploring synthetic data for cross-speaker style transfer in style representation based TTS**|Lucas H. Ueda et.al.|[2409.17364](http://arxiv.org/abs/2409.17364)|null| 330 | |**2024-09-25**|**Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions**|Kun Zhou et.al.|[2409.16681](http://arxiv.org/abs/2409.16681)|null| 331 | |**2024-09-25**|**Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation**|Siyin Wang et.al.|[2409.16644](http://arxiv.org/abs/2409.16644)|null| 332 | |**2024-09-24**|**FastTalker: Jointly Generating Speech and Conversational Gestures from Text**|Zixin Guo et.al.|[2409.16404](http://arxiv.org/abs/2409.16404)|null| 333 | |**2024-09-24**|**Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling**|Ville Heilala et.al.|[2409.16376](http://arxiv.org/abs/2409.16376)|null| 334 | |**2024-09-24**|**Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech**|Yunji Chu et.al.|[2409.16203](http://arxiv.org/abs/2409.16203)|null| 335 | |**2024-09-24**|**NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers**|Nohil Park et.al.|[2409.15760](http://arxiv.org/abs/2409.15760)|null| 336 | |**2024-09-24**|**VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance**|Jiheum Yeom et.al.|[2409.15759](http://arxiv.org/abs/2409.15759)|null| 337 | |**2024-09-24**|**StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis**|Zhiyong Chen et.al.|[2409.15741](http://arxiv.org/abs/2409.15741)|null| 338 | |**2024-09-23**|**A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**|Lam Pham et.al.|[2409.15180](http://arxiv.org/abs/2409.15180)|null| 339 | |**2024-09-23**|**LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation**|Hieu-Thi Luong et.al.|[2409.14743](http://arxiv.org/abs/2409.14743)|null| 340 | |**2024-09-20**|**Zero-shot Cross-lingual Voice Transfer for TTS**|Fadi Biadsy et.al.|[2409.13910](http://arxiv.org/abs/2409.13910)|null| 341 | |**2024-09-20**|**On the Feasibility of Fully AI-automated Vishing Attacks**|João Figueiredo et.al.|[2409.13793](http://arxiv.org/abs/2409.13793)|null| 342 | |**2024-09-19**|**Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space**|Sebastião Quintas et.al.|[2409.12745](http://arxiv.org/abs/2409.12745)|null| 343 | |**2024-09-19**|**Preference Alignment Improves Language Model-Based TTS**|Jinchuan Tian et.al.|[2409.12403](http://arxiv.org/abs/2409.12403)|null| 344 | |**2024-09-18**|**Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference**|Edresson Casanova et.al.|[2409.12117](http://arxiv.org/abs/2409.12117)|null| 345 | |**2024-09-18**|**Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems**|Anusha Prakash et.al.|[2409.11915](http://arxiv.org/abs/2409.11915)|null| 346 | |**2024-09-18**|**DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech**|Xin Qi et.al.|[2409.11835](http://arxiv.org/abs/2409.11835)|null| 347 | |**2024-09-18**|**Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation**|Haohan Guo et.al.|[2409.11630](http://arxiv.org/abs/2409.11630)|null| 348 | |**2024-09-17**|**SpMis: An Investigation of Synthetic Spoken Misinformation Detection**|Peizhuo Liu et.al.|[2409.11308](http://arxiv.org/abs/2409.11308)|null| 349 | |**2024-09-19**|**The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives**|Samee Arif et.al.|[2409.11261](http://arxiv.org/abs/2409.11261)|**[link](https://github.com/ulrs0/The-Art-of-Story-Telling)**| 350 | |**2024-09-17**|**Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora**|Francesco Nespoli et.al.|[2409.11107](http://arxiv.org/abs/2409.11107)|null| 351 | |**2024-09-16**|**Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization**|Xiaoxue Gao et.al.|[2409.10157](http://arxiv.org/abs/2409.10157)|null| 352 | |**2024-09-16**|**StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion**|Yinghao Aaron Li et.al.|[2409.10058](http://arxiv.org/abs/2409.10058)|null| 353 | |**2024-09-15**|**Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning**|Siqi Sun et.al.|[2409.09891](http://arxiv.org/abs/2409.09891)|null| 354 | |**2024-09-14**|**E1 TTS: Simple and Fast Non-Autoregressive TTS**|Zhijun Liu et.al.|[2409.09351](http://arxiv.org/abs/2409.09351)|null| 355 | |**2024-09-14**|**Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation**|Changjin Han et.al.|[2409.09311](http://arxiv.org/abs/2409.09311)|null| 356 | |**2024-09-14**|**SafeEar: Content Privacy-Preserving Audio Deepfake Detection**|Xinfeng Li et.al.|[2409.09272](http://arxiv.org/abs/2409.09272)|**[link](https://github.com/LetterLiGo/SafeEar)**| 357 | |**2024-09-13**|**AccentBox: Towards High-Fidelity Zero-Shot Accent Generation**|Jinzuomu Zhong et.al.|[2409.09098](http://arxiv.org/abs/2409.09098)|null| 358 | |**2024-09-17**|**HLTCOE JHU Submission to the Voice Privacy Challenge 2024**|Henry Li Xinyuan et.al.|[2409.08913](http://arxiv.org/abs/2409.08913)|null| 359 | |**2024-09-13**|**Text-To-Speech Synthesis In The Wild**|Jee-weon Jung et.al.|[2409.08711](http://arxiv.org/abs/2409.08711)|null| 360 | |**2024-09-14**|**Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions**|Amila Indika et.al.|[2409.07945](http://arxiv.org/abs/2409.07945)|null| 361 | |**2024-09-12**|**Full-text Error Correction for Chinese Speech Recognition with Large Language Model**|Zhiyuan Tang et.al.|[2409.07790](http://arxiv.org/abs/2409.07790)|null| 362 | |**2024-09-11**|**SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis**|Helin Wang et.al.|[2409.07556](http://arxiv.org/abs/2409.07556)|**[link](https://github.com/WangHelin1997/SSR-Speech)**| 363 | |**2024-09-11**|**D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack**|Hong-Hanh Nguyen-Le et.al.|[2409.07390](http://arxiv.org/abs/2409.07390)|null| 364 | |**2024-09-11**|**Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT**|Kazuki Yamauchi et.al.|[2409.07265](http://arxiv.org/abs/2409.07265)|null| 365 | |**2024-09-11**|**Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment**|Tien-Hong Lo et.al.|[2409.07151](http://arxiv.org/abs/2409.07151)|null| 366 | |**2024-09-10**|**Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models**|Xin Jing et.al.|[2409.06451](http://arxiv.org/abs/2409.06451)|null| 367 | |**2024-09-10**|**What happens to diffusion model likelihood when your model is conditional?**|Mattias Cross et.al.|[2409.06364](http://arxiv.org/abs/2409.06364)|null| 368 | |**2024-09-10**|**VoiceWukong: Benchmarking Deepfake Voice Detection**|Ziwei Yan et.al.|[2409.06348](http://arxiv.org/abs/2409.06348)|null| 369 | |**2024-09-09**|**AS-Speech: Adaptive Style For Speech Synthesis**|Zhipeng Li et.al.|[2409.05730](http://arxiv.org/abs/2409.05730)|null| 370 | |**2024-09-09**|**IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS**|Ashwin Sankar et.al.|[2409.05356](http://arxiv.org/abs/2409.05356)|**[link](https://github.com/ai4bharat/indicvoices-r)**| 371 | |**2024-09-10**|**Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion**|Zhengyang Chen et.al.|[2409.05004](http://arxiv.org/abs/2409.05004)|null| 372 | |**2024-09-01**|**Sample-Efficient Diffusion for Text-To-Speech Synthesis**|Justin Lovelace et.al.|[2409.03717](http://arxiv.org/abs/2409.03717)|**[link](https://github.com/justinlovelace/sesd)**| 373 | |**2024-09-10**|**LAST: Language Model Aware Speech Tokenization**|Arnon Turetzky et.al.|[2409.03701](http://arxiv.org/abs/2409.03701)|null| 374 | |**2024-09-05**|**FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications**|Hao-Han Guo et.al.|[2409.03283](http://arxiv.org/abs/2409.03283)|null| 375 | |**2024-09-04**|**Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems**|Jeongmin Liu et.al.|[2409.02517](http://arxiv.org/abs/2409.02517)|null| 376 | |**2024-09-03**|**VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka**|Li-Wei Chen et.al.|[2409.01548](http://arxiv.org/abs/2409.01548)|null| 377 | |**2024-09-02**|**A multilingual training strategy for low resource Text to Speech**|Asma Amalas et.al.|[2409.01217](http://arxiv.org/abs/2409.01217)|null| 378 | |**2024-09-02**|**A Framework for Synthetic Audio Conversations Generation using Large Language Models**|Kaung Myat Kyaw et.al.|[2409.00946](http://arxiv.org/abs/2409.00946)|null| 379 | |**2024-09-02**|**SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis**|Haohan Guo et.al.|[2409.00933](http://arxiv.org/abs/2409.00933)|**[link](https://github.com/hhguo/socodec)**| 380 | |**2024-09-01**|**MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer**|Yuancheng Wang et.al.|[2409.00750](http://arxiv.org/abs/2409.00750)|null| 381 | |**2024-08-30**|**SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection**|Ismail Rasim Ulgen et.al.|[2408.17432](http://arxiv.org/abs/2408.17432)|null| 382 | |**2024-08-30**|**AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge**|Kirill Borodin et.al.|[2408.17352](http://arxiv.org/abs/2408.17352)|null| 383 | |**2024-08-30**|**Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model**|Zhen Ye et.al.|[2408.17175](http://arxiv.org/abs/2408.17175)|**[link](https://github.com/zhenye234/xcodec)**| 384 | |**2024-08-30**|**Utilizing Speaker Profiles for Impersonation Audio Detection**|Hao Gu et.al.|[2408.17009](http://arxiv.org/abs/2408.17009)|null| 385 | |**2024-08-29**|**Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis**|Zehai Tu et.al.|[2408.16373](http://arxiv.org/abs/2408.16373)|null| 386 | |**2024-08-28**|**Multi-modal Adversarial Training for Zero-Shot Voice Cloning**|John Janiczek et.al.|[2408.15916](http://arxiv.org/abs/2408.15916)|null| 387 | |**2024-08-29**|**Easy, Interpretable, Effective: openSMILE for voice deepfake detection**|Octavian Pascu et.al.|[2408.15775](http://arxiv.org/abs/2408.15775)|null| 388 | |**2024-08-28**|**VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling**|Yixuan Zhou et.al.|[2408.15676](http://arxiv.org/abs/2408.15676)|**[link](https://github.com/thuhcsi/voxinstruct)**| 389 | |**2024-08-28**|**VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech**|Heeseung Kim et.al.|[2408.14739](http://arxiv.org/abs/2408.14739)|null| 390 | |**2024-08-27**|**StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech**|Haowei Lou et.al.|[2408.14713](http://arxiv.org/abs/2408.14713)|null| 391 | |**2024-08-27**|**DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance**|Jinhyeok Yang et.al.|[2408.14423](http://arxiv.org/abs/2408.14423)|null| 392 | |**2024-08-26**|**Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard**|Wonjune Kang et.al.|[2408.13970](http://arxiv.org/abs/2408.13970)|null| 393 | |**2024-08-28**|**SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2408.13893](http://arxiv.org/abs/2408.13893)|null| 394 | |**2024-08-22**|**Positional Description for Numerical Normalization**|Deepanshu Gupta et.al.|[2408.12430](http://arxiv.org/abs/2408.12430)|null| 395 | |**2024-08-22**|**VoiceX: A Text-To-Speech Framework for Custom Voices**|Silvan Mertes et.al.|[2408.12170](http://arxiv.org/abs/2408.12170)|null| 396 | |**2024-08-13**|**Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation**|Yinghao Aaron Li et.al.|[2408.11849](http://arxiv.org/abs/2408.11849)|null| 397 | |**2024-08-20**|**EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech**|Xin Qi et.al.|[2408.10852](http://arxiv.org/abs/2408.10852)|null| 398 | |**2024-08-20**|**SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS**|Karl El Hajal et.al.|[2408.10771](http://arxiv.org/abs/2408.10771)|null| 399 | |**2024-08-20**|**Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting**|Hyun Jin Park et.al.|[2408.10463](http://arxiv.org/abs/2408.10463)|null| 400 | |**2024-08-17**|**Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition**|Samuele Cornell et.al.|[2408.09215](http://arxiv.org/abs/2408.09215)|**[link](https://github.com/popcornell/ASRLightningFT)**| 401 | |**2024-08-14**|**PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation**|Sang-Hoon Lee et.al.|[2408.07547](http://arxiv.org/abs/2408.07547)|**[link](https://github.com/sh-lee-prml/periodwave)**| 402 | |**2024-08-13**|**SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis**|Osamu Take et.al.|[2408.06858](http://arxiv.org/abs/2408.06858)|**[link](https://github.com/sarulab-speech/saslaw)**| 403 | |**2024-08-13**|**PRESENT: Zero-Shot Text-to-Prosody Control**|Perry Lam et.al.|[2408.06827](http://arxiv.org/abs/2408.06827)|**[link](https://github.com/iamanigeeit/present)**| 404 | |**2024-08-12**|**FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks**|Min Ma et.al.|[2408.06227](http://arxiv.org/abs/2408.06227)|null| 405 | |**2024-08-11**|**VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing**|Chunyu Qiang et.al.|[2408.05758](http://arxiv.org/abs/2408.05758)|null| 406 | |**2024-08-06**|**Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training**|Hawraz A. Ahmad et.al.|[2408.03887](http://arxiv.org/abs/2408.03887)|null| 407 | |**2024-08-03**|**ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features**|Peng Cheng et.al.|[2408.01808](http://arxiv.org/abs/2408.01808)|**[link](https://github.com/TASER2023/TASER)**| 408 | |**2024-08-01**|**Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation**|Xinhan Di et.al.|[2408.00284](http://arxiv.org/abs/2408.00284)|null| 409 | |**2024-07-18**|**Handling Numeric Expressions in Automatic Speech Recognition**|Christian Huber et.al.|[2408.00004](http://arxiv.org/abs/2408.00004)|null| 410 | |**2024-07-31**|**On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition**|Nick Rossenbach et.al.|[2407.21476](http://arxiv.org/abs/2407.21476)|null| 411 | |**2024-07-29**|**Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks**|Mahmoud Salhab et.al.|[2407.18571](http://arxiv.org/abs/2407.18571)|null| 412 | |**2024-07-25**|**On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures**|Nick Rossenbach et.al.|[2407.17997](http://arxiv.org/abs/2407.17997)|null| 413 | |**2024-07-24**|**Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model**|Jan Lehečka et.al.|[2407.17167](http://arxiv.org/abs/2407.17167)|null| 414 | |**2024-07-23**|**Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments**|Pai Zhu et.al.|[2407.16840](http://arxiv.org/abs/2407.16840)|null| 415 | |**2024-07-19**|**Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2**|Chun Xu et.al.|[2407.14212](http://arxiv.org/abs/2407.14212)|null| 416 | |**2024-07-18**|**Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models**|Weiqin Li et.al.|[2407.13509](http://arxiv.org/abs/2407.13509)|null| 417 | |**2024-07-22**|**TTSDS -- Text-to-Speech Distribution Score**|Christoph Minixhofer et.al.|[2407.12707](http://arxiv.org/abs/2407.12707)|**[link](https://github.com/ttsds/ttsds)**| 418 | |**2024-07-17**|**Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech**|Haibin Wu et.al.|[2407.12229](http://arxiv.org/abs/2407.12229)|**[link](https://github.com/hbwu-ntu/emoctrltts-eval)**| 419 | |**2024-07-16**|**A Language Modeling Approach to Diacritic-Free Hebrew TTS**|Amit Roth et.al.|[2407.12206](http://arxiv.org/abs/2407.12206)|null| 420 | |**2024-07-17**|**Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding**|Chuanhao Sun et.al.|[2407.09370](http://arxiv.org/abs/2407.09370)|**[link](https://github.com/zhyuan11/SPE)**| 421 | |**2024-07-11**|**Autoregressive Speech Synthesis without Vector Quantization**|Lingwei Meng et.al.|[2407.08551](http://arxiv.org/abs/2407.08551)|null| 422 | |**2024-07-10**|**Source Tracing of Audio Deepfake Systems**|Nicholas Klein et.al.|[2407.08016](http://arxiv.org/abs/2407.08016)|null| 423 | |**2024-07-07**|**ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation**|Ruibo Fu et.al.|[2407.05421](http://arxiv.org/abs/2407.05421)|null| 424 | |**2024-07-09**|**CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens**|Zhihao Du et.al.|[2407.05407](http://arxiv.org/abs/2407.05407)|null| 425 | |**2024-07-04**|**Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis**|Cong-Thanh Do et.al.|[2407.04047](http://arxiv.org/abs/2407.04047)|null| 426 | |**2024-07-04**|**Optimizing a-DCF for Spoofing-Robust Speaker Verification**|Oğuzhan Kurnaz et.al.|[2407.04034](http://arxiv.org/abs/2407.04034)|null| 427 | |**2024-07-04**|**On the Effectiveness of Acoustic BPE in Decoder-Only TTS**|Bohan Li et.al.|[2407.03892](http://arxiv.org/abs/2407.03892)|null| 428 | |**2024-07-14**|**CATT: Character-based Arabic Tashkeel Transformer**|Faris Alasmary et.al.|[2407.03236](http://arxiv.org/abs/2407.03236)|**[link](https://github.com/abjadai/catt)**| 429 | |**2024-07-02**|**Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization**|Yuchen Hu et.al.|[2407.02243](http://arxiv.org/abs/2407.02243)|null| 430 | |**2024-07-02**|**TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations**|Xiaoxue Gao et.al.|[2407.01927](http://arxiv.org/abs/2407.01927)|null| 431 | |**2024-07-01**|**Lightweight Zero-shot Text-to-Speech with Mixture of Adapters**|Kenichi Fujita et.al.|[2407.01291](http://arxiv.org/abs/2407.01291)|null| 432 | |**2024-06-30**|**NAIST Simultaneous Speech Translation System for IWSLT 2024**|Yuka Ko et.al.|[2407.00826](http://arxiv.org/abs/2407.00826)|null| 433 | |**2024-06-30**|**An Attribute Interpolation Method in Speech Synthesis by Model Merging**|Masato Murata et.al.|[2407.00766](http://arxiv.org/abs/2407.00766)|null| 434 | |**2024-06-30**|**FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis**|Yinlin Guo et.al.|[2407.00753](http://arxiv.org/abs/2407.00753)|null| 435 | |**2024-07-02**|**Open-Source Conversational AI with SpeechBrain 1.0**|Mirco Ravanelli et.al.|[2407.00463](http://arxiv.org/abs/2407.00463)|null| 436 | |**2024-06-27**|**Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models**|Borodin Kirill Nikolayevich et.al.|[2406.19243](http://arxiv.org/abs/2406.19243)|null| 437 | |**2024-06-27**|**DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability**|Hyun Joon Park et.al.|[2406.19135](http://arxiv.org/abs/2406.19135)|**[link](https://github.com/winddori2002/dex-tts)**| 438 | |**2024-06-26**|**Automatic Speech Recognition for Hindi**|Anish Saha et.al.|[2406.18135](http://arxiv.org/abs/2406.18135)|null| 439 | |**2024-06-26**|**A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons**|Tzu-Yun Hung et.al.|[2406.18089](http://arxiv.org/abs/2406.18089)|null| 440 | |**2024-06-29**|**LLM-Driven Multimodal Opinion Expression Identification**|Bonian Jia et.al.|[2406.18088](http://arxiv.org/abs/2406.18088)|null| 441 | |**2024-06-26**|**E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS**|Sefik Emre Eskimez et.al.|[2406.18009](http://arxiv.org/abs/2406.18009)|**[link](https://github.com/microsoft/e2tts-test-suite)**| 442 | |**2024-06-25**|**Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment**|Paarth Neekhara et.al.|[2406.17957](http://arxiv.org/abs/2406.17957)|null| 443 | |**2024-06-22**|**A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge**|Xiaopeng Wang et.al.|[2406.17801](http://arxiv.org/abs/2406.17801)|null| 444 | |**2024-06-25**|**High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model**|Joun Yeop Lee et.al.|[2406.17310](http://arxiv.org/abs/2406.17310)|null| 445 | |**2024-06-25**|**Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation**|Yingting Li et.al.|[2406.17257](http://arxiv.org/abs/2406.17257)|null| 446 | |**2024-06-24**|**Exploring the Capability of Mamba in Speech Applications**|Koichi Miyazaki et.al.|[2406.16808](http://arxiv.org/abs/2406.16808)|null| 447 | |**2024-06-25**|**Towards Zero-Shot Text-To-Speech for Arabic Dialects**|Khai Duy Doan et.al.|[2406.16751](http://arxiv.org/abs/2406.16751)|null| 448 | |**2024-06-22**|**TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers**|Yakun Song et.al.|[2406.15752](http://arxiv.org/abs/2406.15752)|**[link](https://github.com/Ereboas/TacoLM)**| 449 | |**2024-06-21**|**InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions**|Yu Nakagome et.al.|[2406.14890](http://arxiv.org/abs/2406.14890)|null| 450 | |**2024-06-21**|**GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech**|Wenbin Wang et.al.|[2406.14875](http://arxiv.org/abs/2406.14875)|null| 451 | |**2024-06-21**|**DASB - Discrete Audio and Speech Benchmark**|Pooneh Mousavi et.al.|[2406.14294](http://arxiv.org/abs/2406.14294)|null| 452 | |**2024-06-18**|**Instruction Data Generation and Unsupervised Adaptation for Speech Language Models**|Vahid Noroozi et.al.|[2406.12946](http://arxiv.org/abs/2406.12946)|null| 453 | |**2024-06-17**|**DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer**|Keon Lee et.al.|[2406.11427](http://arxiv.org/abs/2406.11427)|null| 454 | |**2024-06-16**|**NAST: Noise Aware Speech Tokenization for Speech Language Models**|Shoval Messica et.al.|[2406.11037](http://arxiv.org/abs/2406.11037)|**[link](https://github.com/ShovalMessica/NAST)**| 455 | |**2024-06-16**|**Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis**|Xuehao Zhou et.al.|[2406.10844](http://arxiv.org/abs/2406.10844)|null| 456 | |**2024-06-14**|**Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice**|Shubham Gupta et.al.|[2406.10422](http://arxiv.org/abs/2406.10422)|null| 457 | |**2024-06-14**|**UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner**|Dongchao Yang et.al.|[2406.10056](http://arxiv.org/abs/2406.10056)|**[link](https://github.com/yangdongchao/llm-codec)**| 458 | |**2024-06-14**|**MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model**|Jiatong Shi et.al.|[2406.09869](http://arxiv.org/abs/2406.09869)|null| 459 | |**2024-06-13**|**DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage**|Kyra Wang et.al.|[2406.08820](http://arxiv.org/abs/2406.08820)|null| 460 | |**2024-06-13**|**Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems**|Zhengyang Chen et.al.|[2406.08812](http://arxiv.org/abs/2406.08812)|null| 461 | |**2024-06-13**|**DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing**|Neha Sahipjohn et.al.|[2406.08802](http://arxiv.org/abs/2406.08802)|null| 462 | |**2024-06-12**|**Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis**|Wing-Zin Leung et.al.|[2406.08568](http://arxiv.org/abs/2406.08568)|**[link](https://github.com/WingZLeung/TTDS)**| 463 | |**2024-06-12**|**Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data**|Yuma Shirahata et.al.|[2406.08111](http://arxiv.org/abs/2406.08111)|null| 464 | |**2024-06-12**|**VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech**|Ashishkumar Gudmalwar et.al.|[2406.08076](http://arxiv.org/abs/2406.08076)|null| 465 | |**2024-06-12**|**LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning**|Masaya Kawamura et.al.|[2406.07969](http://arxiv.org/abs/2406.07969)|**[link](https://github.com/line/libritts-p)**| 466 | |**2024-06-12**|**VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment**|Bing Han et.al.|[2406.07855](http://arxiv.org/abs/2406.07855)|null| 467 | |**2024-06-12**|**EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech**|Deok-Hyeon Cho et.al.|[2406.07803](http://arxiv.org/abs/2406.07803)|**[link](https://github.com/Choddeok/EmoSphere-TTS)**| 468 | |**2024-06-11**|**The Interspeech 2024 Challenge on Speech Processing Using Discrete Units**|Xuankai Chang et.al.|[2406.07725](http://arxiv.org/abs/2406.07725)|null| 469 | |**2024-06-11**|**Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?**|Qingkai Fang et.al.|[2406.07289](http://arxiv.org/abs/2406.07289)|null| 470 | |**2024-06-11**|**AudioMarkBench: Benchmarking Robustness of Audio Watermarking**|Hongbin Liu et.al.|[2406.06979](http://arxiv.org/abs/2406.06979)|**[link](https://github.com/moyangkuo/audiomarkbench)**| 471 | |**2024-06-11**|**Controlling Emotion in Text-to-Speech with Natural Language Prompts**|Thomas Bott et.al.|[2406.06406](http://arxiv.org/abs/2406.06406)|**[link](https://github.com/digitalphonetics/ims-toucan)**| 472 | |**2024-06-10**|**Meta Learning Text-to-Speech Synthesis in over 7000 Languages**|Florian Lux et.al.|[2406.06403](http://arxiv.org/abs/2406.06403)|**[link](https://github.com/digitalphonetics/ims-toucan)**| 473 | |**2024-06-10**|**MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance**|Semin Kim et.al.|[2406.05965](http://arxiv.org/abs/2406.05965)|null| 474 | |**2024-06-11**|**WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark**|Linhan Ma et.al.|[2406.05763](http://arxiv.org/abs/2406.05763)|**[link](https://github.com/dukGuo/valle-audiodec)**| 475 | |**2024-06-09**|**An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS**|Xiaofei Wang et.al.|[2406.05699](http://arxiv.org/abs/2406.05699)|null| 476 | |**2024-06-11**|**Text-aware and Context-aware Expressive Audiobook Speech Synthesis**|Dake Guo et.al.|[2406.05672](http://arxiv.org/abs/2406.05672)|null| 477 | |**2024-06-08**|**Autoregressive Diffusion Transformer for Text-to-Speech Synthesis**|Zhijun Liu et.al.|[2406.05551](http://arxiv.org/abs/2406.05551)|null| 478 | |**2024-06-08**|**VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers**|Sanyuan Chen et.al.|[2406.05370](http://arxiv.org/abs/2406.05370)|null| 479 | |**2024-06-07**|**Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis**|Ryan Langman et.al.|[2406.05298](http://arxiv.org/abs/2406.05298)|null| 480 | |**2024-06-07**|**XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model**|Edresson Casanova et.al.|[2406.04904](http://arxiv.org/abs/2406.04904)|**[link](https://github.com/Edresson/ZS-TTS-Evaluation)**| 481 | |**2024-06-07**|**TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking**|Junzuo Zhou et.al.|[2406.04840](http://arxiv.org/abs/2406.04840)|null| 482 | |**2024-06-07**|**Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study**|Chong Zhang et.al.|[2406.04633](http://arxiv.org/abs/2406.04633)|null| 483 | |**2024-06-06**|**Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis**|Théodor Lemerle et.al.|[2406.04467](http://arxiv.org/abs/2406.04467)|**[link](https://github.com/theodorblackbird/lina-speech)**| 484 | |**2024-06-06**|**Total-Duration-Aware Duration Modeling for Text-to-Speech Systems**|Sefik Emre Eskimez et.al.|[2406.04281](http://arxiv.org/abs/2406.04281)|null| 485 | |**2024-06-06**|**Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining**|Jinlong Xue et.al.|[2406.03714](http://arxiv.org/abs/2406.03714)|null| 486 | |**2024-06-06**|**Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model**|Jinlong Xue et.al.|[2406.03706](http://arxiv.org/abs/2406.03706)|null| 487 | |**2024-06-05**|**Style Mixture of Experts for Expressive Text-To-Speech Synthesis**|Ahad Jawaid et.al.|[2406.03637](http://arxiv.org/abs/2406.03637)|null| 488 | |**2024-06-07**|**Harder or Different? Understanding Generalization of Audio Deepfake Detection**|Nicolas M. Müller et.al.|[2406.03512](http://arxiv.org/abs/2406.03512)|null| 489 | |**2024-06-05**|**LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes**|Trung Dang et.al.|[2406.02897](http://arxiv.org/abs/2406.02897)|null| 490 | |**2024-06-04**|**Seed-TTS: A Family of High-Quality Versatile Speech Generation Models**|Philip Anastassiou et.al.|[2406.02430](http://arxiv.org/abs/2406.02430)|**[link](https://github.com/BytedanceSpeech/seed-tts-eval)**| 491 | |**2024-06-05**|**SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2406.02328](http://arxiv.org/abs/2406.02328)|null| 492 | |**2024-06-04**|**BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation**|Hui-Peng Du et.al.|[2406.02162](http://arxiv.org/abs/2406.02162)|null| 493 | |**2024-06-04**|**Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis**|Kun Zhou et.al.|[2406.02009](http://arxiv.org/abs/2406.02009)|null| 494 | |**2024-06-03**|**ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec**|Shengpeng Ji et.al.|[2406.01205](http://arxiv.org/abs/2406.01205)|**[link](https://github.com/jishengpeng/controlspeech)**| 495 | |**2024-06-03**|**Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training**|Jan Melechovsky et.al.|[2406.01018](http://arxiv.org/abs/2406.01018)|null| 496 | |**2024-06-02**|**Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback**|Chen Chen et.al.|[2406.00654](http://arxiv.org/abs/2406.00654)|null| 497 | |**2024-05-31**|**Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities**|Vicky Zayats et.al.|[2405.18669](http://arxiv.org/abs/2405.18669)|null| 498 | |**2024-05-28**|**TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation**|Chenyang Le et.al.|[2405.17809](http://arxiv.org/abs/2405.17809)|null| 499 | |**2024-05-27**|**RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis**|Haoxiang Shi et.al.|[2405.17028](http://arxiv.org/abs/2405.17028)|null| 500 | |**2024-05-24**|**Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition**|Zijin Gu et.al.|[2405.15216](http://arxiv.org/abs/2405.15216)|null| 501 | |**2024-05-23**|**Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models**|Jingyi Chen et.al.|[2405.14632](http://arxiv.org/abs/2405.14632)|null| 502 | |**2024-05-22**|**A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction**|Yue Li et.al.|[2405.13477](http://arxiv.org/abs/2405.13477)|null| 503 | |**2024-05-20**|**Multi-speaker Text-to-speech Training with Speaker Anonymized Data**|Wen-Chin Huang et.al.|[2405.11767](http://arxiv.org/abs/2405.11767)|null| 504 | |**2024-05-19**|**VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications**|Mikhail Konenkov et.al.|[2405.11537](http://arxiv.org/abs/2405.11537)|null| 505 | |**2024-05-18**|**Exploring speech style spaces with language models: Emotional TTS without emotion labels**|Shreeram Suresh Chandra et.al.|[2405.11413](http://arxiv.org/abs/2405.11413)|null| 506 | |**2024-05-16**|**Faces that Speak: Jointly Synthesising Talking Face and Speech from Text**|Youngjoon Jang et.al.|[2405.10272](http://arxiv.org/abs/2405.10272)|null| 507 | |**2024-05-16**|**Building a Luganda Text-to-Speech Model From Crowdsourced Data**|Sulaiman Kagumire et.al.|[2405.10211](http://arxiv.org/abs/2405.10211)|null| 508 | |**2024-05-16**|**Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model**|Siyang Wang et.al.|[2405.09768](http://arxiv.org/abs/2405.09768)|null| 509 | |**2024-05-15**|**Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer**|Weifei Jin et.al.|[2405.09470](http://arxiv.org/abs/2405.09470)|null| 510 | |**2024-05-15**|**Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2405.09171](http://arxiv.org/abs/2405.09171)|null| 511 | |**2024-05-14**|**PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset**|Yang Hou et.al.|[2405.08838](http://arxiv.org/abs/2405.08838)|**[link](https://github.com/tobuta/PolyGlotFake)**| 512 | |**2024-04-30**|**Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech**|Hankun Wang et.al.|[2404.19723](http://arxiv.org/abs/2404.19723)|null| 513 | |**2024-04-29**|**MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis**|Xiang Li et.al.|[2404.18398](http://arxiv.org/abs/2404.18398)|null| 514 | |**2024-04-28**|**USAT: A Universal Speaker-Adaptive Text-to-Speech Approach**|Wenbin Wang et.al.|[2404.18094](http://arxiv.org/abs/2404.18094)|**[link](https://github.com/mushanshanshan/esltts)**| 515 | |**2024-04-27**|**TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality**|Tiantian Feng et.al.|[2404.17983](http://arxiv.org/abs/2404.17983)|null| 516 | |**2024-04-26**|**An RFP dataset for Real, Fake, and Partially fake audio detection**|Abdulazeez AlAli et.al.|[2404.17721](http://arxiv.org/abs/2404.17721)|null| 517 | |**2024-04-23**|**StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations**|Sen Liu et.al.|[2404.14946](http://arxiv.org/abs/2404.14946)|null| 518 | |**2024-04-23**|**Retrieval-Augmented Audio Deepfake Detection**|Zuheng Kang et.al.|[2404.13892](http://arxiv.org/abs/2404.13892)|null| 519 | |**2024-04-14**|**Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling**|Quanxiu Wang et.al.|[2404.09192](http://arxiv.org/abs/2404.09192)|null| 520 | |**2024-04-11**|**Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network**|Mayura Manawadu et.al.|[2404.07807](http://arxiv.org/abs/2404.07807)|null| 521 | |**2024-04-18**|**Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness**|Xincan Feng et.al.|[2404.06714](http://arxiv.org/abs/2404.06714)|**[link](https://github.com/xincanfeng/vitsgpt)**| 522 | |**2024-04-10**|**CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations**|Leying Zhang et.al.|[2404.06690](http://arxiv.org/abs/2404.06690)|null| 523 | |**2024-04-10**|**The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge**|Yiwei Guo et.al.|[2404.06079](http://arxiv.org/abs/2404.06079)|null| 524 | |**2024-04-07**|**Cross-Domain Audio Deepfake Detection: Dataset and Analysis**|Yuang Li et.al.|[2404.04904](http://arxiv.org/abs/2404.04904)|null| 525 | |**2024-04-06**|**HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks**|Yingting Li et.al.|[2404.04645](http://arxiv.org/abs/2404.04645)|**[link](https://github.com/declare-lab/hypertts)**| 526 | |**2024-04-18**|**Open vocabulary keyword spotting through transfer learning from speech synthesis**|Kesavaraj V et.al.|[2404.03914](http://arxiv.org/abs/2404.03914)|null| 527 | |**2024-04-06**|**RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis**|Detai Xin et.al.|[2404.03204](http://arxiv.org/abs/2404.03204)|null| 528 | |**2024-04-03**|**CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech**|Jaehyeon Kim et.al.|[2404.02781](http://arxiv.org/abs/2404.02781)|null| 529 | |**2024-04-13**|**PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders**|Yu Pan et.al.|[2404.02702](http://arxiv.org/abs/2404.02702)|null| 530 | |**2024-03-31**|**Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation**|Rohan Chaudhury et.al.|[2404.01339](http://arxiv.org/abs/2404.01339)|**[link](https://github.com/rohan-chaudhury/humane-speech-synthesis-through-zero-shot-emotion-and-disfluency-generation)**| 531 | |**2024-03-28**|**A Review of Multi-Modal Large Language and Vision Models**|Kilian Carolan et.al.|[2404.01322](http://arxiv.org/abs/2404.01322)|null| 532 | |**2024-04-09**|**KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis**|Adal Abilbekov et.al.|[2404.01033](http://arxiv.org/abs/2404.01033)|**[link](https://github.com/is2ai/kazemotts)**| 533 | |**2024-03-31**|**CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models**|Xiang Li et.al.|[2404.00569](http://arxiv.org/abs/2404.00569)|**[link](https://github.com/xiangli2022/cm-tts)**| 534 | |**2024-03-25**|**VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild**|Puyuan Peng et.al.|[2403.16973](http://arxiv.org/abs/2403.16973)|**[link](https://github.com/jasonppy/voicecraft)**| 535 | |**2024-03-20**|**Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning**|Shivam Ratnakant Mhaskar et.al.|[2403.15469](http://arxiv.org/abs/2403.15469)|null| 536 | |**2024-03-20**|**UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge**|Wataru Nakata et.al.|[2403.13720](http://arxiv.org/abs/2403.13720)|null| 537 | |**2024-03-20**|**Building speech corpus with diverse voice characteristics for its prompt-based representation**|Aya Watanabe et.al.|[2403.13353](http://arxiv.org/abs/2403.13353)|null| 538 | |**2024-03-17**|**Creating an African American-Sounding TTS: Guidelines, Technical Challenges,and Surprising Evaluations**|Claudio Pinhanez et.al.|[2403.11209](http://arxiv.org/abs/2403.11209)|null| 539 | |**2024-03-17**|**EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech**|Ziqi Liang et.al.|[2403.08164](http://arxiv.org/abs/2403.08164)|null| 540 | |**2024-03-09**|**HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling**|Chunhui Wang et.al.|[2403.05989](http://arxiv.org/abs/2403.05989)|null| 541 | |**2024-03-05**|**AttentionStitch: How Attention Solves the Speech Editing Problem**|Antonios Alexos et.al.|[2403.04804](http://arxiv.org/abs/2403.04804)|null| 542 | |**2024-03-07**|**Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation**|Sai Akarsh et.al.|[2403.04178](http://arxiv.org/abs/2403.04178)|null| 543 | |**2024-03-27**|**NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models**|Zeqian Ju et.al.|[2403.03100](http://arxiv.org/abs/2403.03100)|null| 544 | |**2024-03-04**|**Brilla AI: AI Contestant for the National Science and Maths Quiz**|George Boateng et.al.|[2403.01699](http://arxiv.org/abs/2403.01699)|**[link](https://github.com/nsmq-ai/nsmqai)**| 545 | |**2024-03-02**|**Towards Accurate Lip-to-Speech Synthesis in-the-Wild**|Sindhu Hegde et.al.|[2403.01087](http://arxiv.org/abs/2403.01087)|null| 546 | |**2024-02-29**|**Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data**|Takaaki Saeki et.al.|[2402.18932](http://arxiv.org/abs/2402.18932)|null| 547 | |**2024-02-26**|**An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation**|Ahmet Gunduz et.al.|[2402.16380](http://arxiv.org/abs/2402.16380)|**[link](https://github.com/aixplain/tts-qa)**| 548 | |**2024-02-22**|**Efficient data selection employing Semantic Similarity-based Graph Structures for model training**|Roxana Petcu et.al.|[2402.14888](http://arxiv.org/abs/2402.14888)|null| 549 | |**2024-02-22**|**Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition**|Rendi Chevi et.al.|[2402.14523](http://arxiv.org/abs/2402.14523)|null| 550 | |**2024-02-19**|**On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models**|Miri Varshavsky-Hassid et.al.|[2402.12423](http://arxiv.org/abs/2402.12423)|null| 551 | |**2024-02-19**|**Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting**|Haolin Chen et.al.|[2402.12220](http://arxiv.org/abs/2402.12220)|**[link](https://github.com/idiap/bayesian-peft)**| 552 | |**2024-02-18**|**Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru**|Zining Wang et.al.|[2402.11571](http://arxiv.org/abs/2402.11571)|null| 553 | |**2024-02-14**|**MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech**|Shengpeng Ji et.al.|[2402.09378](http://arxiv.org/abs/2402.09378)|null| 554 | |**2024-02-15**|**BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data**|Mateusz Łajszczak et.al.|[2402.08093](http://arxiv.org/abs/2402.08093)|null| 555 | |**2024-03-04**|**Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like**|Naoyuki Kanda et.al.|[2402.07383](http://arxiv.org/abs/2402.07383)|null| 556 | |**2024-02-09**|**A New Approach to Voice Authenticity**|Nicolas M. Müller et.al.|[2402.06304](http://arxiv.org/abs/2402.06304)|null| 557 | |**2024-02-08**|**Unified Speech-Text Pretraining for Spoken Dialog Modeling**|Heeseung Kim et.al.|[2402.05706](http://arxiv.org/abs/2402.05706)|null| 558 | |**2024-02-05**|**Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations**|Álvaro Martín-Cortinas et.al.|[2402.03407](http://arxiv.org/abs/2402.03407)|null| 559 | |**2024-02-02**|**Natural language guidance of high-fidelity text-to-speech with synthetic annotations**|Dan Lyth et.al.|[2402.01912](http://arxiv.org/abs/2402.01912)|null| 560 | |**2024-01-23**|**Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization**|Wei-Ping Huang et.al.|[2402.01692](http://arxiv.org/abs/2402.01692)|null| 561 | |**2024-02-01**|**Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech**|Dong Yang et.al.|[2402.00288](http://arxiv.org/abs/2402.00288)|null| 562 | |**2024-02-01**|**PAM: Prompting Audio-Language Models for Audio Quality Assessment**|Soham Deshmukh et.al.|[2402.00282](http://arxiv.org/abs/2402.00282)|**[link](https://github.com/soham97/pam)**| 563 | |**2024-01-31**|**Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2**|Jiatong Shi et.al.|[2401.17619](http://arxiv.org/abs/2401.17619)|**[link](https://github.com/espnet/espnet)**| 564 | |**2024-01-28**|**MunTTS: A Text-to-Speech System for Mundari**|Varun Gumma et.al.|[2401.15579](http://arxiv.org/abs/2401.15579)|null| 565 | |**2024-01-30**|**VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech**|Chenpeng Du et.al.|[2401.14321](http://arxiv.org/abs/2401.14321)|null| 566 | |**2024-01-25**|**Text to speech synthesis**|Harini s et.al.|[2401.13891](http://arxiv.org/abs/2401.13891)|null| 567 | |**2024-01-25**|**SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation**|Dong Zhang et.al.|[2401.13527](http://arxiv.org/abs/2401.13527)|**[link](https://github.com/0nutation/speechgpt)**| 568 | |**2024-01-22**|**Benchmarking Large Multimodal Models against Common Corruptions**|Jiawei Zhang et.al.|[2401.11943](http://arxiv.org/abs/2401.11943)|**[link](https://github.com/sail-sg/mmcbench)**| 569 | |**2024-01-22**|**Adversarial speech for voice privacy protection from Personalized Speech generation**|Shihao Chen et.al.|[2401.11857](http://arxiv.org/abs/2401.11857)|null| 570 | |**2024-02-16**|**Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis**|Vinotha R et.al.|[2401.11771](http://arxiv.org/abs/2401.11771)|null| 571 | |**2024-01-19**|**Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech**|Abhinav Garg et.al.|[2401.10465](http://arxiv.org/abs/2401.10465)|null| 572 | |**2024-02-28**|**MLAAD: The Multi-Language Audio Anti-Spoofing Dataset**|Nicolas M. Müller et.al.|[2401.09512](http://arxiv.org/abs/2401.09512)|null| 573 | |**2024-01-15**|**MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory**|Robert G. Kimelman et.al.|[2401.07967](http://arxiv.org/abs/2401.07967)|null| 574 | |**2024-01-14**|**ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering**|Yakun Song et.al.|[2401.07333](http://arxiv.org/abs/2401.07333)|null| 575 | |**2024-01-12**|**Multi-Task Learning for Front-End Text Processing in TTS**|Wonjune Kang et.al.|[2401.06321](http://arxiv.org/abs/2401.06321)|**[link](https://github.com/facebookresearch/llama-hd-dataset)**| 576 | |**2024-01-11**|**End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2**|Aniket Tathe et.al.|[2401.06183](http://arxiv.org/abs/2401.06183)|null| 577 | |**2024-01-11**|**Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection**|Lian Huang et.al.|[2401.05614](http://arxiv.org/abs/2401.05614)|null| 578 | |**2024-01-10**|**Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters**|Kenichi Fujita et.al.|[2401.05111](http://arxiv.org/abs/2401.05111)|null| 579 | |**2024-01-07**|**Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments**|Zhonghao Shi et.al.|[2401.03581](http://arxiv.org/abs/2401.03581)|null| 580 | |**2024-01-07**|**Transfer the linguistic representations from TTS to accent conversion with non-parallel data**|Xi Chen et.al.|[2401.03538](http://arxiv.org/abs/2401.03538)|null| 581 | |**2024-01-03**|**Incremental FastPitch: Chunk-based High Quality Text to Speech**|Muyang Du et.al.|[2401.01755](http://arxiv.org/abs/2401.01755)|null| 582 | |**2024-01-03**|**Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction**|Minchan Kim et.al.|[2401.01498](http://arxiv.org/abs/2401.01498)|null| 583 | |**2023-12-18**|**Assisting Blind People Using Object Detection with Vocal Feedback**|Heba Najm et.al.|[2401.01362](http://arxiv.org/abs/2401.01362)|null| 584 | |**2023-12-30**|**Boosting Large Language Model for Speech Synthesis: An Empirical Study**|Hongkun Hao et.al.|[2401.00246](http://arxiv.org/abs/2401.00246)|null| 585 | |**2024-01-01**|**Normalization of Lithuanian Text Using Regular Expressions**|Pijus Kasparaitis et.al.|[2312.17660](http://arxiv.org/abs/2312.17660)|null| 586 | |**2023-12-27**|**AE-Flow: AutoEncoder Normalizing Flow**|Jakub Mosiński et.al.|[2312.16552](http://arxiv.org/abs/2312.16552)|null| 587 | |**2023-12-22**|**Creating New Voices using Normalizing Flows**|Piotr Bilinski et.al.|[2312.14569](http://arxiv.org/abs/2312.14569)|null| 588 | |**2023-12-22**|**ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations**|Cheng Gong et.al.|[2312.14398](http://arxiv.org/abs/2312.14398)|null| 589 | |**2023-12-19**|**External Knowledge Augmented Polyphone Disambiguation Using Large Language Model**|Chen Li et.al.|[2312.11920](http://arxiv.org/abs/2312.11920)|null| 590 | |**2023-12-17**|**A review-based study on different Text-to-Speech technologies**|Md. Jalal Uddin Chowdhury et.al.|[2312.11563](http://arxiv.org/abs/2312.11563)|null| 591 | |**2024-01-31**|**MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis**|Wenhao Guan et.al.|[2312.10687](http://arxiv.org/abs/2312.10687)|null| 592 | |**2024-02-22**|**Amphion: An Open-Source Audio, Music and Speech Generation Toolkit**|Xueyao Zhang et.al.|[2312.09911](http://arxiv.org/abs/2312.09911)|**[link](https://github.com/open-mmlab/amphion)**| 593 | |**2023-12-11**|**Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism**|Georgios Milis et.al.|[2312.06613](http://arxiv.org/abs/2312.06613)|**[link](https://github.com/g-milis/NEUTART)**| 594 | |**2023-12-08**|**An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis**|Via Nielson et.al.|[2312.05415](http://arxiv.org/abs/2312.05415)|null| 595 | |**2023-12-06**|**Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis**|Zehua Chen et.al.|[2312.03491](http://arxiv.org/abs/2312.03491)|null| 596 | |**2023-12-02**|**Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning**|Raviraj Joshi et.al.|[2312.01107](http://arxiv.org/abs/2312.01107)|null| 597 | |**2023-12-02**|**Code-Mixed Text to Speech Synthesis under Low-Resource Constraints**|Raviraj Joshi et.al.|[2312.01103](http://arxiv.org/abs/2312.01103)|null| 598 | |**2023-11-29**|**Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes**|Pavel Korshunov et.al.|[2311.17655](http://arxiv.org/abs/2311.17655)|null| 599 | |**2024-02-06**|**Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech**|Enting Zhou et.al.|[2311.14816](http://arxiv.org/abs/2311.14816)|**[link](https://github.com/ETZET/SpeechEmotionAVLearning)**| 600 | |**2023-12-07**|**Guided Flows for Generative Modeling and Decision Making**|Qinqing Zheng et.al.|[2311.13443](http://arxiv.org/abs/2311.13443)|null| 601 | |**2023-11-27**|**HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis**|Sang-Hoon Lee et.al.|[2311.12454](http://arxiv.org/abs/2311.12454)|**[link](https://github.com/sh-lee-prml/hierspeechpp)**| 602 | |**2023-11-18**|**Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots**|Farideh Majidi et.al.|[2311.11116](http://arxiv.org/abs/2311.11116)|null| 603 | |**2023-11-18**|**Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys**|Gabriel Cosache et.al.|[2311.11030](http://arxiv.org/abs/2311.11030)|null| 604 | |**2023-11-17**|**A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness**|Mathias Vogel et.al.|[2311.10804](http://arxiv.org/abs/2311.10804)|null| 605 | |**2023-11-16**|**Improving fairness for spoken language understanding in atypical speech with Text-to-Speech**|Helin Wang et.al.|[2311.10149](http://arxiv.org/abs/2311.10149)|**[link](https://github.com/wanghelin1997/aty-tts)**| 606 | |**2024-02-02**|**DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation**|Jianzong Wang et.al.|[2311.07965](http://arxiv.org/abs/2311.07965)|null| 607 | |**2023-11-12**|**ChatAnything: Facetime Chat with LLM-Enhanced Personas**|Yilin Zhao et.al.|[2311.06772](http://arxiv.org/abs/2311.06772)|null| 608 | |**2023-11-11**|**NewsGPT: ChatGPT Integration for Robot-Reporter**|Abdelhadi Hireche et.al.|[2311.06640](http://arxiv.org/abs/2311.06640)|**[link](https://github.com/aeh1707/NewsGPT_Pepper)**| 609 | |**2023-11-08**|**Synthetic Speaking Children -- Why We Need Them and How to Make Them**|Muhammad Ali Farooq et.al.|[2311.06307](http://arxiv.org/abs/2311.06307)|null| 610 | |**2023-09-25**|**Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image**|Minki Kang et.al.|[2311.05844](http://arxiv.org/abs/2311.05844)|null| 611 | |**2023-11-07**|**Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning**|Rishabh Jain et.al.|[2311.04313](http://arxiv.org/abs/2311.04313)|**[link](https://github.com/c3imaging/child_tts_fastpitch)**| 612 | |**2023-11-07**|**Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment**|Jakir Hasan et.al.|[2311.03792](http://arxiv.org/abs/2311.03792)|null| 613 | |**2023-11-08**|**Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction**|Minchan Kim et.al.|[2311.02898](http://arxiv.org/abs/2311.02898)|null| 614 | |**2023-11-02**|**Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations**|Hanglei Zhang et.al.|[2311.01260](http://arxiv.org/abs/2311.01260)|null| 615 | |**2023-11-02**|**E3 TTS: Easy End-to-End Diffusion-based Text to Speech**|Yuan Gao et.al.|[2311.00945](http://arxiv.org/abs/2311.00945)|null| 616 | |**2023-10-31**|**An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation**|Yingjie Zhou et.al.|[2310.20251](http://arxiv.org/abs/2310.20251)|**[link](https://github.com/zyj-2000/cumt_2d_photospeaker)**| 617 | |**2023-10-27**|**Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN**|Neeraj Kumar et.al.|[2310.18169](http://arxiv.org/abs/2310.18169)|null| 618 | |**2023-10-25**|**ArTST: Arabic Text and Speech Transformer**|Hawau Olamide Toyin et.al.|[2310.16621](http://arxiv.org/abs/2310.16621)|**[link](https://github.com/mbzuai-nlp/artst)**| 619 | |**2023-10-25**|**Generative Pre-training for Speech with Flow Matching**|Alexander H. Liu et.al.|[2310.16338](http://arxiv.org/abs/2310.16338)|null| 620 | |**2023-10-23**|**DPP-TTS: Diversifying prosodic features of speech via determinantal point processes**|Seongho Joo et.al.|[2310.14663](http://arxiv.org/abs/2310.14663)|null| 621 | |**2023-10-22**|**An overview of text-to-speech systems and media applications**|Mohammad Reza Hasanabadi et.al.|[2310.14301](http://arxiv.org/abs/2310.14301)|null| 622 | |**2023-10-14**|**Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling**|Tiberiu Boros et.al.|[2310.09636](http://arxiv.org/abs/2310.09636)|**[link](https://github.com/tiberiu44/TTS-Cube)**| 623 | |**2023-10-14**|**Attentive Multi-Layer Perceptron for Non-autoregressive Generation**|Shuyang Jiang et.al.|[2310.09512](http://arxiv.org/abs/2310.09512)|**[link](https://github.com/shark-nlp/attentivemlp)**| 624 | |**2023-12-22**|**Crowdsourced and Automatic Speech Prominence Estimation**|Max Morrison et.al.|[2310.08464](http://arxiv.org/abs/2310.08464)|**[link](https://github.com/reseval/reseval)**| 625 | |**2023-10-12**|**On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition**|Nick Rossenbach et.al.|[2310.08132](http://arxiv.org/abs/2310.08132)|null| 626 | |**2023-10-12**|**Vec-Tok Speech: speech vectorization and tokenization for neural speech generation**|Xinfa Zhu et.al.|[2310.07246](http://arxiv.org/abs/2310.07246)|**[link](https://github.com/bakerbunker/vectok)**| 627 | |**2023-10-10**|**Prosody Analysis of Audiobooks**|Charuta Pethe et.al.|[2310.06930](http://arxiv.org/abs/2310.06930)|null| 628 | |**2023-10-09**|**JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions**|Detai Xin et.al.|[2310.06072](http://arxiv.org/abs/2310.06072)|null| 629 | |**2024-01-09**|**Unified speech and gesture synthesis using flow matching**|Shivam Mehta et.al.|[2310.05181](http://arxiv.org/abs/2310.05181)|null| 630 | |**2023-10-08**|**Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset**|Ze Liu et.al.|[2310.04982](http://arxiv.org/abs/2310.04982)|null| 631 | |**2023-10-11**|**LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT**|Jiaming Wang et.al.|[2310.04673](http://arxiv.org/abs/2310.04673)|null| 632 | |**2024-01-22**|**Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis**|Jae-Sung Bae et.al.|[2310.03538](http://arxiv.org/abs/2310.03538)|null| 633 | |**2023-10-07**|**The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains**|Erica Cooper et.al.|[2310.02640](http://arxiv.org/abs/2310.02640)|null| 634 | |**2023-10-02**|**Towards human-like spoken dialogue generation between AI agents from written dialogue**|Kentaro Mitsui et.al.|[2310.01088](http://arxiv.org/abs/2310.01088)|null| 635 | |**2023-10-01**|**Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech**|Dareen Alharthi et.al.|[2310.00706](http://arxiv.org/abs/2310.00706)|null| 636 | |**2024-03-11**|**Fewer-token Neural Speech Codec with Time-invariant Codes**|Yong Ren et.al.|[2310.00014](http://arxiv.org/abs/2310.00014)|**[link](https://github.com/y-ren16/ticodec)**| 637 | |**2024-01-31**|**ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech**|Wenhao Guan et.al.|[2309.17056](http://arxiv.org/abs/2309.17056)|null| 638 | |**2023-09-29**|**Low-Resource Self-Supervised Learning with SSL-Enhanced TTS**|Po-chun Hsu et.al.|[2309.17020](http://arxiv.org/abs/2309.17020)|null| 639 | |**2023-09-29**|**Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features**|Yuxiang Zhang et.al.|[2309.16954](http://arxiv.org/abs/2309.16954)|null| 640 | |**2023-12-18**|**High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models**|Chunyu Qiang et.al.|[2309.15512](http://arxiv.org/abs/2309.15512)|null| 641 | |**2024-01-09**|**BiSinger: Bilingual Singing Voice Synthesis**|Huali Zhou et.al.|[2309.14089](http://arxiv.org/abs/2309.14089)|**[link](https://github.com/BiSinger-SVS/BiSinger)**| 642 | |**2023-10-07**|**HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS**|Dake Guo et.al.|[2309.13907](http://arxiv.org/abs/2309.13907)|null| 643 | |**2023-09-24**|**VoiceLDM: Text-to-Speech with Environmental Context**|Yeonghyeon Lee et.al.|[2309.13664](http://arxiv.org/abs/2309.13664)|null| 644 | |**2023-09-24**|**Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control**|Aya Watanabe et.al.|[2309.13509](http://arxiv.org/abs/2309.13509)|null| 645 | |**2023-09-22**|**DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2309.12792](http://arxiv.org/abs/2309.12792)|null| 646 | |**2023-09-22**|**Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts**|Shun Lei et.al.|[2309.11977](http://arxiv.org/abs/2309.11977)|null| 647 | |**2023-09-21**|**The Impact of Silence on Speech Anti-Spoofing**|Yuxiang Zhang et.al.|[2309.11827](http://arxiv.org/abs/2309.11827)|null| 648 | |**2023-09-21**|**Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech**|Rui Liu et.al.|[2309.11724](http://arxiv.org/abs/2309.11724)|**[link](https://github.com/ai-s2-lab/emopp)**| 649 | |**2023-09-20**|**Speak While You Think: Streaming Speech Synthesis During Text Generation**|Avihu Dekel et.al.|[2309.11210](http://arxiv.org/abs/2309.11210)|null| 650 | |**2023-09-20**|**Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model**|Xinyu Zhou et.al.|[2309.11000](http://arxiv.org/abs/2309.11000)|**[link](https://github.com/XinyuZhou2000/Spoken_Dialogue)**| 651 | |**2023-09-19**|**Exploring Speech Enhancement for Low-resource Speech Synthesis**|Zhaoheng Ni et.al.|[2309.10795](http://arxiv.org/abs/2309.10795)|null| 652 | |**2023-09-19**|**Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition**|Ziyang Ma et.al.|[2309.10294](http://arxiv.org/abs/2309.10294)|null| 653 | |**2023-09-17**|**Augmenting text for spoken language understanding with Large Language Models**|Roshan Sharma et.al.|[2309.09390](http://arxiv.org/abs/2309.09390)|null| 654 | |**2023-09-16**|**FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework**|Jianzong Wang et.al.|[2309.08837](http://arxiv.org/abs/2309.08837)|null| 655 | |**2023-09-15**|**Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech**|Dariusz Piotrowski et.al.|[2309.08255](http://arxiv.org/abs/2309.08255)|null| 656 | |**2023-09-15**|**HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods**|Hyun-seo Shin et.al.|[2309.08208](http://arxiv.org/abs/2309.08208)|**[link](https://github.com/talkingnow/HM-Conformer)**| 657 | |**2023-12-27**|**PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions**|Reo Shimizu et.al.|[2309.08140](http://arxiv.org/abs/2309.08140)|null| 658 | |**2023-09-15**|**Diversity-based core-set selection for text-to-speech with linguistic and acoustic features**|Kentaro Seki et.al.|[2309.08127](http://arxiv.org/abs/2309.08127)|null| 659 | |**2023-09-14**|**Direct Text to Speech Translation System using Acoustic Units**|Victoria Mingote et.al.|[2309.07478](http://arxiv.org/abs/2309.07478)|null| 660 | |**2023-10-07**|**FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec**|Zhihao Du et.al.|[2309.07405](http://arxiv.org/abs/2309.07405)|**[link](https://github.com/alibaba-damo-academy/funcodec)**| 661 | |**2023-09-13**|**DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation**|Zhichao Wu et.al.|[2309.06787](http://arxiv.org/abs/2309.06787)|null| 662 | |**2023-09-11**|**Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP**|Jinzuomu Zhong et.al.|[2309.05423](http://arxiv.org/abs/2309.05423)|**[link](https://github.com/jzmzhong/Automatic-Prosody-Annotator-with-SSWP-CLAP)**| 663 | |**2024-01-16**|**VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching**|Yiwei Guo et.al.|[2309.05027](http://arxiv.org/abs/2309.05027)|**[link](https://github.com/X-LANCE/VoiceFlow-TTS)**| 664 | |**2023-09-08**|**Cross-Utterance Conditioned VAE for Speech Generation**|Yang Li et.al.|[2309.04156](http://arxiv.org/abs/2309.04156)|null| 665 | |**2023-09-07**|**Large-Scale Automatic Audiobook Creation**|Brendan Walsh et.al.|[2309.03926](http://arxiv.org/abs/2309.03926)|null| 666 | |**2023-09-11**|**GRASS: Unified Generation Model for Speech-to-Semantic Tasks**|Aobo Xia et.al.|[2309.02780](http://arxiv.org/abs/2309.02780)|null| 667 | |**2023-09-12**|**MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023**|Zhihang Xu et.al.|[2309.02743](http://arxiv.org/abs/2309.02743)|null| 668 | |**2023-10-12**|**PromptTTS 2: Describing and Generating Voices with Text Prompt**|Yichong Leng et.al.|[2309.02285](http://arxiv.org/abs/2309.02285)|null| 669 | |**2023-09-04**|**A Comparative Analysis of Pretrained Language Models for Text-to-Speech**|Marcel Granero-Moya et.al.|[2309.01576](http://arxiv.org/abs/2309.01576)|null| 670 | |**2023-09-02**|**DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin**|Tao Li et.al.|[2309.00883](http://arxiv.org/abs/2309.00883)|null| 671 | |**2023-12-18**|**Learning Speech Representation From Contrastive Token-Acoustic Pretraining**|Chunyu Qiang et.al.|[2309.00424](http://arxiv.org/abs/2309.00424)|null| 672 | |**2023-09-01**|**The FruitShell French synthesis system at the Blizzard 2023 Challenge**|Xin Qi et.al.|[2309.00223](http://arxiv.org/abs/2309.00223)|null| 673 | |**2023-08-31**|**QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning**|Haohan Guo et.al.|[2309.00126](http://arxiv.org/abs/2309.00126)|null| 674 | |**2024-01-23**|**SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models**|Xin Zhang et.al.|[2308.16692](http://arxiv.org/abs/2308.16692)|**[link](https://github.com/0nutation/uslm)**| 675 | |**2023-08-31**|**Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis**|Weiqin Li et.al.|[2308.16593](http://arxiv.org/abs/2308.16593)|null| 676 | |**2023-08-31**|**Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information**|Jie Chen et.al.|[2308.16577](http://arxiv.org/abs/2308.16577)|null| 677 | |**2023-08-31**|**LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech**|Jie Chen et.al.|[2308.16569](http://arxiv.org/abs/2308.16569)|null| 678 | |**2023-08-30**|**CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis**|Yi Meng et.al.|[2308.16021](http://arxiv.org/abs/2308.16021)|null| 679 | |**2023-09-01**|**The DeepZen Speech Synthesis System for Blizzard Challenge 2023**|Christophe Veaux et.al.|[2308.15945](http://arxiv.org/abs/2308.15945)|null| 680 | |**2023-08-28**|**Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech**|Hyungchan Yoon et.al.|[2308.14909](http://arxiv.org/abs/2308.14909)|null| 681 | |**2023-09-04**|**Rep2wav: Noise Robust text-to-speech Using self-supervised representations**|Qiushi Zhu et.al.|[2308.14553](http://arxiv.org/abs/2308.14553)|null| 682 | |**2023-08-28**|**TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models**|Shengpeng Ji et.al.|[2308.14430](http://arxiv.org/abs/2308.14430)|**[link](https://github.com/jishengpeng/TextrolSpeech)**| 683 | |**2023-09-02**|**Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder**|Xuyuan Li et.al.|[2308.13365](http://arxiv.org/abs/2308.13365)|null| 684 | |**2023-08-24**|**Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations**|Wenbin Wang et.al.|[2308.13007](http://arxiv.org/abs/2308.13007)|null| 685 | |**2023-09-22**|**Sparks of Large Audio Models: A Survey and Outlook**|Siddique Latif et.al.|[2308.12792](http://arxiv.org/abs/2308.12792)|null| 686 | |**2023-10-25**|**SeamlessM4T: Massively Multilingual & Multimodal Machine Translation**|Seamless Communication et.al.|[2308.11596](http://arxiv.org/abs/2308.11596)|**[link](https://github.com/facebookresearch/seamless_communication)**| 687 | |**2023-08-31**|**Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models**|Heyang Xue et.al.|[2308.10428](http://arxiv.org/abs/2308.10428)|null| 688 | |**2023-08-16**|**AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis**|Hrishikesh Viswanath et.al.|[2308.08577](http://arxiv.org/abs/2308.08577)|null| 689 | |**2023-08-14**|**SpeechX: Neural Codec Language Model as a Versatile Speech Transformer**|Xiaofei Wang et.al.|[2308.06873](http://arxiv.org/abs/2308.06873)|null| 690 | |**2023-08-12**|**Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation**|Zhichao Wang et.al.|[2308.06457](http://arxiv.org/abs/2308.06457)|**[link](https://github.com/zhichaowang970201/text-to-video)**| 691 | |**2023-09-09**|**AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining**|Haohe Liu et.al.|[2308.05734](http://arxiv.org/abs/2308.05734)|**[link](https://github.com/haoheliu/AudioLDM2)**| 692 | |**2023-08-09**|**Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay**|Leixian Shen et.al.|[2308.04703](http://arxiv.org/abs/2308.04703)|null| 693 | |**2023-08-08**|**Towards an AI to Win Ghana's National Science and Maths Quiz**|George Boateng et.al.|[2308.04333](http://arxiv.org/abs/2308.04333)|**[link](https://github.com/nsmq-ai/nsmqai)**| 694 | |**2023-08-08**|**WonderFlow: Narration-Centric Design of Animated Data Videos**|Yun Wang et.al.|[2308.04040](http://arxiv.org/abs/2308.04040)|null| 695 | |**2023-08-04**|**Let's Give a Voice to Conversational Agents in Virtual Reality**|Michele Yin et.al.|[2308.02665](http://arxiv.org/abs/2308.02665)|**[link](https://github.com/sislab-unitn/Let-s-Give-a-Voice-to-Conversational-Agents-in-VR)**| 696 | |**2023-08-03**|**Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation**|Minsu Kim et.al.|[2308.01831](http://arxiv.org/abs/2308.01831)|**[link](https://github.com/choijeongsoo/utut)**| 697 | |**2023-08-02**|**SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis**|Ramanan Sivaguru et.al.|[2308.01018](http://arxiv.org/abs/2308.01018)|null| 698 | |**2023-07-07**|**Artificial Eye for the Blind**|Abhinav Benagi et.al.|[2308.00801](http://arxiv.org/abs/2308.00801)|null| 699 | |**2023-07-31**|**Multilingual context-based pronunciation learning for Text-to-Speech**|Giulia Comini et.al.|[2307.16709](http://arxiv.org/abs/2307.16709)|null| 700 | |**2023-07-31**|**Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech**|Guangyan Zhang et.al.|[2307.16679](http://arxiv.org/abs/2307.16679)|null| 701 | |**2023-07-31**|**Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings**|Manuel Sam Ribeiro et.al.|[2307.16643](http://arxiv.org/abs/2307.16643)|null| 702 | |**2023-07-31**|**DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training**|Hyung-Seok Oh et.al.|[2307.16549](http://arxiv.org/abs/2307.16549)|**[link](https://github.com/hsoh0306/diffprosody)**| 703 | |**2023-07-31**|**VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design**|Jungil Kong et.al.|[2307.16430](http://arxiv.org/abs/2307.16430)|null| 704 | |**2023-07-30**|**Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation**|Yuanhao Chen et.al.|[2307.16199](http://arxiv.org/abs/2307.16199)|**[link](https://github.com/edward-martyr/shanghainese-tts)**| 705 | |**2023-07-29**|**METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer**|Xinfa Zhu et.al.|[2307.15951](http://arxiv.org/abs/2307.15951)|null| 706 | |**2023-12-18**|**Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding**|Chunyu Qiang et.al.|[2307.15484](http://arxiv.org/abs/2307.15484)|null| 707 | |**2023-07-20**|**SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer**|Daegyeom Kim et.al.|[2307.10550](http://arxiv.org/abs/2307.10550)|**[link](https://github.com/0913ktg/sc_vall-e)**| 708 | |**2023-07-18**|**SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs**|Yinghao Aaron Li et.al.|[2307.09435](http://arxiv.org/abs/2307.09435)|null| 709 | |**2023-09-28**|**Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts**|Ziyue Jiang et.al.|[2307.07218](http://arxiv.org/abs/2307.07218)|null| 710 | |**2023-07-13**|**Controllable Emphasis with zero data for text-to-speech**|Arnaud Joly et.al.|[2307.07062](http://arxiv.org/abs/2307.07062)|null| 711 | |**2023-07-11**|**On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis**|Siyang Wang et.al.|[2307.05132](http://arxiv.org/abs/2307.05132)|null| 712 | |**2023-07-10**|**The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task**|Kun Song et.al.|[2307.04630](http://arxiv.org/abs/2307.04630)|null| 713 | |**2023-10-07**|**ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading**|Yujia Xiao et.al.|[2307.00782](http://arxiv.org/abs/2307.00782)|null| 714 | |**2023-06-28**|**EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech**|Daria Diatlova et.al.|[2307.00024](http://arxiv.org/abs/2307.00024)|**[link](https://github.com/deepvk/emospeech)**| 715 | |**2023-06-29**|**High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units**|Junchen Lu et.al.|[2306.17005](http://arxiv.org/abs/2306.17005)|null| 716 | |**2023-06-28**|**UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data**|Heeseung Kim et.al.|[2306.16083](http://arxiv.org/abs/2306.16083)|**[link](https://github.com/gmltmd789/UnitSpeech)**| 717 | |**2023-10-19**|**Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale**|Matthew Le et.al.|[2306.15687](http://arxiv.org/abs/2306.15687)|null| 718 | |**2023-06-27**|**GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech**|Yahuan Cong et.al.|[2306.15304](http://arxiv.org/abs/2306.15304)|null| 719 | |**2023-06-25**|**DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech**|Sen Liu et.al.|[2306.14145](http://arxiv.org/abs/2306.14145)|null| 720 | |**2023-06-21**|**Visual-Aware Text-to-Speech**|Mohan Zhou et.al.|[2306.12020](http://arxiv.org/abs/2306.12020)|null| 721 | |**2023-06-21**|**Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer**|Jakub Swiatkowski et.al.|[2306.11662](http://arxiv.org/abs/2306.11662)|null| 722 | |**2023-06-16**|**Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation**|Kishor Kayyar Lakshminarayana et.al.|[2306.10152](http://arxiv.org/abs/2306.10152)|null| 723 | |**2023-06-16**|**CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages**|Frederico S. Oliveira et.al.|[2306.10097](http://arxiv.org/abs/2306.10097)|null| 724 | |**2023-06-14**|**Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation**|Zheng Liang et.al.|[2306.08588](http://arxiv.org/abs/2306.08588)|null| 725 | |**2023-06-14**|**Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects**|Xinghua Qu et.al.|[2306.08219](http://arxiv.org/abs/2306.08219)|**[link](https://github.com/hyllll/vcrs)**| 726 | |**2023-11-20**|**StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models**|Yinghao Aaron Li et.al.|[2306.07691](http://arxiv.org/abs/2306.07691)|null| 727 | |**2024-01-18**|**UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding**|Chenpeng Du et.al.|[2306.07547](http://arxiv.org/abs/2306.07547)|null| 728 | |**2023-06-13**|**PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling**|Ji-Sang Hwang et.al.|[2306.07489](http://arxiv.org/abs/2306.07489)|null| 729 | |**2023-06-09**|**Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech**|Shijun Wang et.al.|[2306.05709](http://arxiv.org/abs/2306.05709)|null| 730 | |**2023-06-08**|**VIFS: An End-to-End Variational Inference for Foley Sound Synthesis**|Junhyeok Lee et.al.|[2306.05004](http://arxiv.org/abs/2306.05004)|**[link](https://github.com/junjun3518/vifs)**| 731 | |**2023-07-11**|**Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge**|Wenhao Guan et.al.|[2306.04301](http://arxiv.org/abs/2306.04301)|null| 732 | |**2023-06-06**|**Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias**|Ziyue Jiang et.al.|[2306.03509](http://arxiv.org/abs/2306.03509)|null| 733 | |**2023-08-02**|**Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis**|Zhenhui Ye et.al.|[2306.03504](http://arxiv.org/abs/2306.03504)|null| 734 | |**2023-06-05**|**Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis**|Dengfeng Ke et.al.|[2306.02593](http://arxiv.org/abs/2306.02593)|null| 735 | |**2023-06-05**|**Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model**|Hoyeon Lee et.al.|[2306.02579](http://arxiv.org/abs/2306.02579)|null| 736 | |**2023-06-05**|**Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming**|Xinlei Niu et.al.|[2306.02568](http://arxiv.org/abs/2306.02568)|**[link](https://github.com/Berthaniu/LatentOptimalPathsBayesianDP)**| 737 | |**2023-06-02**|**Towards Robust FastSpeech 2 by Modelling Residual Multimodality**|Fabian Kögel et.al.|[2306.01442](http://arxiv.org/abs/2306.01442)|**[link](https://github.com/sony/ai-research-code)**| 738 | |**2023-05-30**|**Towards Selection of Text-to-speech Data to Augment ASR Training**|Shuo Liu et.al.|[2306.00998](http://arxiv.org/abs/2306.00998)|null| 739 | |**2023-06-01**|**EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis**|Haobin Tang et.al.|[2306.00648](http://arxiv.org/abs/2306.00648)|null| 740 | |**2023-06-01**|**The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech**|Phat Do et.al.|[2306.00535](http://arxiv.org/abs/2306.00535)|null| 741 | |**2023-05-31**|**Text-to-Speech Pipeline for Swiss German -- A comparison**|Tobias Bollinger et.al.|[2305.19750](http://arxiv.org/abs/2305.19750)|null| 742 | |**2023-05-31**|**XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech**|Linh The Nguyen et.al.|[2305.19709](http://arxiv.org/abs/2305.19709)|**[link](https://github.com/vinairesearch/xphonebert)**| 743 | |**2023-06-01**|**PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions**|Guanghou Liu et.al.|[2305.19522](http://arxiv.org/abs/2305.19522)|null| 744 | |**2023-05-30**|**Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages**|Phat Do et.al.|[2305.19396](http://arxiv.org/abs/2305.19396)|null| 745 | |**2023-05-30**|**Make-A-Voice: Unified Voice Synthesis With Discrete Representation**|Rongjie Huang et.al.|[2305.19269](http://arxiv.org/abs/2305.19269)|null| 746 | |**2023-05-30**|**STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions**|Michel Plüss et.al.|[2305.18855](http://arxiv.org/abs/2305.18855)|null| 747 | |**2023-05-30**|**LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus**|Yuma Koizumi et.al.|[2305.18802](http://arxiv.org/abs/2305.18802)|null| 748 | |**2023-10-09**|**An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization**|Fei Kong et.al.|[2305.18355](http://arxiv.org/abs/2305.18355)|**[link](https://github.com/kong13661/pia)**| 749 | |**2023-05-29**|**ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation**|Ambuj Mehrish et.al.|[2305.18028](http://arxiv.org/abs/2305.18028)|**[link](https://github.com/declare-lab/adapter-mix)**| 750 | |**2023-05-29**|**Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis**|Erik Ekstedt et.al.|[2305.17971](http://arxiv.org/abs/2305.17971)|null| 751 | |**2023-07-25**|**StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation**|Kun Song et.al.|[2305.17732](http://arxiv.org/abs/2305.17732)|null| 752 | |**2023-05-28**|**Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS**|Sewade Ogun et.al.|[2305.17724](http://arxiv.org/abs/2305.17724)|**[link](https://github.com/ogunlao/glowtts_stdp)**| 753 | |**2023-07-19**|**Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing**|Julia Kaiwen Lau et.al.|[2305.17445](http://arxiv.org/abs/2305.17445)|**[link](https://github.com/julianyonghao/fainasrtest)**| 754 | |**2023-05-26**|**DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction**|Vineet Bhat et.al.|[2305.16957](http://arxiv.org/abs/2305.16957)|null| 755 | |**2023-05-25**|**Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion**|Rui Liu et.al.|[2305.16353](http://arxiv.org/abs/2305.16353)|**[link](https://github.com/ai-s2-lab/m2s-add)**| 756 | |**2023-05-22**|**Text Generation with Speech Synthesis for ASR Data Augmentation**|Zhuangqun Huang et.al.|[2305.16333](http://arxiv.org/abs/2305.16333)|null| 757 | |**2023-05-25**|**VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation**|Tianrui Wang et.al.|[2305.16107](http://arxiv.org/abs/2305.16107)|null| 758 | |**2023-05-25**|**Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration**|Rustem Yeshpanov et.al.|[2305.15749](http://arxiv.org/abs/2305.15749)|**[link](https://github.com/is2ai/turkictts)**| 759 | |**2024-02-05**|**LAraBench: Benchmarking Arabic AI with Large Language Models**|Ahmed Abdelali et.al.|[2305.14982](http://arxiv.org/abs/2305.14982)|null| 760 | |**2023-05-23**|**EfficientSpeech: An On-Device Text to Speech Model**|Rowel Atienza et.al.|[2305.13905](http://arxiv.org/abs/2305.13905)|**[link](https://github.com/roatienza/efficientspeech)**| 761 | |**2023-05-23**|**ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models**|Minki Kang et.al.|[2305.13831](http://arxiv.org/abs/2305.13831)|null| 762 | |**2023-05-22**|**U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech**|Xin Jing et.al.|[2305.13195](http://arxiv.org/abs/2305.13195)|null| 763 | |**2023-05-25**|**EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels**|Kari Ali Noriy et.al.|[2305.13137](http://arxiv.org/abs/2305.13137)|**[link](https://github.com/knoriy/emns-dct)**| 764 | |**2023-05-22**|**ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer**|Huadai Liu et.al.|[2305.12708](http://arxiv.org/abs/2305.12708)|null| 765 | |**2023-05-21**|**VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages**|Shivam Mhaskar et.al.|[2305.12518](http://arxiv.org/abs/2305.12518)|null| 766 | |**2023-05-26**|**Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus**|Detai Xin et.al.|[2305.12442](http://arxiv.org/abs/2305.12442)|**[link](https://github.com/aria-k-alethia/laughter-synthesis)**| 767 | |**2023-05-20**|**ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios**|Yuyue Wang et.al.|[2305.12200](http://arxiv.org/abs/2305.12200)|null| 768 | |**2023-05-19**|**MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting**|Neil Shah et.al.|[2305.11926](http://arxiv.org/abs/2305.11926)|null| 769 | |**2024-02-20**|**Data Redaction from Conditional Generative Models**|Zhifeng Kong et.al.|[2305.11351](http://arxiv.org/abs/2305.11351)|null| 770 | |**2023-05-18**|**Parameter-Efficient Learning for Text-to-Speech Accent Adaptation**|Li-Jen Yang et.al.|[2305.11320](http://arxiv.org/abs/2305.11320)|**[link](https://github.com/TTS-Research/PEL-TTS)**| 771 | |**2023-05-19**|**Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation**|Martijn Bartelds et.al.|[2305.10951](http://arxiv.org/abs/2305.10951)|**[link](https://github.com/bartelds/asr-augmentation)**| 772 | |**2023-09-30**|**Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data**|Yusheng Tian et.al.|[2305.10891](http://arxiv.org/abs/2305.10891)|**[link](https://github.com/dmse4tts/dmse4tts)**| 773 | |**2023-05-18**|**FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs**|Won Jang et.al.|[2305.10823](http://arxiv.org/abs/2305.10823)|null| 774 | |**2023-05-18**|**CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training**|Zhenhui Ye et.al.|[2305.10763](http://arxiv.org/abs/2305.10763)|null| 775 | |**2023-08-29**|**a unified front-end framework for english text-to-speech synthesis**|Zelin Ying et.al.|[2305.10666](http://arxiv.org/abs/2305.10666)|null| 776 | |**2023-09-19**|**Controllable Speaking Styles Using a Large Language Model**|Atli Thor Sigurgeirsson et.al.|[2305.10321](http://arxiv.org/abs/2305.10321)|null| 777 | |**2023-05-23**|**Better speech synthesis through scaling**|James Betker et.al.|[2305.07243](http://arxiv.org/abs/2305.07243)|**[link](https://github.com/neonbjb/tortoise-tts)**| 778 | |**2023-10-29**|**CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model**|Zhen Ye et.al.|[2305.06908](http://arxiv.org/abs/2305.06908)|**[link](https://github.com/zhenye234/CoMoSpeech)**| 779 | |**2023-05-08**|**Accented Text-to-Speech Synthesis with Limited Data**|Xuehao Zhou et.al.|[2305.04816](http://arxiv.org/abs/2305.04816)|null| 780 | |**2023-05-03**|**M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis**|Jinlong Xue et.al.|[2305.02269](http://arxiv.org/abs/2305.02269)|null| 781 | |**2023-05-30**|**A Review of Deep Learning Techniques for Speech Processing**|Ambuj Mehrish et.al.|[2305.00359](http://arxiv.org/abs/2305.00359)|null| 782 | |**2023-04-26**|**Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis**|Ye-Xin Lu et.al.|[2304.13270](http://arxiv.org/abs/2304.13270)|null| 783 | |**2023-04-25**|**Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge**|Chenpeng Du et.al.|[2304.13121](http://arxiv.org/abs/2304.13121)|null| 784 | |**2023-04-24**|**Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model**|Kenichi Fujita et.al.|[2304.11976](http://arxiv.org/abs/2304.11976)|null| 785 | |**2023-04-23**|**DiffVoice: Text-to-Speech with Latent Diffusion**|Zhijun Liu et.al.|[2304.11750](http://arxiv.org/abs/2304.11750)|null| 786 | |**2023-04-23**|**SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model**|Jianzong Wang et.al.|[2304.11547](http://arxiv.org/abs/2304.11547)|null| 787 | |**2023-05-30**|**NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers**|Kai Shen et.al.|[2304.09116](http://arxiv.org/abs/2304.09116)|null| 788 | |**2023-04-16**|**A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers**|Juan Zuluaga-Gomez et.al.|[2304.07842](http://arxiv.org/abs/2304.07842)|null| 789 | |**2023-04-13**|**Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis**|Shun Lei et.al.|[2304.06359](http://arxiv.org/abs/2304.06359)|null| 790 | |**2023-04-10**|**Enhancing Speech-to-Speech Translation with Multiple TTS Targets**|Jiatong Shi et.al.|[2304.04618](http://arxiv.org/abs/2304.04618)|null| 791 | |**2023-04-07**|**ArmanTTS single-speaker Persian dataset**|Mohammd Hasan Shamgholi et.al.|[2304.03585](http://arxiv.org/abs/2304.03585)|null| 792 | |**2023-04-03**|**Ensemble prosody prediction for expressive speech synthesis**|Tian Huey Teh et.al.|[2304.00714](http://arxiv.org/abs/2304.00714)|null| 793 | |**2023-03-29**|**AraSpot: Arabic Spoken Command Spotting**|Mahmoud Salhab et.al.|[2303.16621](http://arxiv.org/abs/2303.16621)|**[link](https://github.com/msalhab96/araspot)**| 794 | |**2023-03-28**|**Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages**|Seongyeon Park et.al.|[2303.15669](http://arxiv.org/abs/2303.15669)|**[link](https://github.com/cnaigithub/SpeechDewarping)**| 795 | |**2023-03-27**|**Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis**|Karren Yang et.al.|[2303.14885](http://arxiv.org/abs/2303.14885)|null| 796 | |**2023-03-24**|**Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis**|Takuhiro Kaneko et.al.|[2303.13909](http://arxiv.org/abs/2303.13909)|null| 797 | |**2023-04-02**|**A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI**|Chenshuang Zhang et.al.|[2303.13336](http://arxiv.org/abs/2303.13336)|null| 798 | |**2023-03-20**|**Code-Switching Text Generation and Injection in Mandarin-English ASR**|Haibin Yu et.al.|[2303.10949](http://arxiv.org/abs/2303.10949)|null| 799 | |**2023-03-14**|**Controlling High-Dimensional Data With Sparse Input**|Dan Andrei Iliescu et.al.|[2303.09446](http://arxiv.org/abs/2303.09446)|null| 800 | |**2023-03-09**|**Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports**|Hyunseung Chung et.al.|[2303.09395](http://arxiv.org/abs/2303.09395)|**[link](https://github.com/tclife/text_to_ecg)**| 801 | |**2023-03-15**|**Cross-speaker Emotion Transfer by Manipulating Speech Style Latents**|Suhee Jo et.al.|[2303.08329](http://arxiv.org/abs/2303.08329)|null| 802 | |**2023-03-14**|**QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis**|Haobin Tang et.al.|[2303.07682](http://arxiv.org/abs/2303.07682)|null| 803 | |**2023-03-10**|**An End-to-End Neural Network for Image-to-Audio Transformation**|Liu Chen et.al.|[2303.06078](http://arxiv.org/abs/2303.06078)|null| 804 | |**2023-03-09**|**Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation**|Qi Chen et.al.|[2303.05322](http://arxiv.org/abs/2303.05322)|**[link](https://github.com/moon0316/t2a)**| 805 | |**2023-03-07**|**Do Prosody Transfer Models Transfer Prosody?**|Atli Thor Sigurgeirsson et.al.|[2303.04289](http://arxiv.org/abs/2303.04289)|null| 806 | |**2023-03-07**|**Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling**|Ziqiang Zhang et.al.|[2303.03926](http://arxiv.org/abs/2303.03926)|null| 807 | |**2023-03-02**|**Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding**|Yingting Li et.al.|[2303.03267](http://arxiv.org/abs/2303.03267)|**[link](https://github.com/declare-lab/speech-adapters)**| 808 | |**2023-03-08**|**FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model**|Ruiqing Xue et.al.|[2303.02939](http://arxiv.org/abs/2303.02939)|null| 809 | |**2023-08-14**|**Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations**|Yuma Koizumi et.al.|[2303.01664](http://arxiv.org/abs/2303.01664)|null| 810 | |**2023-03-11**|**Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities**|Shijun Wang et.al.|[2303.01508](http://arxiv.org/abs/2303.01508)|null| 811 | |**2023-12-17**|**ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations**|Neil Shah et.al.|[2303.01261](http://arxiv.org/abs/2303.01261)|null| 812 | |**2023-03-02**|**LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion**|Chunfeng Wang et.al.|[2303.01086](http://arxiv.org/abs/2303.01086)|null| 813 | |**2023-03-02**|**Leveraging Large Text Corpora for End-to-End Speech Summarization**|Kohei Matsuura et.al.|[2303.00978](http://arxiv.org/abs/2303.00978)|null| 814 | |**2023-03-01**|**DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction**|Raviteja Anantha et.al.|[2303.00171](http://arxiv.org/abs/2303.00171)|null| 815 | |**2023-02-28**|**ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus**|Ajinkya Kulkarni et.al.|[2303.00069](http://arxiv.org/abs/2303.00069)|null| 816 | |**2023-02-28**|**Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners**|Jocelyn Huang et.al.|[2302.14523](http://arxiv.org/abs/2302.14523)|null| 817 | |**2023-06-12**|**CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis**|Ji-Hoon Kim et.al.|[2302.14370](http://arxiv.org/abs/2302.14370)|null| 818 | |**2023-05-19**|**UniFLG: Unified Facial Landmark Generator from Text or Speech**|Kentaro Mitsui et.al.|[2302.14337](http://arxiv.org/abs/2302.14337)|null| 819 | |**2023-02-27**|**Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech**|Jiyoung Lee et.al.|[2302.13700](http://arxiv.org/abs/2302.13700)|**[link](https://github.com/naver-ai/facetts)**| 820 | |**2023-02-27**|**Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech**|Dong Yang et.al.|[2302.13652](http://arxiv.org/abs/2302.13652)|null| 821 | |**2023-02-27**|**Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow**|Yoonhyung Lee et.al.|[2302.13458](http://arxiv.org/abs/2302.13458)|null| 822 | |**2023-06-06**|**PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS**|Junhyeok Lee et.al.|[2302.12391](http://arxiv.org/abs/2302.12391)|**[link](https://github.com/anonymous-pits/pits)**| 823 | |**2023-02-21**|**Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition**|Leyuan Qu et.al.|[2302.09723](http://arxiv.org/abs/2302.09723)|null| 824 | |**2023-02-23**|**QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion**|Houjian Guo et.al.|[2302.08296](http://arxiv.org/abs/2302.08296)|**[link](https://github.com/quickvc/quickvoice-conversion)**| 825 | |**2023-02-13**|**Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages**|Sudhanshu Srivastava et.al.|[2302.06227](http://arxiv.org/abs/2302.06227)|null| 826 | |**2023-02-08**|**A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech**|Li-Wei Chen et.al.|[2302.04215](http://arxiv.org/abs/2302.04215)|**[link](https://github.com/b04901014/mqtts)**| 827 | |**2023-02-07**|**Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision**|Eugene Kharitonov et.al.|[2302.03540](http://arxiv.org/abs/2302.03540)|null| 828 | |**2023-02-15**|**MAC: A unified framework boosting low resource automatic speech recognition**|Zeping Min et.al.|[2302.03498](http://arxiv.org/abs/2302.03498)|null| 829 | |**2023-06-25**|**InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt**|Dongchao Yang et.al.|[2301.13662](http://arxiv.org/abs/2301.13662)|**[link](https://github.com/yangdongchao/academicodec)**| 830 | |**2023-03-01**|**UzbekTagger: The rule-based POS tagger for Uzbek language**|Maksud Sharipov et.al.|[2301.12711](http://arxiv.org/abs/2301.12711)|null| 831 | |**2023-05-27**|**Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining**|Takaaki Saeki et.al.|[2301.12596](http://arxiv.org/abs/2301.12596)|**[link](https://github.com/takaaki-saeki/zm-text-tts)**| 832 | |**2023-01-31**|**Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker**|Navjot Kaur et.al.|[2301.12331](http://arxiv.org/abs/2301.12331)|**[link](https://github.com/chocobearz/speech_timing)**| 833 | |**2023-01-26**|**On granularity of prosodic representations in expressive text-to-speech**|Mikolaj Babianski et.al.|[2301.11446](http://arxiv.org/abs/2301.11446)|null| 834 | |**2023-01-26**|**Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study**|Massa Baali et.al.|[2301.09099](http://arxiv.org/abs/2301.09099)|**[link](https://github.com/espnet/espnet/tree/master/egs2/qasr_tts/tts1)**| 835 | |**2023-01-20**|**Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions**|Yinghao Aaron Li et.al.|[2301.08810](http://arxiv.org/abs/2301.08810)|null| 836 | |**2023-01-11**|**Modelling low-resource accents without accent-specific TTS frontend**|Georgi Tinchev et.al.|[2301.04606](http://arxiv.org/abs/2301.04606)|null| 837 | |**2022-12-11**|**BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm**|Yu-Wen Chen et.al.|[2301.04120](http://arxiv.org/abs/2301.04120)|**[link](https://github.com/yuwchen/baspro)**| 838 | |**2023-01-10**|**UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion**|Haogeng Liu et.al.|[2301.03801](http://arxiv.org/abs/2301.03801)|null| 839 | |**2023-01-10**|**Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation**|Abdullah Shahid et.al.|[2301.03751](http://arxiv.org/abs/2301.03751)|null| 840 | |**2023-09-19**|**Applying Automated Machine Translation to Educational Video Courses**|Linden Wang et.al.|[2301.03141](http://arxiv.org/abs/2301.03141)|null| 841 | |**2023-01-06**|**Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition**|David M. Chan et.al.|[2301.02736](http://arxiv.org/abs/2301.02736)|null| 842 | |**2023-01-05**|**Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers**|Chengyi Wang et.al.|[2301.02111](http://arxiv.org/abs/2301.02111)|**[link](https://github.com/microsoft/unilm)**| 843 | |**2022-12-11**|**MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset**|Kailin Liang et.al.|[2301.00657](http://arxiv.org/abs/2301.00657)|**[link](https://github.com/ssmlkl/mntts2)**| 844 | |**2022-12-30**|**ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech**|Zehua Chen et.al.|[2212.14518](http://arxiv.org/abs/2212.14518)|null| 845 | |**2022-12-29**|**StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models**|Yinghao Aaron Li et.al.|[2212.14227](http://arxiv.org/abs/2212.14227)|**[link](https://github.com/yl4579/StyleTTS-VC)**| 846 | |**2022-12-22**|**HMM-based data augmentation for E2E systems for building conversational speech synthesis systems**|Ishika Gupta et.al.|[2212.11982](http://arxiv.org/abs/2212.11982)|null| 847 | |**2022-12-21**|**ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement**|Wei-Ning Hsu et.al.|[2212.11377](http://arxiv.org/abs/2212.11377)|null| 848 | |**2022-12-20**|**TTS-Guided Training for Accent Conversion Without Parallel Data**|Yi Zhou et.al.|[2212.10204](http://arxiv.org/abs/2212.10204)|null| 849 | |**2023-06-28**|**Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling**|Tuomo Raitio et.al.|[2212.10075](http://arxiv.org/abs/2212.10075)|null| 850 | |**2022-12-16**|**Speech Aware Dialog System Technology Challenge (DSTC11)**|Hagen Soltau et.al.|[2212.08704](http://arxiv.org/abs/2212.08704)|null| 851 | |**2022-12-16**|**Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder**|Yusuke Yasuda et.al.|[2212.08329](http://arxiv.org/abs/2212.08329)|null| 852 | |**2022-12-16**|**Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language**|Yusuke Yasuda et.al.|[2212.08321](http://arxiv.org/abs/2212.08321)|null| 853 | |**2022-12-15**|**RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis**|Shinhyeok Oh et.al.|[2212.07939](http://arxiv.org/abs/2212.07939)|**[link](https://github.com/shinhyeokoh/rwen)**| 854 | |**2022-12-14**|**Probing Deep Speaker Embeddings for Speaker-related Tasks**|Zifeng Zhao et.al.|[2212.07068](http://arxiv.org/abs/2212.07068)|null| 855 | |**2022-12-08**|**SpeechLMScore: Evaluating speech generation using speech language model**|Soumi Maiti et.al.|[2212.04559](http://arxiv.org/abs/2212.04559)|**[link](https://github.com/espnet/espnet)**| 856 | |**2023-04-04**|**Learning to Dub Movies via Hierarchical Prosody Models**|Gaoxiang Cong et.al.|[2212.04054](http://arxiv.org/abs/2212.04054)|**[link](https://github.com/galaxycong/hpmdubbing)**| 857 | |**2022-12-07**|**Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning**|Ankur Debnath et.al.|[2212.03558](http://arxiv.org/abs/2212.03558)|null| 858 | |**2022-12-07**|**Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue**|Daxin Tan et.al.|[2212.03398](http://arxiv.org/abs/2212.03398)|null| 859 | |**2022-12-06**|**UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis**|Yi Lei et.al.|[2212.01546](http://arxiv.org/abs/2212.01546)|null| 860 | |**2022-11-30**|**SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech**|Byoung Jin Choi et.al.|[2211.16866](http://arxiv.org/abs/2211.16866)|null| 861 | |**2022-11-29**|**Controllable speech synthesis by learning discrete phoneme-level prosodic representations**|Nikolaos Ellinas et.al.|[2211.16307](http://arxiv.org/abs/2211.16307)|null| 862 | |**2023-05-25**|**Evaluating and reducing the distance between synthetic and real speech distributions**|Christoph Minixhofer et.al.|[2211.16049](http://arxiv.org/abs/2211.16049)|null| 863 | |**2022-11-26**|**Contextual Expressive Text-to-Speech**|Jianhong Tu et.al.|[2211.14548](http://arxiv.org/abs/2211.14548)|null| 864 | |**2022-12-05**|**Efficient Incremental Text-to-Speech on GPUs**|Muyang Du et.al.|[2211.13939](http://arxiv.org/abs/2211.13939)|null| 865 | |**2023-03-21**|**Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?**|Xuan Shi et.al.|[2211.13868](http://arxiv.org/abs/2211.13868)|**[link](https://github.com/nii-yamagishilab/midi-to-audio)**| 866 | |**2022-11-23**|**IMaSC -- ICFOSS Malayalam Speech Corpus**|Deepa P Gopinath et.al.|[2211.12796](http://arxiv.org/abs/2211.12796)|null| 867 | |**2022-11-22**|**PromptTTS: Controllable Text-to-Speech with Text Descriptions**|Zhifang Guo et.al.|[2211.12171](http://arxiv.org/abs/2211.12171)|null| 868 | |**2022-11-04**|**Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech**|Xin Zhang et.al.|[2211.09731](http://arxiv.org/abs/2211.09731)|null| 869 | |**2023-02-17**|**Towards Building Text-To-Speech Systems for the Next Billion Users**|Gokul Karthik Kumar et.al.|[2211.09536](http://arxiv.org/abs/2211.09536)|**[link](https://github.com/gokulkarthik/text2speech)**| 870 | |**2023-02-16**|**EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance**|Yiwei Guo et.al.|[2211.09496](http://arxiv.org/abs/2211.09496)|null| 871 | |**2022-11-17**|**Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation**|Chunyu Qiang et.al.|[2211.09495](http://arxiv.org/abs/2211.09495)|null| 872 | |**2022-11-17**|**NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis**|Hyeong-Seok Choi et.al.|[2211.09407](http://arxiv.org/abs/2211.09407)|null| 873 | |**2023-03-14**|**Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models**|Minki Kang et.al.|[2211.09383](http://arxiv.org/abs/2211.09383)|null| 874 | |**2023-01-04**|**Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation**|Xin Yuan et.al.|[2211.09365](http://arxiv.org/abs/2211.09365)|null| 875 | |**2022-11-14**|**SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech**|Perry Lam et.al.|[2211.07283](http://arxiv.org/abs/2211.07283)|null| 876 | |**2023-05-24**|**Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing**|Jacob J Webber et.al.|[2211.06989](http://arxiv.org/abs/2211.06989)|null| 877 | |**2023-05-29**|**OverFlow: Putting flows on top of neural transducers for better TTS**|Shivam Mehta et.al.|[2211.06892](http://arxiv.org/abs/2211.06892)|**[link](https://github.com/coqui-ai/TTS)**| 878 | |**2023-05-29**|**Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations**|Yoori Oh et.al.|[2211.06160](http://arxiv.org/abs/2211.06160)|null| 879 | |**2022-12-04**|**ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech**|Xiaoran Fan et.al.|[2211.03545](http://arxiv.org/abs/2211.03545)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**| 880 | |**2022-11-07**|**Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder**|Jan Melechovsky et.al.|[2211.03316](http://arxiv.org/abs/2211.03316)|**[link](https://github.com/dapwner/cvae-tacotron)**| 881 | |**2022-11-06**|**Parallel Attention Forcing for Machine Translation**|Qingyun Dou et.al.|[2211.03237](http://arxiv.org/abs/2211.03237)|null| 882 | |**2022-11-06**|**An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space**|Jihwan Lee et.al.|[2211.03078](http://arxiv.org/abs/2211.03078)|null| 883 | |**2022-11-04**|**NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS**|Dongchao Yang et.al.|[2211.02448](http://arxiv.org/abs/2211.02448)|null| 884 | |**2022-11-04**|**Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts**|Detai Xin et.al.|[2211.02336](http://arxiv.org/abs/2211.02336)|null| 885 | |**2023-04-16**|**Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS**|Ziqi Liang et.al.|[2211.01948](http://arxiv.org/abs/2211.01948)|null| 886 | |**2022-11-01**|**Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages**|Anusha Prakash et.al.|[2211.01338](http://arxiv.org/abs/2211.01338)|null| 887 | |**2023-05-28**|**DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP**|Kun Song et.al.|[2211.01087](http://arxiv.org/abs/2211.01087)|null| 888 | |**2022-11-22**|**Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement**|Wei Song et.al.|[2211.00967](http://arxiv.org/abs/2211.00967)|null| 889 | |**2022-11-01**|**Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers**|Cheng-Ping Hsieh et.al.|[2211.00585](http://arxiv.org/abs/2211.00585)|**[link](https://github.com/NVIDIA/NeMo)**| 890 | |**2023-06-11**|**Generating Multilingual Gender-Ambiguous Text-to-Speech Voices**|Konstantinos Markopoulos et.al.|[2211.00375](http://arxiv.org/abs/2211.00375)|null| 891 | |**2023-05-07**|**Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features**|Alexandra Vioni et.al.|[2211.00342](http://arxiv.org/abs/2211.00342)|null| 892 | |**2022-11-02**|**Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS**|Kun Song et.al.|[2210.17349](http://arxiv.org/abs/2210.17349)|null| 893 | |**2024-02-27**|**Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation**|Nikolaos Ellinas et.al.|[2210.17264](http://arxiv.org/abs/2210.17264)|null| 894 | |**2022-10-31**|**Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection**|Luigi Attorresi et.al.|[2210.17222](http://arxiv.org/abs/2210.17222)|null| 895 | |**2022-10-31**|**Structured State Space Decoder for Speech Recognition and Synthesis**|Koichi Miyazaki et.al.|[2210.17098](http://arxiv.org/abs/2210.17098)|null| 896 | |**2022-10-28**|**Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders**|Jason Fong et.al.|[2210.16045](http://arxiv.org/abs/2210.16045)|null| 897 | |**2023-02-21**|**Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform**|Masaya Kawamura et.al.|[2210.15975](http://arxiv.org/abs/2210.15975)|**[link](https://github.com/masayakawamura/mb-istft-vits)**| 898 | |**2023-02-22**|**Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis**|Yuma Shirahata et.al.|[2210.15964](http://arxiv.org/abs/2210.15964)|null| 899 | |**2022-10-28**|**Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation**|Nobuyuki Morioka et.al.|[2210.15868](http://arxiv.org/abs/2210.15868)|null| 900 | |**2023-03-15**|**Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech**|Takaaki Saeki et.al.|[2210.15447](http://arxiv.org/abs/2210.15447)|null| 901 | |**2022-10-27**|**Explicit Intensity Control for Accented Text-to-speech**|Rui Liu et.al.|[2210.15364](http://arxiv.org/abs/2210.15364)|null| 902 | |**2022-10-27**|**FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis**|Yifan Hu et.al.|[2210.15360](http://arxiv.org/abs/2210.15360)|**[link](https://github.com/walker-hyf/fctalker)**| 903 | |**2022-10-26**|**Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection**|Kentaro Seki et.al.|[2210.14850](http://arxiv.org/abs/2210.14850)|null| 904 | |**2022-10-25**|**Semi-Supervised Learning Based on Reference Model for Low-resource TTS**|Xulong Zhang et.al.|[2210.14723](http://arxiv.org/abs/2210.14723)|null| 905 | |**2022-10-26**|**Cover Reproducible Steganography via Deep Generative Models**|Kejiang Chen et.al.|[2210.14632](http://arxiv.org/abs/2210.14632)|null| 906 | |**2022-10-26**|**Improving Speech-to-Speech Translation Through Unlabeled Text**|Xuan-Phi Nguyen et.al.|[2210.14514](http://arxiv.org/abs/2210.14514)|null| 907 | |**2022-10-26**|**The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge**|Yuhao Liang et.al.|[2210.14448](http://arxiv.org/abs/2210.14448)|null| 908 | |**2022-10-25**|**Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data**|Xulong Zhang et.al.|[2210.13803](http://arxiv.org/abs/2210.13803)|null| 909 | |**2023-09-17**|**HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**|Chunhui Wang et.al.|[2210.12740](http://arxiv.org/abs/2210.12740)|null| 910 | |**2022-10-21**|**Low-Resource Multilingual and Zero-Shot Multispeaker TTS**|Florian Lux et.al.|[2210.12223](http://arxiv.org/abs/2210.12223)|**[link](https://github.com/digitalphonetics/ims-toucan)**| 911 | |**2022-10-21**|**Adaptive re-calibration of channel-wise features for Adversarial Audio Classification**|Vardhan Dongre et.al.|[2210.11722](http://arxiv.org/abs/2210.11722)|null| 912 | |**2022-10-20**|**Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS**|Chunyu Qiang et.al.|[2210.11429](http://arxiv.org/abs/2210.11429)|null| 913 | |**2022-10-17**|**Towards Relation Extraction From Speech**|Tongtong Wu et.al.|[2210.08759](http://arxiv.org/abs/2210.08759)|**[link](https://github.com/wutong8023/speechre)**| 914 | |**2023-02-08**|**Generating Synthetic Speech from SpokenVocab for Speech Translation**|Jinming Zhao et.al.|[2210.08174](http://arxiv.org/abs/2210.08174)|**[link](https://github.com/mingzi151/spokenvocab)**| 915 | |**2022-10-17**|**LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge**|Yan Jia et.al.|[2210.07749](http://arxiv.org/abs/2210.07749)|null| 916 | |**2022-10-20**|**Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy**|Sarina Meyer et.al.|[2210.07002](http://arxiv.org/abs/2210.07002)|**[link](https://github.com/digitalphonetics/speaker-anonymization)**| 917 | |**2022-10-13**|**Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar**|Aolan Sun et.al.|[2210.06877](http://arxiv.org/abs/2210.06877)|null| 918 | |**2022-10-12**|**Can we use Common Voice to train a Multi-Speaker TTS system?**|Sewade Ogun et.al.|[2210.06370](http://arxiv.org/abs/2210.06370)|null| 919 | |**2023-06-01**|**SQuId: Measuring Speech Naturalness in Many Languages**|Thibault Sellam et.al.|[2210.06324](http://arxiv.org/abs/2210.06324)|null| 920 | |**2022-11-22**|**Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech**|Byoung Jin Choi et.al.|[2210.05979](http://arxiv.org/abs/2210.05979)|null| 921 | |**2022-10-06**|**An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era**|Andreas Triantafyllopoulos et.al.|[2210.03538](http://arxiv.org/abs/2210.03538)|null| 922 | |**2022-09-29**|**Facial Landmark Predictions with Applications to Metaverse**|Qiao Han et.al.|[2209.14698](http://arxiv.org/abs/2209.14698)|**[link](https://github.com/sweatybridge/text-to-anime)**| 923 | |**2022-09-26**|**Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech**|Yusuke Nakai et.al.|[2209.12549](http://arxiv.org/abs/2209.12549)|null| 924 | |**2022-09-22**|**EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models**|Perry Lam et.al.|[2209.10890](http://arxiv.org/abs/2209.10890)|null| 925 | |**2022-09-22**|**MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline**|Yifan Hu et.al.|[2209.10848](http://arxiv.org/abs/2209.10848)|**[link](https://github.com/walker-hyf/mntts)**| 926 | |**2022-09-22**|**Controllable Accented Text-to-Speech Synthesis**|Rui Liu et.al.|[2209.10804](http://arxiv.org/abs/2209.10804)|null| 927 | |**2022-09-16**|**TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection**|Davide Salvi et.al.|[2209.08000](http://arxiv.org/abs/2209.08000)|null| 928 | |**2022-09-14**|**Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset**|Michael Chinen et.al.|[2209.06358](http://arxiv.org/abs/2209.06358)|null| 929 | |**2022-09-08**|**SANIP: Shopping Assistant and Navigation for the visually impaired**|Shubham Deshmukh et.al.|[2209.03570](http://arxiv.org/abs/2209.03570)|null| 930 | |**2022-09-07**|**Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech**|Huu-Tien Dang et.al.|[2209.02971](http://arxiv.org/abs/2209.02971)|null| 931 | |**2022-09-02**|**Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model**|Jennifer Drexler Fox et.al.|[2209.01250](http://arxiv.org/abs/2209.01250)|null| 932 | |**2022-08-28**|**Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks**|Lev Finkelstein et.al.|[2208.13183](http://arxiv.org/abs/2208.13183)|null| 933 | |**2022-10-04**|**Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale**|Aditya Agarwal et.al.|[2208.09796](http://arxiv.org/abs/2208.09796)|null| 934 | |**2022-08-21**|**Visualising Model Training via Vowel Space for Text-To-Speech Systems**|Binu Abeysinghe et.al.|[2208.09775](http://arxiv.org/abs/2208.09775)|**[link](https://github.com/babe269/performant)**| 935 | |**2022-08-15**|**Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0**|Mohammed Salah Al-Radhi et.al.|[2208.07122](http://arxiv.org/abs/2208.07122)|null| 936 | |**2022-12-28**|**Speech Synthesis with Mixed Emotions**|Kun Zhou et.al.|[2208.05890](http://arxiv.org/abs/2208.05890)|null| 937 | |**2022-08-03**|**A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis**|Qibing Bai et.al.|[2208.02189](http://arxiv.org/abs/2208.02189)|null| 938 | |**2022-07-29**|**Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation**|Giulia Comini et.al.|[2207.14607](http://arxiv.org/abs/2207.14607)|null| 939 | |**2022-07-25**|**Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis**|Raul Fernandez et.al.|[2207.12262](http://arxiv.org/abs/2207.12262)|null| 940 | |**2022-07-01**|**A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese**|Song Zhang et.al.|[2207.12089](http://arxiv.org/abs/2207.12089)|null| 941 | |**2022-07-20**|**When Is TTS Augmentation Through a Pivot Language Useful?**|Nathaniel Robinson et.al.|[2207.09889](http://arxiv.org/abs/2207.09889)|**[link](https://github.com/n8rob/multilingual_tts_augmentation)**| 942 | |**2022-07-11**|**LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech**|Harshvardhan Anand et.al.|[2207.07118](http://arxiv.org/abs/2207.07118)|null| 943 | |**2022-07-13**|**ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech**|Rongjie Huang et.al.|[2207.06389](http://arxiv.org/abs/2207.06389)|**[link](https://github.com/Rongjiehuang/ProDiff)**| 944 | |**2022-07-13**|**Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech**|Zhengxi Liu et.al.|[2207.06088](http://arxiv.org/abs/2207.06088)|null| 945 | |**2022-07-13**|**SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate**|Nabarun Goswami et.al.|[2207.06011](http://arxiv.org/abs/2207.06011)|null| 946 | |**2022-07-13**|**Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS**|Yookyung Shin et.al.|[2207.06000](http://arxiv.org/abs/2207.06000)|null| 947 | |**2022-07-13**|**A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System**|Yi-Chiao Wu et.al.|[2207.05913](http://arxiv.org/abs/2207.05913)|null| 948 | |**2022-07-12**|**Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition**|Rodolfo Zevallos et.al.|[2207.05498](http://arxiv.org/abs/2207.05498)|null| 949 | |**2022-07-12**|**End-to-end speech recognition modeling from de-identified data**|Martin Flechl et.al.|[2207.05469](http://arxiv.org/abs/2207.05469)|null| 950 | |**2022-07-11**|**Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data**|Naoki Makishima et.al.|[2207.04659](http://arxiv.org/abs/2207.04659)|null| 951 | |**2022-07-11**|**DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders**|Yanqing Liu et.al.|[2207.04646](http://arxiv.org/abs/2207.04646)|null| 952 | |**2023-01-02**|**Dreamento: an open-source dream engineering toolbox for sleep EEG wearables**|Mahdad Jafarzadeh Esfahani et.al.|[2207.03977](http://arxiv.org/abs/2207.03977)|**[link](https://github.com/dreamento/dreamento)**| 953 | |**2022-07-07**|**BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus**|Josh Meyer et.al.|[2207.03546](http://arxiv.org/abs/2207.03546)|**[link](https://github.com/alpoktem/bible2speechdb)**| 954 | |**2022-07-05**|**Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion**|Yi Lei et.al.|[2207.01832](http://arxiv.org/abs/2207.01832)|null| 955 | |**2022-07-04**|**BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model**|Brooke Stephenson et.al.|[2207.01718](http://arxiv.org/abs/2207.01718)|null| 956 | |**2022-07-04**|**Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)**|Ariadna Sanchez et.al.|[2207.01547](http://arxiv.org/abs/2207.01547)|null| 957 | |**2022-07-04**|**Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)**|Ziyao Zhang et.al.|[2207.01507](http://arxiv.org/abs/2207.01507)|null| 958 | |**2023-03-13**|**DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech**|Keon Lee et.al.|[2207.01063](http://arxiv.org/abs/2207.01063)|**[link](https://github.com/keonlee9420/DailyTalk)**| 959 | |**2022-07-02**|**Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need**|Daniel Korzekwa et.al.|[2207.00774](http://arxiv.org/abs/2207.00774)|null| 960 | |**2022-07-01**|**Building African Voices**|Perez Ogayo et.al.|[2207.00688](http://arxiv.org/abs/2207.00688)|**[link](https://github.com/neulab/africanvoices)**| 961 | |**2022-07-01**|**Automatic Evaluation of Speaker Similarity**|Deja Kamil et.al.|[2207.00344](http://arxiv.org/abs/2207.00344)|null| 962 | |**2022-08-03**|**Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding**|Wei-Ping Huang et.al.|[2206.15427](http://arxiv.org/abs/2206.15427)|null| 963 | |**2022-06-30**|**R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS**|Kyle Kastner et.al.|[2206.15276](http://arxiv.org/abs/2206.15276)|null| 964 | |**2022-07-01**|**Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems**|Hyun-Wook Yoon et.al.|[2206.15067](http://arxiv.org/abs/2206.15067)|null| 965 | |**2022-06-30**|**TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder**|Eunwoo Song et.al.|[2206.14984](http://arxiv.org/abs/2206.14984)|null| 966 | |**2022-06-29**|**Improving Deliberation by Text-Only and Semi-Supervised Training**|Ke Hu et.al.|[2206.14716](http://arxiv.org/abs/2206.14716)|null| 967 | |**2022-06-29**|**Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody**|Peter Makarov et.al.|[2206.14643](http://arxiv.org/abs/2206.14643)|null| 968 | |**2022-06-28**|**Expressive, Variable, and Controllable Duration Modelling in TTS**|Ammar Abbas et.al.|[2206.14165](http://arxiv.org/abs/2206.14165)|null| 969 | |**2022-06-28**|**Comparison of Speech Representations for the MOS Prediction System**|Aki Kunikoshi et.al.|[2206.13817](http://arxiv.org/abs/2206.13817)|null| 970 | |**2022-06-22**|**A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data**|Raviraj Joshi et.al.|[2206.13240](http://arxiv.org/abs/2206.13240)|null| 971 | |**2022-06-25**|**Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations**|Chin-Cheng Hsu et.al.|[2206.12662](http://arxiv.org/abs/2206.12662)|null| 972 | |**2022-10-21**|**Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech**|Florian Lux et.al.|[2206.12229](http://arxiv.org/abs/2206.12229)|**[link](https://github.com/digitalphonetics/ims-toucan)**| 973 | |**2022-06-24**|**SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech**|Hyunjae Cho et.al.|[2206.12132](http://arxiv.org/abs/2206.12132)|null| 974 | |**2022-06-24**|**End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue**|Kentaro Mitsui et.al.|[2206.12040](http://arxiv.org/abs/2206.12040)|null| 975 | |**2022-05-29**|**Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning**|Sameea Naeem et.al.|[2206.11860](http://arxiv.org/abs/2206.11860)|null| 976 | |**2022-06-21**|**Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS**|Kenta Udagawa et.al.|[2206.10256](http://arxiv.org/abs/2206.10256)|null| 977 | |**2022-06-24**|**Towards Optimizing OCR for Accessibility**|Peya Mowar et.al.|[2206.10254](http://arxiv.org/abs/2206.10254)|null| 978 | |**2022-06-16**|**Automatic Prosody Annotation with Pre-Trained Text-Speech Model**|Ziqian Dai et.al.|[2206.07956](http://arxiv.org/abs/2206.07956)|**[link](https://github.com/daisyqk/automatic-prosody-annotation)**| 979 | |**2022-11-16**|**NatiQ: An End-to-end Text-to-Speech System for Arabic**|Ahmed Abdelali et.al.|[2206.07373](http://arxiv.org/abs/2206.07373)|null| 980 | |**2022-06-15**|**Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning**|Rui Liu et.al.|[2206.07229](http://arxiv.org/abs/2206.07229)|**[link](https://github.com/ttslr/strengthnet)**| 981 | |**2022-12-12**|**A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation**|Junhui Zhang et.al.|[2206.04922](http://arxiv.org/abs/2206.04922)|null| 982 | |**2022-06-09**|**Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos**|Alexander Waibel et.al.|[2206.04523](http://arxiv.org/abs/2206.04523)|null| 983 | |**2022-06-07**|**FlexLip: A Controllable Text-to-Lip System**|Dan Oneata et.al.|[2206.03206](http://arxiv.org/abs/2206.03206)|null| 984 | |**2022-10-11**|**UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder**|Jiachen Lian et.al.|[2206.02512](http://arxiv.org/abs/2206.02512)|null| 985 | |**2023-10-19**|**Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech**|Ziyue Jiang et.al.|[2206.02147](http://arxiv.org/abs/2206.02147)|**[link](https://github.com/zain-jiang/dict-tts)**| 986 | |**2022-11-02**|**AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation**|Kun Song et.al.|[2206.00208](http://arxiv.org/abs/2206.00208)|null| 987 | |**2022-05-31**|**Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish**|Alp Öktem et.al.|[2205.15599](http://arxiv.org/abs/2205.15599)|**[link](https://github.com/collectivat-dev/ladino_tts)**| 988 | |**2023-11-20**|**StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis**|Yinghao Aaron Li et.al.|[2205.15439](http://arxiv.org/abs/2205.15439)|**[link](https://github.com/yl4579/StyleTTS)**| 989 | |**2022-05-30**|**Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data**|Sungwon Kim et.al.|[2205.15370](http://arxiv.org/abs/2205.15370)|null| 990 | |**2022-05-26**|**QSpeech: Low-Qubit Quantum Speech Application Toolkit**|Zhenhou Hong et.al.|[2205.13221](http://arxiv.org/abs/2205.13221)|**[link](https://github.com/zhenhouhong/qspeech)**| 991 | |**2022-11-10**|**T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation**|Paul-Ambroise Duquenne et.al.|[2205.12216](http://arxiv.org/abs/2205.12216)|null| 992 | |**2022-05-20**|**PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit**|Hui Zhang et.al.|[2205.12007](http://arxiv.org/abs/2205.12007)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**| 993 | |**2022-05-24**|**TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS**|Xulong Zhang et.al.|[2205.11824](http://arxiv.org/abs/2205.11824)|null| 994 | |**2022-10-12**|**GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech**|Rongjie Huang et.al.|[2205.07211](http://arxiv.org/abs/2205.07211)|**[link](https://github.com/Rongjiehuang/GenerSpeech)**| 995 | |**2022-05-13**|**Talking Face Generation with Multilingual TTS**|Hyoung-Kyu Song et.al.|[2205.06421](http://arxiv.org/abs/2205.06421)|null| 996 | |**2022-05-10**|**NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality**|Xu Tan et.al.|[2205.04421](http://arxiv.org/abs/2205.04421)|**[link](https://github.com/microsoft/NeuralSpeech)**| 997 | |**2022-05-09**|**Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech**|Yang Li et.al.|[2205.04120](http://arxiv.org/abs/2205.04120)|**[link](https://github.com/neurowave-ai/cucvae-tts)**| 998 | |**2022-05-09**|**ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence**|Sangshin Oh et.al.|[2205.04104](http://arxiv.org/abs/2205.04104)|null| 999 | |**2022-07-14**|**Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss**|Efthymios Georgiou et.al.|[2204.13437](http://arxiv.org/abs/2204.13437)|null| 1000 | |**2022-04-25**|**SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech**|Zhenhui Ye et.al.|[2204.11792](http://arxiv.org/abs/2204.11792)|null| 1001 | |**2022-04-22**|**LibriS2S: A German-English Speech-to-Speech Translation Corpus**|Pedro Jeuris et.al.|[2204.10593](http://arxiv.org/abs/2204.10593)|**[link](https://github.com/pedrodke/libris2s)**| 1002 | |**2022-07-05**|**Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation**|Ryo Terashima et.al.|[2204.10020](http://arxiv.org/abs/2204.10020)|null| 1003 | |**2022-04-21**|**FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis**|Rongjie Huang et.al.|[2204.09934](http://arxiv.org/abs/2204.09934)|**[link](https://github.com/Rongjiehuang/FastDiff)**| 1004 | |**2022-04-20**|**Audio Deep Fake Detection System with Neural Stitching for ADD 2022**|Rui Yan et.al.|[2204.08720](http://arxiv.org/abs/2204.08720)|null| 1005 | |**2022-04-14**|**Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech**|Cong Zhang et.al.|[2204.07228](http://arxiv.org/abs/2204.07228)|null| 1006 | |**2022-12-09**|**Study of Indian English Pronunciation Variabilities relative to Received Pronunciation**|Priyanshi Pal et.al.|[2204.06502](http://arxiv.org/abs/2204.06502)|null| 1007 | |**2022-04-12**|**Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch**|Hanbin Bae et.al.|[2204.05753](http://arxiv.org/abs/2204.05753)|null| 1008 | |**2023-01-30**|**The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance**|Lin Zhang et.al.|[2204.05177](http://arxiv.org/abs/2204.05177)|null| 1009 | |**2022-10-27**|**Fine-grained Noise Control for Multispeaker Speech Synthesis**|Karolos Nikitaras et.al.|[2204.05070](http://arxiv.org/abs/2204.05070)|null| 1010 | |**2022-08-31**|**Karaoker: Alignment-free singing voice synthesis with speech training data**|Panos Kakoulidis et.al.|[2204.04127](http://arxiv.org/abs/2204.04127)|null| 1011 | |**2022-08-15**|**Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech**|Jae-Sung Bae et.al.|[2204.04004](http://arxiv.org/abs/2204.04004)|null| 1012 | |**2022-04-07**|**Arabic Text-To-Speech (TTS) Data Preparation**|Hala Al Masri et.al.|[2204.03255](http://arxiv.org/abs/2204.03255)|null| 1013 | |**2022-04-07**|**Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis**|Yutian Wang et.al.|[2204.03238](http://arxiv.org/abs/2204.03238)|null| 1014 | |**2022-08-24**|**SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis**|Georgia Maniati et.al.|[2204.03040](http://arxiv.org/abs/2204.03040)|null| 1015 | |**2022-09-13**|**Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation**|Sravya Popuri et.al.|[2204.02967](http://arxiv.org/abs/2204.02967)|null| 1016 | |**2022-07-02**|**Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification**|Jin Woo Lee et.al.|[2204.02639](http://arxiv.org/abs/2204.02639)|null| 1017 | |**2023-08-28**|**Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech**|Hyungchan Yoon et.al.|[2204.02172](http://arxiv.org/abs/2204.02172)|null| 1018 | |**2022-09-07**|**Deliberation Model for On-Device Spoken Language Understanding**|Duc Le et.al.|[2204.01893](http://arxiv.org/abs/2204.01893)|null| 1019 | |**2022-12-14**|**Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck**|Youngsik Eom et.al.|[2204.01387](http://arxiv.org/abs/2204.01387)|null| 1020 | |**2022-11-11**|**Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis**|Yixuan Zhou et.al.|[2204.00990](http://arxiv.org/abs/2204.00990)|null| 1021 | |**2022-06-30**|**VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature**|Chenpeng Du et.al.|[2204.00768](http://arxiv.org/abs/2204.00768)|null| 1022 | |**2022-04-01**|**AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios**|Yihan Wu et.al.|[2204.00436](http://arxiv.org/abs/2204.00436)|null| 1023 | |**2022-04-01**|**Text-To-Speech Data Augmentation for Low Resource Speech Recognition**|Rodolfo Zevallos et.al.|[2204.00291](http://arxiv.org/abs/2204.00291)|null| 1024 | |**2022-07-19**|**Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech**|Guangyan Zhang et.al.|[2203.17190](http://arxiv.org/abs/2203.17190)|null| 1025 | |**2022-03-31**|**An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer**|Wenlin Dai et.al.|[2203.16954](http://arxiv.org/abs/2203.16954)|**[link](https://github.com/thuhcsi/flattn)**| 1026 | |**2022-07-11**|**WavThruVec: Latent speech representation as intermediate features for neural speech synthesis**|Hubert Siuzdak et.al.|[2203.16930](http://arxiv.org/abs/2203.16930)|null| 1027 | |**2022-03-31**|**A Character-level Span-based Model for Mandarin Prosodic Structure Prediction**|Xueyuan Chen et.al.|[2203.16922](http://arxiv.org/abs/2203.16922)|**[link](https://github.com/thuhcsi/spanpsp)**| 1028 | |**2022-07-01**|**JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech**|Dan Lim et.al.|[2203.16852](http://arxiv.org/abs/2203.16852)|**[link](https://github.com/imdanboy/jets)**| 1029 | |**2022-03-31**|**Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset**|Zehui Yang et.al.|[2203.16844](http://arxiv.org/abs/2203.16844)|null| 1030 | |**2022-03-31**|**NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism**|Jingbei Li et.al.|[2203.16838](http://arxiv.org/abs/2203.16838)|**[link](https://github.com/thuhcsi/neufa)**| 1031 | |**2022-03-31**|**Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition**|Anirudh Gupta et.al.|[2203.16823](http://arxiv.org/abs/2203.16823)|null| 1032 | |**2022-04-21**|**Does Audio Deepfake Detection Generalize?**|Nicolas M. Müller et.al.|[2203.16263](http://arxiv.org/abs/2203.16263)|null| 1033 | |**2022-03-30**|**End to End Lip Synchronization with a Temporal AutoEncoder**|Yoav Shalev et.al.|[2203.16224](http://arxiv.org/abs/2203.16224)|**[link](https://github.com/itsyoavshalev/end-to-end-lip-synchronization-with-a-temporal-autoencoder)**| 1034 | |**2022-08-15**|**Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition**|Junrui Ni et.al.|[2203.15796](http://arxiv.org/abs/2203.15796)|**[link](https://github.com/lwang114/unsuptts)**| 1035 | |**2022-06-29**|**DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning**|Takaaki Saeki et.al.|[2203.15683](http://arxiv.org/abs/2203.15683)|null| 1036 | |**2022-11-05**|**Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation**|Rendi Chevi et.al.|[2203.15643](http://arxiv.org/abs/2203.15643)|**[link](https://github.com/rendchevi/nix-tts)**| 1037 | |**2022-10-06**|**Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus**|Minchan Kim et.al.|[2203.15447](http://arxiv.org/abs/2203.15447)|null| 1038 | |**2022-07-11**|**VoiceMe: Personalized voice generation in TTS**|Pol van Rijn et.al.|[2203.15379](http://arxiv.org/abs/2203.15379)|**[link](https://github.com/polvanrijn/voiceme)**| 1039 | 1040 | [contributors-shield]: https://img.shields.io/github/contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge 1041 | [contributors-url]: https://github.com/Vincentqyw/cv-arxiv-daily/graphs/contributors 1042 | [forks-shield]: https://img.shields.io/github/forks/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge 1043 | [forks-url]: https://github.com/Vincentqyw/cv-arxiv-daily/network/members 1044 | [stars-shield]: https://img.shields.io/github/stars/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge 1045 | [stars-url]: https://github.com/Vincentqyw/cv-arxiv-daily/stargazers 1046 | [issues-shield]: https://img.shields.io/github/issues/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge 1047 | [issues-url]: https://github.com/Vincentqyw/cv-arxiv-daily/issues 1048 | 1049 | --------------------------------------------------------------------------------