├── docs
│   ├── tts-arxiv-daily-wechat.json
│   ├── _config.yml
│   └── index.md
├── requirements.txt
├── assets
│   ├── 4-ga-2-1.png
│   ├── 4-ga-3-1.png
│   ├── 4-ga-5-1.png
│   ├── 4-ga-7.png
│   ├── 4-ga-8.png
│   ├── 4-ga-9.png
│   └── 5-pages-1.png
├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── config.yml
│   │   ├── feature_request.md
│   │   ├── question.md
│   │   └── bug_report.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows
│       ├── cv-arxiv-daily.yml
│       └── update_paper_links.yml
├── .gitignore
├── config.yaml
├── CODE_OF_CONDUCT.md
├── LICENSE
└── daily_arxiv.py
/docs/tts-arxiv-daily-wechat.json:
--------------------------------------------------------------------------------
1 | {}
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | arxiv
3 | pyyaml
--------------------------------------------------------------------------------
/assets/4-ga-2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-2-1.png
--------------------------------------------------------------------------------
/assets/4-ga-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-3-1.png
--------------------------------------------------------------------------------
/assets/4-ga-5-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-5-1.png
--------------------------------------------------------------------------------
/assets/4-ga-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-7.png
--------------------------------------------------------------------------------
/assets/4-ga-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-8.png
--------------------------------------------------------------------------------
/assets/4-ga-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-9.png
--------------------------------------------------------------------------------
/assets/5-pages-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/5-pages-1.png
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/config.yml:
--------------------------------------------------------------------------------
1 | # Configuration: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository
2 |
3 | blank_issues_enabled: false
4 |
--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | ## Description
2 |
3 |
4 |
5 | ## Related Issue
6 |
7 |
8 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | CMakeLists.txt.user
2 | CMakeLists_modified.txt
3 |
4 | .DS_Store
5 |
6 | build/
7 |
8 | lib/
9 | bin/
10 |
11 | cmake_modules/
12 | cmake-build-debug/
13 | .idea/
14 | .vscode/
15 | *.pyc
16 |
17 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: 🚀 Feature request
3 | about: Suggest an idea for this project 🏖
4 | title: ""
5 | labels: enhancement
6 | assignees:
7 | ---
8 |
9 | ## 🚀 Feature Request
10 |
11 |
12 |
13 | ## 📎 Additional context
14 |
15 |
16 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: ❓ Question
3 | about: Ask a question about this project 🎓
4 | title: ""
5 | labels: question
6 | assignees:
7 | ---
8 |
9 | ## Checklist
10 |
11 |
12 |
13 | - [ ] I've searched the project's [`issues`]
14 |
15 | ## ❓ Question
16 |
17 |
18 |
19 | How can I [...]?
20 |
21 | Is it possible to [...]?
22 |
23 | ## 📎 Additional context
24 |
25 |
26 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: 🐛 Bug report
3 | about: If something isn't working 🔧
4 | title: ""
5 | labels: bug
6 | assignees:
7 | ---
8 |
9 | ## 🐛 Bug Report
10 |
11 |
12 |
13 | ## 🔬 How To Reproduce
14 |
15 | Steps to reproduce the behavior:
16 |
17 | 1. ...
18 |
19 | ### Environment
20 |
21 | - OS: [e.g. Linux / Windows / macOS]
22 | - Python version, get it with:
23 |
24 | ```bash
25 | python --version
26 | ```
27 |
28 | ## 📎 Additional context
29 |
30 |
31 |
--------------------------------------------------------------------------------
/docs/_config.yml:
--------------------------------------------------------------------------------
1 | title: CV Arxiv Daily
2 | description: Automatically update CV papers daily using GitHub Actions (updates every 12 hours)
3 | show_downloads: true
4 |
5 | github:
6 |   zip_url: https://github.com/Vincentqyw/cv-arxiv-daily
7 |   another_url: https://github.com/Vincentqyw/cv-arxiv-daily
8 |
9 | ## add remote theme
10 | remote_theme: jekyll/minima@v2.5.1
11 |
12 | # minima:
13 | skin: dark
14 | social_links:
15 |   twitter: AlphaRealcat
16 |   github: vincentqyw
17 |
18 | plugins:
19 |   - jekyll-remote-theme # add this line to the plugins list if you already have one
20 |   - jekyll-feed
21 |   - jekyll-seo-tag
22 |
--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | # TODO: add papers by configuration file
2 | base_url: "https://arxiv.paperswithcode.com/api/v0/papers/"
3 | user_name: "wetdog"
4 | repo_name: "TTS-arxiv-daily"
5 | show_authors: True
6 | show_links: True
7 | show_badge: True
8 | max_results: 10
9 |
10 | publish_readme: True
11 | publish_gitpage: True
12 | publish_wechat: False
13 |
14 | # file paths
15 | json_readme_path: './docs/tts-arxiv-daily.json'
16 | json_gitpage_path: './docs/tts-arxiv-daily-web.json'
17 | json_wechat_path: './docs/tts-arxiv-daily-wechat.json'
18 |
19 | md_readme_path: 'README.md'
20 | md_gitpage_path: './docs/index.md'
21 | md_wechat_path: './docs/wechat.md'
22 |
23 | # keywords to search
24 | keywords:
25 |   "TTS":
26 |     filters: ["TTS", "Text to speech"]
27 |
--------------------------------------------------------------------------------
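The `keywords` section of config.yaml is turned into an arXiv search string by `load_config` in daily_arxiv.py. A minimal sketch of that step, assuming the filters are meant to be combined with a space-separated `OR` and that multi-word filters should be quoted as phrases; `build_query` is a hypothetical helper name used only for illustration, not a function from the repository:

```python
def build_query(filters):
    """Join keyword filters into one search string (illustrative helper)."""
    parts = []
    for term in filters:
        # Quote multi-word terms so they are searched as a single phrase.
        parts.append(f'"{term}"' if len(term.split()) > 1 else term)
    # Assumption: terms are alternatives, so they are joined with " OR ".
    return " OR ".join(parts)


query = build_query(["TTS", "Text to speech"])
print(query)  # TTS OR "Text to speech"
```

Joining with surrounding spaces matters here: without them the first term and the operator would run together into a single token.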
/.github/workflows/cv-arxiv-daily.yml:
--------------------------------------------------------------------------------
1 | # This is a basic workflow to help you get started with Actions
2 |
3 | name: Run Arxiv Papers Daily
4 |
5 | # Controls when the workflow will run
6 | on:
7 |   # Allows you to run this workflow manually from the Actions tab
8 |   workflow_dispatch:
9 |   schedule:
10 |     - cron: "0 0/12 * * *" #'*/60 * * * *'
11 |   # Triggers the workflow on push or pull request events but only for the main branch
12 |   # push:
13 |   #   branches:
14 |   #   - main
15 |
16 | env:
17 |
18 |   GITHUB_USER_NAME: wetdog
19 |   GITHUB_USER_EMAIL: jose091@gmail.com
20 |
21 |
22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel
23 | jobs:
24 |   # This workflow contains a single job called "build"
25 |   build:
26 |     name: update
27 |     # The type of runner that the job will run on
28 |     runs-on: ubuntu-latest
29 |
30 |     # Steps represent a sequence of tasks that will be executed as part of the job
31 |     steps:
32 |       - name: Checkout
33 |         uses: actions/checkout@v3
34 |
35 |       - name: Set up Python Env
36 |         uses: actions/setup-python@v4
37 |         with:
38 |           python-version: '3.10'
39 |           #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
40 |       - name: Install dependencies
41 |         run: |
42 |           python -m pip install --upgrade pip
43 |           pip install arxiv
44 |           pip install requests
45 |           pip install pyyaml
46 |
47 |       - name: Run daily arxiv
48 |         run: |
49 |           python daily_arxiv.py
50 |
51 |       - name: Push new cv-arxiv-daily.md
52 |         uses: github-actions-x/commit@v2.9
53 |         with:
54 |           github-token: ${{ secrets.GITHUB_TOKEN }}
55 |           commit-message: "Github Action Automatic Update TTS Arxiv Papers"
56 |           files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md
57 |           rebase: 'true'
58 |           name: ${{ env.GITHUB_USER_NAME }}
59 |           email: ${{ env.GITHUB_USER_EMAIL }}
60 |
--------------------------------------------------------------------------------
/.github/workflows/update_paper_links.yml:
--------------------------------------------------------------------------------
1 | # This is a basic workflow to help you get started with Actions
2 |
3 | name: Run Update Paper Links Weekly
4 |
5 | # Controls when the workflow will run
6 | on:
7 |   # Allows you to run this workflow manually from the Actions tab
8 |   workflow_dispatch:
9 |   schedule:
10 |     - cron: "0 8 * * 1" # Run at 08:00 on Monday
11 |   # Triggers the workflow on push or pull request events but only for the main branch
12 |   # push:
13 |   #   branches:
14 |   #   - main
15 |
16 | env:
17 |
18 |   GITHUB_USER_NAME: wetdog
19 |   GITHUB_USER_EMAIL: jose091@gmail.com
20 |
21 |
22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel
23 | jobs:
24 |   # This workflow contains a single job called "build"
25 |   build:
26 |     name: update
27 |     # The type of runner that the job will run on
28 |     runs-on: ubuntu-latest
29 |
30 |     # Steps represent a sequence of tasks that will be executed as part of the job
31 |     steps:
32 |       - name: Checkout
33 |         uses: actions/checkout@v3
34 |
35 |       - name: Set up Python Env
36 |         uses: actions/setup-python@v4
37 |         with:
38 |           python-version: '3.10'
39 |           #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
40 |       - name: Install dependencies
41 |         run: |
42 |           python -m pip install --upgrade pip
43 |           pip install arxiv
44 |           pip install requests
45 |           pip install pyyaml
46 |
47 |       - name: Run daily arxiv
48 |         run: |
49 |           python daily_arxiv.py --update_paper_links
50 |
51 |       - name: Push new tts-arxiv-daily.md
52 |         uses: github-actions-x/commit@v2.9
53 |         with:
54 |           github-token: ${{ secrets.GITHUB_TOKEN }}
55 |           commit-message: "Github Action Automatic Update TTS Arxiv Papers"
56 |           files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md
57 |           rebase: 'true'
58 |           name: ${{ env.GITHUB_USER_NAME }}
59 |           email: ${{ env.GITHUB_USER_EMAIL }}
60 |
--------------------------------------------------------------------------------
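The weekly workflow above runs `python daily_arxiv.py --update_paper_links`, and daily_arxiv.py imports `argparse`. A minimal sketch of how such a flag could be parsed; the `--config_path` option, its default, and the help texts are assumptions made for illustration and are not confirmed by the files shown here:

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (illustrative sketch, not the repository's code)."""
    parser = argparse.ArgumentParser(description="Collect arXiv papers by keyword")
    # Assumption: the configuration file location is configurable.
    parser.add_argument("--config_path", type=str, default="config.yaml",
                        help="path to the YAML configuration file")
    # Boolean switch: present means True, absent means False.
    parser.add_argument("--update_paper_links", action="store_true",
                        help="refresh code links for papers already collected")
    return parser.parse_args(argv)


args = parse_args(["--update_paper_links"])
print(args.update_paper_links)  # True
```

With `action="store_true"` the daily workflow can call the script with no arguments and get the default behaviour, while the weekly workflow opts in to the link refresh.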
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | We as members, contributors, and leaders pledge to make participation in our
6 | community a harassment-free experience for everyone, regardless of age, body
7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
8 | identity and expression, level of experience, education, socio-economic status,
9 | nationality, personal appearance, race, religion, or sexual identity
10 | and orientation.
11 |
12 | We pledge to act and interact in ways that contribute to an open, welcoming,
13 | diverse, inclusive, and healthy community.
14 |
15 | ## Our Standards
16 |
17 | Examples of behavior that contributes to a positive environment for our
18 | community include:
19 |
20 | * Demonstrating empathy and kindness toward other people
21 | * Being respectful of differing opinions, viewpoints, and experiences
22 | * Giving and gracefully accepting constructive feedback
23 | * Accepting responsibility and apologizing to those affected by our mistakes,
24 | and learning from the experience
25 | * Focusing on what is best not just for us as individuals, but for the
26 | overall community
27 |
28 | Examples of unacceptable behavior include:
29 |
30 | * The use of sexualized language or imagery, and sexual attention or
31 | advances of any kind
32 | * Trolling, insulting or derogatory comments, and personal or political attacks
33 | * Public or private harassment
34 | * Publishing others' private information, such as a physical or email
35 | address, without their explicit permission
36 | * Other conduct which could reasonably be considered inappropriate in a
37 | professional setting
38 |
39 | ## Enforcement Responsibilities
40 |
41 | Community leaders are responsible for clarifying and enforcing our standards of
42 | acceptable behavior and will take appropriate and fair corrective action in
43 | response to any behavior that they deem inappropriate, threatening, offensive,
44 | or harmful.
45 |
46 | Community leaders have the right and responsibility to remove, edit, or reject
47 | comments, commits, code, wiki edits, issues, and other contributions that are
48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
49 | decisions when appropriate.
50 |
51 | ## Scope
52 |
53 | This Code of Conduct applies within all community spaces, and also applies when
54 | an individual is officially representing the community in public spaces.
55 | Examples of representing our community include using an official e-mail address,
56 | posting via an official social media account, or acting as an appointed
57 | representative at an online or offline event.
58 |
59 | ## Enforcement
60 |
61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
62 | reported to the community leaders responsible for enforcement at
63 | alpharealcat@gmail.com.
64 | All complaints will be reviewed and investigated promptly and fairly.
65 |
66 | All community leaders are obligated to respect the privacy and security of the
67 | reporter of any incident.
68 |
69 | ## Enforcement Guidelines
70 |
71 | Community leaders will follow these Community Impact Guidelines in determining
72 | the consequences for any action they deem in violation of this Code of Conduct:
73 |
74 | ### 1. Correction
75 |
76 | **Community Impact**: Use of inappropriate language or other behavior deemed
77 | unprofessional or unwelcome in the community.
78 |
79 | **Consequence**: A private, written warning from community leaders, providing
80 | clarity around the nature of the violation and an explanation of why the
81 | behavior was inappropriate. A public apology may be requested.
82 |
83 | ### 2. Warning
84 |
85 | **Community Impact**: A violation through a single incident or series
86 | of actions.
87 |
88 | **Consequence**: A warning with consequences for continued behavior. No
89 | interaction with the people involved, including unsolicited interaction with
90 | those enforcing the Code of Conduct, for a specified period of time. This
91 | includes avoiding interactions in community spaces as well as external channels
92 | like social media. Violating these terms may lead to a temporary or
93 | permanent ban.
94 |
95 | ### 3. Temporary Ban
96 |
97 | **Community Impact**: A serious violation of community standards, including
98 | sustained inappropriate behavior.
99 |
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 |
106 | ### 4. Permanent Ban
107 |
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 |
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 |
115 | ## Attribution
116 |
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120 |
121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct
122 | enforcement ladder](https://github.com/mozilla/diversity).
123 |
124 | [homepage]: https://www.contributor-covenant.org
125 |
126 | For answers to common questions about this code of conduct, see the FAQ at
127 | https://www.contributor-covenant.org/faq. Translations are available at
128 | https://www.contributor-covenant.org/translations.
129 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/daily_arxiv.py:
--------------------------------------------------------------------------------
1 | import os
2 | import re
3 | import json
4 | import arxiv
5 | import yaml
6 | import logging
7 | import argparse
8 | import datetime
9 | import requests
10 |
11 | logging.basicConfig(format='[%(asctime)s %(levelname)s] %(message)s',
12 |                     datefmt='%m/%d/%Y %H:%M:%S',
13 |                     level=logging.INFO)
14 |
15 | base_url = "https://arxiv.paperswithcode.com/api/v0/papers/"
16 | github_url = "https://api.github.com/search/repositories"
17 | arxiv_url = "http://arxiv.org/"
18 |
19 | def load_config(config_file:str) -> dict:
20 |     '''
21 |     config_file: input config file path
22 |     return: a dict of configuration
23 |     '''
24 |     # make filters pretty
25 |     def pretty_filters(**config) -> dict:
26 |         keywords = dict()
27 |         ESCAPE = '\"'
28 |         QUOTA = '' # NO-USE
29 |         OR = ' OR ' # spaces keep the operator separate from the terms
30 |         def parse_filters(filters:list):
31 |             ret = ''
32 |             for idx in range(0,len(filters)):
33 |                 term = filters[idx]
34 |                 if len(term.split()) > 1:
35 |                     ret += (ESCAPE + term + ESCAPE)
36 |                 else:
37 |                     ret += (QUOTA + term + QUOTA)
38 |                 if idx != len(filters) - 1:
39 |                     ret += OR
40 |             return ret
41 |         for k,v in config['keywords'].items():
42 |             keywords[k] = parse_filters(v['filters'])
43 |         return keywords
44 |     with open(config_file,'r') as f:
45 |         config = yaml.load(f,Loader=yaml.FullLoader)
46 |         config['kv'] = pretty_filters(**config)
47 |         logging.info(f'config = {config}')
48 |     return config
49 |
50 | def get_authors(authors, first_author = False):
51 |     output = str()
52 |     if not first_author:
53 |         output = ", ".join(str(author) for author in authors)
54 |     else:
55 |         output = str(authors[0])
56 |     return output
57 | def sort_papers(papers):
58 |     output = dict()
59 |     keys = list(papers.keys())
60 |     keys.sort(reverse=True)
61 |     for key in keys:
62 |         output[key] = papers[key]
63 |     return output
64 |
65 |
66 | def get_code_link(qword:str) -> str:
67 |     """
68 |     This short function was auto-generated by ChatGPT.
69 |     I only renamed some params and added some comments.
70 |     @param qword: query string, eg. arxiv ids and paper titles
71 |     @return paper_code in github: string, if not found, return None
72 |     """
73 |     # query = f"arxiv:{arxiv_id}"
74 |     query = f"{qword}"
75 |     params = {
76 |         "q": query,
77 |         "sort": "stars",
78 |         "order": "desc"
79 |     }
80 |     r = requests.get(github_url, params=params)
81 |     results = r.json()
82 |     code_link = None
83 |     if results["total_count"] > 0:
84 |         code_link = results["items"][0]["html_url"]
85 |     return code_link
86 |
87 | def get_daily_papers(topic,query="slam", max_results=2):
88 |     """
89 |     @param topic: str
90 |     @param query: str
91 |     @return paper_with_code: dict
92 |     """
93 |     # output
94 |     content = dict()
95 |     content_to_web = dict()
96 |     search_engine = arxiv.Search(
97 |         query = query,
98 |         max_results = max_results,
99 |         sort_by = arxiv.SortCriterion.SubmittedDate
100 |     )
101 |
102 |     for result in search_engine.results():
103 |
104 |         paper_id = result.get_short_id()
105 |         paper_title = result.title
106 |         paper_url = result.entry_id
107 |         code_url = base_url + paper_id #TODO
108 |         paper_abstract = result.summary.replace("\n"," ")
109 |         paper_authors = get_authors(result.authors)
110 |         paper_first_author = get_authors(result.authors,first_author = True)
111 |         primary_category = result.primary_category
112 |         publish_time = result.published.date()
113 |         update_time = result.updated.date()
114 |         comments = result.comment
115 |
116 |         logging.info(f"Time = {update_time} title = {paper_title} author = {paper_first_author}")
117 |
118 |         # eg: 2108.09112v1 -> 2108.09112
119 |         ver_pos = paper_id.find('v')
120 |         if ver_pos == -1:
121 |             paper_key = paper_id
122 |         else:
123 |             paper_key = paper_id[0:ver_pos]
124 |         paper_url = arxiv_url + 'abs/' + paper_key
125 |
126 |         try:
127 |             # source code link
128 |             r = requests.get(code_url).json()
129 |             repo_url = None
130 |             if "official" in r and r["official"]:
131 |                 repo_url = r["official"]["url"]
132 |             # TODO: not found, two more chances
133 |             # else:
134 |             #     repo_url = get_code_link(paper_title)
135 |             #     if repo_url is None:
136 |             #         repo_url = get_code_link(paper_key)
137 |             if repo_url is not None:
138 |                 content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|**[link]({})**|\n".format(
139 |                     update_time,paper_title,paper_first_author,paper_key,paper_url,repo_url)
140 |                 content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({}), Code: **[{}]({})**".format(
141 |                     update_time,paper_title,paper_first_author,paper_url,paper_url,repo_url,repo_url)
142 |
143 |             else:
144 |                 content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|null|\n".format(
145 |                     update_time,paper_title,paper_first_author,paper_key,paper_url)
146 |                 content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({})".format(
147 |                     update_time,paper_title,paper_first_author,paper_url,paper_url)
148 |
149 |             # TODO: select useful comments
150 |             comments = None
151 |             if comments is not None:
152 |                 content_to_web[paper_key] += f", {comments}\n"
153 |             else:
154 |                 content_to_web[paper_key] += "\n"
155 |
156 |         except Exception as e:
157 |             logging.error(f"exception: {e} with id: {paper_key}")
158 |
159 |     data = {topic:content}
160 | data_web = {topic:content_to_web}
161 | return data,data_web
162 |
163 | def update_paper_links(filename):
164 | '''
165 | weekly update paper links in json file
166 | '''
167 | def parse_arxiv_string(s):
168 | parts = s.split("|")
169 | date = parts[1].strip()
170 | title = parts[2].strip()
171 | authors = parts[3].strip()
172 | arxiv_id = parts[4].strip()
173 | code = parts[5].strip()
174 | arxiv_id = re.sub(r'v\d+', '', arxiv_id)
175 | return date,title,authors,arxiv_id,code
176 |
177 | with open(filename,"r") as f:
178 | content = f.read()
179 | if not content:
180 | m = {}
181 | else:
182 | m = json.loads(content)
183 |
184 | json_data = m.copy()
185 |
186 | for keywords,v in json_data.items():
187 | logging.info(f'keywords = {keywords}')
188 | for paper_id,contents in v.items():
189 | contents = str(contents)
190 |
191 | update_time, paper_title, paper_first_author, paper_url, code_url = parse_arxiv_string(contents)
192 |
193 | contents = "|{}|{}|{}|{}|{}|\n".format(update_time,paper_title,paper_first_author,paper_url,code_url)
194 | json_data[keywords][paper_id] = str(contents)
195 | logging.info(f'paper_id = {paper_id}, contents = {contents}')
196 |
197 |             valid_link = '|null|' not in contents
198 | if valid_link:
199 | continue
200 | try:
201 | code_url = base_url + paper_id #TODO
202 | r = requests.get(code_url).json()
203 | repo_url = None
204 | if "official" in r and r["official"]:
205 | repo_url = r["official"]["url"]
206 | if repo_url is not None:
207 | new_cont = contents.replace('|null|',f'|**[link]({repo_url})**|')
208 | logging.info(f'ID = {paper_id}, contents = {new_cont}')
209 | json_data[keywords][paper_id] = str(new_cont)
210 |
211 | except Exception as e:
212 | logging.error(f"exception: {e} with id: {paper_id}")
213 | # dump to json file
214 | with open(filename,"w") as f:
215 | json.dump(json_data,f)
216 |
217 | def update_json_file(filename,data_dict):
218 | '''
219 | daily update json file using data_dict
220 | '''
221 | with open(filename,"r") as f:
222 | content = f.read()
223 | if not content:
224 | m = {}
225 | else:
226 | m = json.loads(content)
227 |
228 | json_data = m.copy()
229 |
230 |     # update papers for each keyword
231 | for data in data_dict:
232 | for keyword in data.keys():
233 | papers = data[keyword]
234 |
235 | if keyword in json_data.keys():
236 | json_data[keyword].update(papers)
237 | else:
238 | json_data[keyword] = papers
239 |
240 | with open(filename,"w") as f:
241 | json.dump(json_data,f)
242 |
243 | def json_to_md(filename,md_filename,
244 | task = '',
245 | to_web = False,
246 | use_title = True,
247 | use_tc = True,
248 | show_badge = True,
249 | use_b2t = True):
250 | """
251 | @param filename: str
252 | @param md_filename: str
253 | @return None
254 | """
255 | def pretty_math(s:str) -> str:
256 | ret = ''
257 | match = re.search(r"\$.*\$", s)
258 | if match == None:
259 | return s
260 | math_start,math_end = match.span()
261 | space_trail = space_leading = ''
262 |         if math_start > 0 and s[math_start-1] not in (' ', '*'): space_trail = ' '  # guard: math at start of string
263 |         if math_end < len(s) and s[math_end] not in (' ', '*'): space_leading = ' '  # guard: math at end of string
264 | ret += s[:math_start]
265 | ret += f'{space_trail}${match.group()[1:-1].strip()}${space_leading}'
266 | ret += s[math_end:]
267 | return ret
268 |
269 | DateNow = datetime.date.today()
270 | DateNow = str(DateNow)
271 | DateNow = DateNow.replace('-','.')
272 |
273 | with open(filename,"r") as f:
274 | content = f.read()
275 | if not content:
276 | data = {}
277 | else:
278 | data = json.loads(content)
279 |
280 |     # clear README.md if it already exists, otherwise create it
281 | with open(md_filename,"w+") as f:
282 | pass
283 |
284 | # write data into README.md
285 | with open(md_filename,"a+") as f:
286 |
287 | if (use_title == True) and (to_web == True):
288 | f.write("---\n" + "layout: default\n" + "---\n\n")
289 |
290 | if show_badge == True:
291 | f.write(f"[![Contributors][contributors-shield]][contributors-url]\n")
292 | f.write(f"[![Forks][forks-shield]][forks-url]\n")
293 | f.write(f"[![Stargazers][stars-shield]][stars-url]\n")
294 | f.write(f"[![Issues][issues-shield]][issues-url]\n\n")
295 |
296 | if use_title == True:
297 |             #f.write(("CV-ARXIV-DAILY"
298 |             #         "Automatically Update CV Papers Daily\n"))
299 | f.write("## Updated on " + DateNow + "\n")
300 | else:
301 | f.write("> Updated on " + DateNow + "\n")
302 |
303 | # TODO: add usage
304 | f.write("> Usage instructions: [here](./docs/README.md#usage)\n\n")
305 | f.write("> This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily)\n\n")
306 |
307 | #Add: table of contents
308 | if use_tc == True:
309 | f.write("\n")
310 |             f.write(" Table of Contents\n")
311 | f.write(" \n")
312 | for keyword in data.keys():
313 | day_content = data[keyword]
314 | if not day_content:
315 | continue
316 | kw = keyword.replace(' ','-')
317 |                 f.write(f" - {keyword}\n")
318 |             f.write("\n")
319 | f.write(" \n\n")
320 |
321 | for keyword in data.keys():
322 | day_content = data[keyword]
323 | if not day_content:
324 | continue
325 | # the head of each part
326 | f.write(f"## {keyword}\n\n")
327 |
328 | if use_title == True :
329 | if to_web == False:
330 | f.write("|Publish Date|Title|Authors|PDF|Code|\n" + "|---|---|---|---|---|\n")
331 | else:
332 | f.write("| Publish Date | Title | Authors | PDF | Code |\n")
333 | f.write("|:---------|:-----------------------|:---------|:------|:------|\n")
334 |
335 |             # sort papers by arXiv ID in descending order (roughly newest first)
336 | day_content = sort_papers(day_content)
337 |
338 | for _,v in day_content.items():
339 | if v is not None:
340 | f.write(pretty_math(v)) # make latex pretty
341 |
342 | f.write(f"\n")
343 |
344 | #Add: back to top
345 | if use_b2t:
346 | top_info = f"#Updated on {DateNow}"
347 | top_info = top_info.replace(' ','-').replace('.','')
348 |                 f.write(f"(back to top)\n\n")
349 |
350 | if show_badge == True:
351 | # we don't like long string, break it!
352 | f.write((f"[contributors-shield]: https://img.shields.io/github/"
353 | f"contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge\n"))
354 | f.write((f"[contributors-url]: https://github.com/Vincentqyw/"
355 | f"cv-arxiv-daily/graphs/contributors\n"))
356 | f.write((f"[forks-shield]: https://img.shields.io/github/forks/Vincentqyw/"
357 | f"cv-arxiv-daily.svg?style=for-the-badge\n"))
358 | f.write((f"[forks-url]: https://github.com/Vincentqyw/"
359 | f"cv-arxiv-daily/network/members\n"))
360 | f.write((f"[stars-shield]: https://img.shields.io/github/stars/Vincentqyw/"
361 | f"cv-arxiv-daily.svg?style=for-the-badge\n"))
362 | f.write((f"[stars-url]: https://github.com/Vincentqyw/"
363 | f"cv-arxiv-daily/stargazers\n"))
364 | f.write((f"[issues-shield]: https://img.shields.io/github/issues/Vincentqyw/"
365 | f"cv-arxiv-daily.svg?style=for-the-badge\n"))
366 | f.write((f"[issues-url]: https://github.com/Vincentqyw/"
367 | f"cv-arxiv-daily/issues\n\n"))
368 |
369 | logging.info(f"{task} finished")
370 |
371 | def demo(**config):
372 | # TODO: use config
373 | data_collector = []
374 | data_collector_web= []
375 |
376 | keywords = config['kv']
377 | max_results = config['max_results']
378 | publish_readme = config['publish_readme']
379 | publish_gitpage = config['publish_gitpage']
380 | publish_wechat = config['publish_wechat']
381 | show_badge = config['show_badge']
382 |
383 | b_update = config['update_paper_links']
384 | logging.info(f'Update Paper Link = {b_update}')
385 | if config['update_paper_links'] == False:
386 | logging.info(f"GET daily papers begin")
387 | for topic, keyword in keywords.items():
388 | logging.info(f"Keyword: {topic}")
389 | data, data_web = get_daily_papers(topic, query = keyword,
390 | max_results = max_results)
391 | data_collector.append(data)
392 | data_collector_web.append(data_web)
393 | print("\n")
394 | logging.info(f"GET daily papers end")
395 |
396 | # 1. update README.md file
397 | if publish_readme:
398 | json_file = config['json_readme_path']
399 | md_file = config['md_readme_path']
400 | # update paper links
401 | if config['update_paper_links']:
402 | update_paper_links(json_file)
403 | else:
404 | # update json data
405 | update_json_file(json_file,data_collector)
406 | # json data to markdown
407 | json_to_md(json_file,md_file, task ='Update Readme', \
408 | show_badge = show_badge)
409 |
410 | # 2. update docs/index.md file (to gitpage)
411 | if publish_gitpage:
412 | json_file = config['json_gitpage_path']
413 | md_file = config['md_gitpage_path']
414 | # TODO: duplicated update paper links!!!
415 | if config['update_paper_links']:
416 | update_paper_links(json_file)
417 | else:
418 | update_json_file(json_file,data_collector)
419 | json_to_md(json_file, md_file, task ='Update GitPage', \
420 | to_web = True, show_badge = show_badge, \
421 | use_tc=False, use_b2t=False)
422 |
423 | # 3. Update docs/wechat.md file
424 | if publish_wechat:
425 | json_file = config['json_wechat_path']
426 | md_file = config['md_wechat_path']
427 | # TODO: duplicated update paper links!!!
428 | if config['update_paper_links']:
429 | update_paper_links(json_file)
430 | else:
431 | update_json_file(json_file, data_collector_web)
432 | json_to_md(json_file, md_file, task ='Update Wechat', \
433 | to_web=False, use_title= False, show_badge = show_badge)
434 |
435 | if __name__ == "__main__":
436 | parser = argparse.ArgumentParser()
437 | parser.add_argument('--config_path',type=str, default='config.yaml',
438 | help='configuration file path')
439 | parser.add_argument('--update_paper_links', default=False,
440 | action="store_true",help='whether to update paper links etc.')
441 | args = parser.parse_args()
442 | config = load_config(args.config_path)
443 | config = {**config, 'update_paper_links':args.update_paper_links}
444 | demo(**config)
445 |
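
The table rows that `get_daily_papers` writes and `update_paper_links` later parses share one pipe-delimited format. The following is a minimal, self-contained sketch of that round trip, not the script's actual code: `strip_version`, `make_row`, `parse_row`, and `fill_code_link` are hypothetical helper names, and the `http://arxiv.org/abs/` prefix is assumed from the links in the rendered table in `docs/index.md`.

```python
import re

def strip_version(arxiv_id: str) -> str:
    """Drop a trailing version suffix, e.g. '2108.09112v1' -> '2108.09112'."""
    return re.sub(r"v\d+$", "", arxiv_id)

def make_row(date, title, first_author, arxiv_id, repo_url=None) -> str:
    """Build one Markdown table row; the code column is 'null' when no repo is known."""
    key = strip_version(arxiv_id)
    paper_url = f"http://arxiv.org/abs/{key}"  # assumed URL prefix
    code = f"**[link]({repo_url})**" if repo_url else "null"
    return f"|**{date}**|**{title}**|{first_author} et.al.|[{key}]({paper_url})|{code}|\n"

def parse_row(row: str) -> tuple:
    """Split a row back into (date, title, authors, arxiv_id, code) cells."""
    parts = row.split("|")
    return tuple(p.strip() for p in parts[1:6])

def fill_code_link(row: str, repo_url: str) -> str:
    """Replace the 'null' placeholder in the code column with a link."""
    return row.replace("|null|", f"|**[link]({repo_url})**|")

row = make_row("2025-07-23", "BoSS: Beyond-Semantic Speech", "Qing Wang", "2507.17563v1")
date, title, authors, arxiv_id, code = parse_row(row)
```

Because parsing splits on `|`, a title that itself contains a pipe character would shift the columns. Escaping or rejecting such titles before writing the row would avoid that.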
--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | ---
4 |
5 | [![Contributors][contributors-shield]][contributors-url]
6 | [![Forks][forks-shield]][forks-url]
7 | [![Stargazers][stars-shield]][stars-url]
8 | [![Issues][issues-shield]][issues-url]
9 |
10 | ## Updated on 2025.12.22
11 | > Usage instructions: [here](./docs/README.md#usage)
12 |
13 | > This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily)
14 |
15 | ## TTS
16 |
17 | | Publish Date | Title | Authors | PDF | Code |
18 | |:---------|:-----------------------|:---------|:------|:------|
19 | |**2025-07-23**|**BoSS: Beyond-Semantic Speech**|Qing Wang et.al.|[2507.17563](http://arxiv.org/abs/2507.17563)|null|
20 | |**2025-07-21**|**A2TTS: TTS for Low Resource Indian Languages**|Ayush Singh Bhadoriya et.al.|[2507.15272](http://arxiv.org/abs/2507.15272)|null|
21 | |**2025-07-22**|**Hear Your Code Fail, Voice-Assisted Debugging for Python**|Sayed Mahbub Hasan Amiri et.al.|[2507.15007](http://arxiv.org/abs/2507.15007)|null|
22 | |**2025-07-20**|**DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis**|Yinghao Aaron Li et.al.|[2507.14988](http://arxiv.org/abs/2507.14988)|null|
23 | |**2025-07-17**|**NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech**|Maksim Borisov et.al.|[2507.13155](http://arxiv.org/abs/2507.13155)|null|
24 | |**2025-07-17**|**Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication**|Tianyu Song et.al.|[2507.13052](http://arxiv.org/abs/2507.13052)|null|
25 | |**2025-07-17**|**Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes**|Zhou Feng et.al.|[2507.12932](http://arxiv.org/abs/2507.12932)|null|
26 | |**2025-07-16**|**Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations**|Yichen Han et.al.|[2507.12197](http://arxiv.org/abs/2507.12197)|null|
27 | |**2025-07-16**|**EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis**|Haoxun Li et.al.|[2507.12015](http://arxiv.org/abs/2507.12015)|null|
28 | |**2025-07-15**|**Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection**|Ivan Viakhirev et.al.|[2507.11777](http://arxiv.org/abs/2507.11777)|null|
29 | |**2025-07-15**|**P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge**|Marvin Sach et.al.|[2507.11306](http://arxiv.org/abs/2507.11306)|null|
30 | |**2025-07-14**|**Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition**|Mengzhe Geng et.al.|[2507.10827](http://arxiv.org/abs/2507.10827)|null|
31 | |**2025-07-14**|**An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments**|Mikko Korkiakoski et.al.|[2507.10469](http://arxiv.org/abs/2507.10469)|null|
32 | |**2025-07-12**|**ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching**|Han Zhu et.al.|[2507.09318](http://arxiv.org/abs/2507.09318)|null|
33 | |**2025-07-12**|**Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning**|Dominika Woszczyk et.al.|[2507.09310](http://arxiv.org/abs/2507.09310)|null|
34 | |**2025-07-12**|**ClaritySpeech: Dementia Obfuscation in Speech**|Dominika Woszczyk et.al.|[2507.09282](http://arxiv.org/abs/2507.09282)|null|
35 | |**2025-07-11**|**Exploiting Leaderboards for Large-Scale Distribution of Malicious Models**|Anshuman Suri et.al.|[2507.08983](http://arxiv.org/abs/2507.08983)|null|
36 | |**2025-07-11**|**Unlocking Speech Instruction Data Potential with Query Rewriting**|Yonghua Hei et.al.|[2507.08603](http://arxiv.org/abs/2507.08603)|null|
37 | |**2025-07-11**|**MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling**|Jingjing Tang et.al.|[2507.08530](http://arxiv.org/abs/2507.08530)|null|
38 | |**2025-07-11**|**Active Learning for Text-to-Speech Synthesis with Informative Sample Collection**|Kentaro Seki et.al.|[2507.08319](http://arxiv.org/abs/2507.08319)|null|
39 | |**2025-07-10**|**SecureSpeech: Prompt-based Speaker and Content Protection**|Belinda Soh Hui Hui et.al.|[2507.07799](http://arxiv.org/abs/2507.07799)|null|
40 | |**2025-07-09**|**Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents**|Zackary Rackauckas et.al.|[2507.06483](http://arxiv.org/abs/2507.06483)|null|
41 | |**2025-07-08**|**Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis**|Xintong Hu et.al.|[2507.06116](http://arxiv.org/abs/2507.06116)|null|
42 | |**2025-07-08**|**Differentiable Reward Optimization for LLM based TTS system**|Changfeng Gao et.al.|[2507.05911](http://arxiv.org/abs/2507.05911)|null|
43 | |**2025-07-08**|**OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model**|Chen Wang et.al.|[2507.05177](http://arxiv.org/abs/2507.05177)|null|
44 | |**2025-07-07**|**Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2507.04598](http://arxiv.org/abs/2507.04598)|null|
45 | |**2025-07-06**|**TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet**|Jaeseok Jeong et.al.|[2507.04349](http://arxiv.org/abs/2507.04349)|null|
46 | |**2025-07-05**|**PresentAgent: Multimodal Agent for Presentation Video Generation**|Jingwei Shi et.al.|[2507.04036](http://arxiv.org/abs/2507.04036)|null|
47 | |**2025-07-05**|**Prosody Labeling with Phoneme-BERT and Speech Foundation Models**|Tomoki Koriyama et.al.|[2507.03912](http://arxiv.org/abs/2507.03912)|null|
48 | |**2025-07-05**|**Traceable TTS: Toward Watermark-Free TTS with Strong Traceability**|Yuxiang Zhao et.al.|[2507.03887](http://arxiv.org/abs/2507.03887)|null|
49 | |**2025-07-03**|**Open-Source System for Multilingual Translation and Cloned Speech Synthesis**|Mateo Cámara et.al.|[2507.02530](http://arxiv.org/abs/2507.02530)|null|
50 | |**2025-07-03**|**JoyTTS: LLM-based Spoken Chatbot With Voice Cloning**|Fangru Zhou et.al.|[2507.02380](http://arxiv.org/abs/2507.02380)|null|
51 | |**2025-07-02**|**A Dataset for Automatic Assessment of TTS Quality in Spanish**|Alejandro Sosa Welford et.al.|[2507.01805](http://arxiv.org/abs/2507.01805)|null|
52 | |**2025-07-02**|**SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech**|Cheng Zhuangfei et.al.|[2507.01348](http://arxiv.org/abs/2507.01348)|null|
53 | |**2025-07-02**|**Multi-interaction TTS toward professional recording reproduction**|Hiroki Kanagawa et.al.|[2507.00808](http://arxiv.org/abs/2507.00808)|null|
54 | |**2025-06-30**|**Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis**|Paul Mayer et.al.|[2507.00227](http://arxiv.org/abs/2507.00227)|null|
55 | |**2025-06-30**|**JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching**|Mingi Kwon et.al.|[2506.23552](http://arxiv.org/abs/2506.23552)|null|
56 | |**2025-06-29**|**You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties**|Paige Tuttösí et.al.|[2506.23367](http://arxiv.org/abs/2506.23367)|null|
57 | |**2025-06-23**|**IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech**|Siyi Zhou et.al.|[2506.21619](http://arxiv.org/abs/2506.21619)|null|
58 | |**2025-06-25**|**An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS**|Marie Kunešová et.al.|[2506.20190](http://arxiv.org/abs/2506.20190)|null|
59 | |**2025-06-24**|**TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems**|Christoph Minixhofer et.al.|[2506.19441](http://arxiv.org/abs/2506.19441)|null|
60 | |**2025-06-23**|**Selecting N-lowest scores for training MOS prediction models**|Yuto Kondo et.al.|[2506.18326](http://arxiv.org/abs/2506.18326)|null|
61 | |**2025-06-23**|**Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting**|Yuto Kondo et.al.|[2506.18307](http://arxiv.org/abs/2506.18307)|null|
62 | |**2025-06-23**|**JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles**|Yuto Kondo et.al.|[2506.18296](http://arxiv.org/abs/2506.18296)|null|
63 | |**2025-06-20**|**RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching**|Hyun Joon Park et.al.|[2506.16741](http://arxiv.org/abs/2506.16741)|null|
64 | |**2025-06-20**|**LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization**|Daejin Jo et.al.|[2506.16738](http://arxiv.org/abs/2506.16738)|null|
65 | |**2025-06-19**|**Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement**|Tuan-Nam Nguyen et.al.|[2506.16580](http://arxiv.org/abs/2506.16580)|null|
66 | |**2025-06-19**|**InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems**|Kexin Huang et.al.|[2506.16381](http://arxiv.org/abs/2506.16381)|**[link](https://github.com/kexinhuang19/instructttseval)**|
67 | |**2025-06-19**|**Optimizing Multilingual Text-To-Speech with Accents & Emotions**|Pranav Pawar et.al.|[2506.16310](http://arxiv.org/abs/2506.16310)|null|
68 | |**2025-06-18**|**TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data**|Kentaro Seki et.al.|[2506.15614](http://arxiv.org/abs/2506.15614)|null|
69 | |**2025-06-18**|**PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction**|Shufan Li et.al.|[2506.15556](http://arxiv.org/abs/2506.15556)|null|
70 | |**2025-06-18**|**EmojiVoice: Towards long-term controllable expressivity in robot speech**|Paige Tuttösí et.al.|[2506.15085](http://arxiv.org/abs/2506.15085)|null|
71 | |**2025-06-17**|**Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification**|Yiyang Zhao et.al.|[2506.14226](http://arxiv.org/abs/2506.14226)|null|
72 | |**2025-06-16**|**EmoNews: A Spoken Dialogue System for Expressive News Conversations**|Ryuki Matsuura et.al.|[2506.13894](http://arxiv.org/abs/2506.13894)|null|
73 | |**2025-06-16**|**ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching**|Han Zhu et.al.|[2506.13053](http://arxiv.org/abs/2506.13053)|null|
74 | |**2025-06-14**|**StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling**|Hui Wang et.al.|[2506.12570](http://arxiv.org/abs/2506.12570)|null|
75 | |**2025-06-14**|**Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech**|Yakov Kolani et.al.|[2506.12311](http://arxiv.org/abs/2506.12311)|null|
76 | |**2025-06-11**|**S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder**|Yu Pan et.al.|[2506.11160](http://arxiv.org/abs/2506.11160)|null|
77 | |**2025-06-16**|**A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data**|Cheng-Kang Chou et.al.|[2506.11130](http://arxiv.org/abs/2506.11130)|null|
78 | |**2025-06-10**|**GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions**|Wenkang Han et.al.|[2506.11127](http://arxiv.org/abs/2506.11127)|null|
79 | |**2025-06-10**|**ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams**|Freddie Grabovski et.al.|[2506.11125](http://arxiv.org/abs/2506.11125)|null|
80 | |**2025-06-12**|**Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs**|Hayato Futami et.al.|[2506.10299](http://arxiv.org/abs/2506.10299)|null|
81 | |**2025-06-11**|**UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching**|Neta Glazer et.al.|[2506.09874](http://arxiv.org/abs/2506.09874)|null|
82 | |**2025-06-15**|**EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection**|Christoph Schuhmann et.al.|[2506.09827](http://arxiv.org/abs/2506.09827)|null|
83 | |**2025-06-11**|**Ming-Omni: A Unified Multimodal Model for Perception and Generation**|Inclusion AI et.al.|[2506.09344](http://arxiv.org/abs/2506.09344)|**[link](https://github.com/inclusionai/ming)**|
84 | |**2025-06-10**|**A Review on Score-based Generative Models for Audio Applications**|Ge Zhu et.al.|[2506.08457](http://arxiv.org/abs/2506.08457)|null|
85 | |**2025-06-09**|**Seeing Voices: Generating A-Roll Video from Audio with Mirage**|Aditi Sundararaman et.al.|[2506.08279](http://arxiv.org/abs/2506.08279)|null|
86 | |**2025-06-09**|**Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation**|Rui Hu et.al.|[2506.07646](http://arxiv.org/abs/2506.07646)|null|
87 | |**2025-06-07**|**SynHate: Detecting Hate Speech in Synthetic Deepfake Audio**|Rishabh Ranjan et.al.|[2506.06772](http://arxiv.org/abs/2506.06772)|null|
88 | |**2025-06-09**|**Voice Impression Control in Zero-Shot TTS**|Keinichi Fujita et.al.|[2506.05688](http://arxiv.org/abs/2506.05688)|null|
89 | |**2025-06-05**|**Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning**|Hien Ohnaka et.al.|[2506.04527](http://arxiv.org/abs/2506.04527)|null|
90 | |**2025-06-04**|**Can we reconstruct a dysarthric voice with the large speech model Parler TTS?**|Ariadna Sanchez et.al.|[2506.04397](http://arxiv.org/abs/2506.04397)|null|
91 | |**2025-06-04**|**HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset**|Ryan Langman et.al.|[2506.04152](http://arxiv.org/abs/2506.04152)|null|
92 | |**2025-06-04**|**UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation**|Jinting Wang et.al.|[2506.04134](http://arxiv.org/abs/2506.04134)|null|
93 | |**2025-06-04**|**A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions**|Chung-Chun Wang et.al.|[2506.04077](http://arxiv.org/abs/2506.04077)|null|
94 | |**2025-06-04**|**Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages**|Utkarsh Pathak et.al.|[2506.03884](http://arxiv.org/abs/2506.03884)|null|
95 | |**2025-06-04**|**Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts**|Sidharth Pulipaka et.al.|[2506.03793](http://arxiv.org/abs/2506.03793)|null|
96 | |**2025-06-04**|**BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing**|Masaya Kawamura et.al.|[2506.03515](http://arxiv.org/abs/2506.03515)|null|
97 | |**2025-06-03**|**Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation**|Yongqi Wang et.al.|[2506.02997](http://arxiv.org/abs/2506.02997)|null|
98 | |**2025-06-03**|**Towards a Japanese Full-duplex Spoken Dialogue System**|Atsumoto Ohashi et.al.|[2506.02979](http://arxiv.org/abs/2506.02979)|null|
99 | |**2025-06-03**|**CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech**|Helin Wang et.al.|[2506.02863](http://arxiv.org/abs/2506.02863)|null|
100 | |**2025-06-03**|**Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions**|Xiaoxue Gao et.al.|[2506.02742](http://arxiv.org/abs/2506.02742)|null|
101 | |**2025-06-02**|**SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction**|Saurabh Agrawal et.al.|[2506.02082](http://arxiv.org/abs/2506.02082)|null|
102 | |**2025-06-02**|**Zero-Shot Text-to-Speech for Vietnamese**|Thi Vu et.al.|[2506.01322](http://arxiv.org/abs/2506.01322)|null|
103 | |**2025-06-02**|**CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction**|Yudong Lu et.al.|[2506.01268](http://arxiv.org/abs/2506.01268)|null|
104 | |**2025-06-02**|**WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing**|Yu Nakagome et.al.|[2506.01263](http://arxiv.org/abs/2506.01263)|null|
105 | |**2025-06-01**|**DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation**|Ming Meng et.al.|[2506.01020](http://arxiv.org/abs/2506.01020)|null|
106 | |**2025-06-01**|**Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models**|Kyowoon Lee et.al.|[2506.00832](http://arxiv.org/abs/2506.00832)|null|
107 | |**2025-05-30**|**Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation**|Wenrui Liu et.al.|[2505.24496](http://arxiv.org/abs/2505.24496)|null|
108 | |**2025-05-30**|**DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec**|Peijie Chen et.al.|[2505.24314](http://arxiv.org/abs/2505.24314)|null|
109 | |**2025-05-29**|**Can Emotion Fool Anti-spoofing?**|Aurosweta Mahapatra et.al.|[2505.23962](http://arxiv.org/abs/2505.23962)|null|
110 | |**2025-05-29**|**Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes**|Neta Glazer et.al.|[2505.23619](http://arxiv.org/abs/2505.23619)|null|
111 | |**2025-05-29**|**EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge**|Ruskin Raj Manku et.al.|[2505.23009](http://arxiv.org/abs/2505.23009)|**[link](https://github.com/boson-ai/emergenttts-eval-public)**|
112 | |**2025-05-29**|**LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting**|Pai Zhu et.al.|[2505.22995](http://arxiv.org/abs/2505.22995)|null|
113 | |**2025-05-28**|**Tell me Habibi, is it Real or Fake?**|Kartik Kuckreja et.al.|[2505.22581](http://arxiv.org/abs/2505.22581)|null|
114 | |**2025-05-28**|**A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity**|Charlotte Pouw et.al.|[2505.22236](http://arxiv.org/abs/2505.22236)|null|
115 | |**2025-05-27**|**Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech**|Nam-Gyu Kim et.al.|[2505.20868](http://arxiv.org/abs/2505.20868)|null|
116 | |**2025-05-26**|**Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling**|Qixi Zheng et.al.|[2505.19931](http://arxiv.org/abs/2505.19931)|null|
117 | |**2025-05-26**|**DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech**|Deok-Hyeon Cho et.al.|[2505.19687](http://arxiv.org/abs/2505.19687)|null|
118 | |**2025-05-26**|**KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization**|Zhaolin Li et.al.|[2505.19679](http://arxiv.org/abs/2505.19679)|null|
119 | |**2025-05-26**|**Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling**|Haiyang Sun et.al.|[2505.19669](http://arxiv.org/abs/2505.19669)|null|
120 | |**2025-05-26**|**Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment**|Jeongsoo Choi et.al.|[2505.19595](http://arxiv.org/abs/2505.19595)|**[link](https://github.com/zhikangniu/a-dma)**|
121 | |**2025-05-25**|**SpeakStream: Streaming Text-to-Speech with Interleaved Data**|Richard He Bai et.al.|[2505.19206](http://arxiv.org/abs/2505.19206)|null|
122 | |**2025-05-25**|**CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning**|Renyuan Li et.al.|[2505.19119](http://arxiv.org/abs/2505.19119)|null|
123 | |**2025-05-25**|**Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis**|Minsu Kim et.al.|[2505.18972](http://arxiv.org/abs/2505.18972)|null|
124 | |**2025-05-27**|**RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations**|Ashwin Sankar et.al.|[2505.18609](http://arxiv.org/abs/2505.18609)|null|
125 | |**2025-05-24**|**MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt**|Zhichao Wu et.al.|[2505.18453](http://arxiv.org/abs/2505.18453)|null|
126 | |**2025-05-23**|**What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection**|Binh Nguyen et.al.|[2505.17513](http://arxiv.org/abs/2505.17513)|null|
127 | |**2025-05-23**|**UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information**|Rui Wang et.al.|[2505.17426](http://arxiv.org/abs/2505.17426)|null|
128 | |**2025-05-23**|**Speechless: Speech Instruction Training Without Speech for Low Resource Languages**|Alan Dao et.al.|[2505.17417](http://arxiv.org/abs/2505.17417)|null|
129 | |**2025-05-22**|**Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2**|Zackary Rackauckas et.al.|[2505.17320](http://arxiv.org/abs/2505.17320)|null|
130 | |**2025-05-21**|**Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech**|Yejin Lee et.al.|[2505.17093](http://arxiv.org/abs/2505.17093)|null|
131 | |**2025-05-22**|**From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition**|Tianduo Wang et.al.|[2505.16972](http://arxiv.org/abs/2505.16972)|**[link](https://github.com/tianduowang/speech-bt)**|
132 | |**2025-05-21**|**MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling**|Yifan Cheng et.al.|[2505.15772](http://arxiv.org/abs/2505.15772)|null|
133 | |**2025-05-21**|**Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information**|Nicholas Sanders et.al.|[2505.15667](http://arxiv.org/abs/2505.15667)|null|
134 | |**2025-05-21**|**Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models**|Zirui Song et.al.|[2505.15406](http://arxiv.org/abs/2505.15406)|**[link](https://github.com/mbzuai-nlp/audiojailbreak)**|
135 | |**2025-05-20**|**FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation**|Yutong Liu et.al.|[2505.14351](http://arxiv.org/abs/2505.14351)|null|
136 | |**2025-05-21**|**AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models**|Guangke Chen et.al.|[2505.14103](http://arxiv.org/abs/2505.14103)|null|
137 | |**2025-05-20**|**SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement**|Kuan-Yu Chen et.al.|[2505.14066](http://arxiv.org/abs/2505.14066)|null|
138 | |**2025-05-22**|**Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising**|Ye-Xin Lu et.al.|[2505.13830](http://arxiv.org/abs/2505.13830)|null|
139 | |**2025-05-19**|**OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching**|Hieu-Nghia Huynh-Nguyen et.al.|[2505.12800](http://arxiv.org/abs/2505.12800)|null|
140 | |**2025-05-18**|**Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis**|Dong Yang et.al.|[2505.12226](http://arxiv.org/abs/2505.12226)|null|
141 | |**2025-05-16**|**Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese**|Xihuai Wang et.al.|[2505.11200](http://arxiv.org/abs/2505.11200)|null|
142 | |**2025-05-16**|**BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset**|Istiaq Ahmed Fahad et.al.|[2505.10885](http://arxiv.org/abs/2505.10885)|**[link](https://github.com/KamruzzamanAsif/BanglaFake)**|
143 | |**2025-05-15**|**UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech**|Jiaxuan Liu et.al.|[2505.10599](http://arxiv.org/abs/2505.10599)|null|
144 | |**2025-05-12**|**MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder**|Bowen Zhang et.al.|[2505.07916](http://arxiv.org/abs/2505.07916)|null|
145 | |**2025-05-12**|**Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications**|Biel Tura Vecino et.al.|[2505.07701](http://arxiv.org/abs/2505.07701)|null|
146 | |**2025-05-10**|**VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback**|Eason Chen et.al.|[2505.06676](http://arxiv.org/abs/2505.06676)|null|
147 | |**2025-05-10**|**Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation**|Abbas Bertina et.al.|[2505.06599](http://arxiv.org/abs/2505.06599)|null|
148 | |**2025-05-15**|**FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech**|Linhan Ma et.al.|[2505.05159](http://arxiv.org/abs/2505.05159)|null|
149 | |**2025-05-08**|**Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations**|Linrong Pan et.al.|[2505.05056](http://arxiv.org/abs/2505.05056)|null|
150 | |**2025-05-08**|**A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration**|Shaja Arul Selvamani et.al.|[2505.04885](http://arxiv.org/abs/2505.04885)|null|
151 | |**2025-05-07**|**Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment**|Xueyao Zhang et.al.|[2505.04113](http://arxiv.org/abs/2505.04113)|null|
152 | |**2025-05-06**|**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**|Zuwei Long et.al.|[2505.03739](http://arxiv.org/abs/2505.03739)|**[link](https://github.com/vita-mllm/vita-audio)**|
153 | |**2025-05-05**|**Generating Narrated Lecture Videos from Slides with Synchronized Highlights**|Alexander Holmberg et.al.|[2505.02966](http://arxiv.org/abs/2505.02966)|null|
154 | |**2025-05-05**|**Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play**|Yemin Shi et.al.|[2505.02707](http://arxiv.org/abs/2505.02707)|**[link](https://github.com/maitrix-org/voila)**|
155 | |**2025-04-30**|**Sadeed: Advancing Arabic Diacritization Through Small Language Model**|Zeina Aldallal et.al.|[2504.21635](http://arxiv.org/abs/2504.21635)|null|
156 | |**2025-04-29**|**ClonEval: An Open Voice Cloning Benchmark**|Iwona Christop et.al.|[2504.20581](http://arxiv.org/abs/2504.20581)|null|
157 | |**2025-05-02**|**Towards Flow-Matching-based TTS without Classifier-Free Guidance**|Yuzhe Liang et.al.|[2504.20334](http://arxiv.org/abs/2504.20334)|null|
158 | |**2025-04-27**|**Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget**|Xin Li et.al.|[2504.19146](http://arxiv.org/abs/2504.19146)|**[link](https://github.com/MYZY-AI/Muyan-TTS)**|
159 | |**2025-04-22**|**A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models**|Gengxian Cao et.al.|[2504.15552](http://arxiv.org/abs/2504.15552)|null|
160 | |**2025-04-18**|**ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents**|Takuya Sera et.al.|[2504.13793](http://arxiv.org/abs/2504.13793)|null|
161 | |**2025-04-22**|**EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting**|Guanrou Yang et.al.|[2504.12867](http://arxiv.org/abs/2504.12867)|null|
162 | |**2025-04-15**|**GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture**|Yaodong Song et.al.|[2504.12339](http://arxiv.org/abs/2504.12339)|null|
163 | |**2025-04-15**|**Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation**|Yan Rong et.al.|[2504.11002](http://arxiv.org/abs/2504.11002)|null|
164 | |**2025-04-15**|**Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy**|Botao Zhao et.al.|[2504.10819](http://arxiv.org/abs/2504.10819)|null|
165 | |**2025-04-14**|**Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis**|Yifan Yang et.al.|[2504.10352](http://arxiv.org/abs/2504.10352)|null|
166 | |**2025-04-14**|**AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis**|Dan Luo et.al.|[2504.10309](http://arxiv.org/abs/2504.10309)|null|
167 | |**2025-04-11**|**Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation**|Haowei Lou et.al.|[2504.08274](http://arxiv.org/abs/2504.08274)|null|
168 | |**2025-04-10**|**Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis**|Yizhong Geng et.al.|[2504.07858](http://arxiv.org/abs/2504.07858)|null|
169 | |**2025-04-10**|**SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow**|Kaidi Wang et.al.|[2504.07776](http://arxiv.org/abs/2504.07776)|null|
170 | |**2025-04-08**|**AVENet: Disentangling Features by Approximating Average Features for Voice Conversion**|Wenyu Wang et.al.|[2504.05833](http://arxiv.org/abs/2504.05833)|null|
171 | |**2025-04-07**|**SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation**|Stephen Brade et.al.|[2504.05106](http://arxiv.org/abs/2504.05106)|null|
172 | |**2025-04-04**|**RWKVTTS: Yet another TTS based on RWKV-7**|Lin yueyu et.al.|[2504.03289](http://arxiv.org/abs/2504.03289)|**[link](https://github.com/yynil/rwkvtts)**|
173 | |**2025-04-09**|**F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization**|Xiaohui Sun et.al.|[2504.02407](http://arxiv.org/abs/2504.02407)|null|
174 | |**2025-04-02**|**TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection**|Zhiming Ma et.al.|[2503.24115](http://arxiv.org/abs/2503.24115)|**[link](https://github.com/jimmyma99/teleantifraud)**|
175 | |**2025-03-30**|**Speculative End-Turn Detector for Efficient Speech Chatbot Assistant**|Hyunjong Ok et.al.|[2503.23439](http://arxiv.org/abs/2503.23439)|null|
176 | |**2025-03-29**|**SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System**|Hyeongju Kim et.al.|[2503.23108](http://arxiv.org/abs/2503.23108)|null|
177 | |**2025-03-26**|**Dual Audio-Centric Modality Coupling for Talking Head Generation**|Ao Fu et.al.|[2503.22728](http://arxiv.org/abs/2503.22728)|null|
178 | |**2025-03-28**|**DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation**|Haomin Zhang et.al.|[2503.22265](http://arxiv.org/abs/2503.22265)|null|
179 | |**2025-03-26**|**Text-Driven Voice Conversion via Latent State-Space Modeling**|Wen Li et.al.|[2503.20999](http://arxiv.org/abs/2503.20999)|null|
180 | |**2025-03-28**|**FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System**|Hao-Han Guo et.al.|[2503.20499](http://arxiv.org/abs/2503.20499)|null|
181 | |**2025-03-21**|**Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication**|Yiwen Xu et.al.|[2503.17479](http://arxiv.org/abs/2503.17479)|null|
182 | |**2025-03-10**|**VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection**|Kunal Chavan et.al.|[2503.16488](http://arxiv.org/abs/2503.16488)|null|
183 | |**2025-03-19**|**MoonCast: High-Quality Zero-Shot Podcast Generation**|Zeqian Ju et.al.|[2503.14345](http://arxiv.org/abs/2503.14345)|**[link](https://github.com/jzq2000/mooncast)**|
184 | |**2025-03-14**|**MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation**|Sungwoo Cho et.al.|[2503.11026](http://arxiv.org/abs/2503.11026)|null|
185 | |**2025-03-11**|**An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR**|Sewade Ogun et.al.|[2503.08954](http://arxiv.org/abs/2503.08954)|null|
186 | |**2025-03-03**|**Direct Speech to Speech Translation: A Review**|Mohammad Sarim et.al.|[2503.04799](http://arxiv.org/abs/2503.04799)|null|
187 | |**2025-03-06**|**LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM**|Sambal Shikhar et.al.|[2503.04724](http://arxiv.org/abs/2503.04724)|null|
188 | |**2025-03-06**|**Scaling Rich Style-Prompted Text-to-Speech Datasets**|Anuj Diwan et.al.|[2503.04713](http://arxiv.org/abs/2503.04713)|**[link](https://github.com/ajd12342/paraspeechcaps)**|
189 | |**2025-03-04**|**InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training**|Dingdong Wang et.al.|[2503.02769](http://arxiv.org/abs/2503.02769)|null|
190 | |**2025-03-03**|**Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens**|Xinsheng Wang et.al.|[2503.01710](http://arxiv.org/abs/2503.01710)|**[link](https://github.com/sparkaudio/spark-tts)**|
191 | |**2025-03-02**|**UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation**|Alexander H. Liu et.al.|[2503.00733](http://arxiv.org/abs/2503.00733)|null|
192 | |**2025-03-12**|**Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale**|Max M. Lang et.al.|[2502.20140](http://arxiv.org/abs/2502.20140)|null|
193 | |**2025-02-26**|**Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis**|Ziyue Jiang et.al.|[2502.18924](http://arxiv.org/abs/2502.18924)|null|
194 | |**2025-03-08**|**Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding**|Tianyun Liu et.al.|[2502.18889](http://arxiv.org/abs/2502.18889)|null|
195 | |**2025-02-24**|**Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM**|Jiatong Shi et.al.|[2502.16897](http://arxiv.org/abs/2502.16897)|null|
196 | |**2025-02-17**|**NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing**|Yifan Liang et.al.|[2502.12002](http://arxiv.org/abs/2502.12002)|null|
197 | |**2025-02-16**|**SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer**|Zhengyan Sheng et.al.|[2502.11094](http://arxiv.org/abs/2502.11094)|null|
198 | |**2025-02-14**|**VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect**|Qingyuan Fei et.al.|[2502.10329](http://arxiv.org/abs/2502.10329)|null|
199 | |**2025-02-13**|**TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument**|Kyungsu Kim et.al.|[2502.08939](http://arxiv.org/abs/2502.08939)|**[link](https://github.com/kyungsukim42/tokensynth)**|
200 | |**2025-03-02**|**ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech**|Xin Wang et.al.|[2502.08857](http://arxiv.org/abs/2502.08857)|null|
201 | |**2025-02-11**|**LoRP-TTS: Low-Rank Personalized Text-To-Speech**|Łukasz Bondaruk et.al.|[2502.07562](http://arxiv.org/abs/2502.07562)|null|
202 | |**2025-02-11**|**Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction**|Leying Zhang et.al.|[2502.07345](http://arxiv.org/abs/2502.07345)|null|
203 | |**2025-02-11**|**Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement**|Xueyao Zhang et.al.|[2502.07243](http://arxiv.org/abs/2502.07243)|null|
204 | |**2025-02-10**|**Synthetic Audio Helps for Cognitive State Tasks**|Adil Soubki et.al.|[2502.06922](http://arxiv.org/abs/2502.06922)|**[link](https://github.com/adil-soubki/sad-training)**|
205 | |**2025-02-19**|**Speech to Speech Translation with Translatotron: A State of the Art Review**|Jules R. Kala et.al.|[2502.05980](http://arxiv.org/abs/2502.05980)|null|
206 | |**2025-02-09**|**BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting**|Mohammad Jahid Ibna Basher et.al.|[2502.05729](http://arxiv.org/abs/2502.05729)|null|
207 | |**2025-02-08**|**Gender Bias in Instruction-Guided Speech Synthesis Models**|Chun-Yi Kuan et.al.|[2502.05649](http://arxiv.org/abs/2502.05649)|null|
208 | |**2025-02-08**|**IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System**|Wei Deng et.al.|[2502.05512](http://arxiv.org/abs/2502.05512)|null|
209 | |**2025-02-05**|**Metis: A Foundation Speech Generation Model with Masked Generative Pre-training**|Yuancheng Wang et.al.|[2502.03128](http://arxiv.org/abs/2502.03128)|**[link](https://github.com/open-mmlab/amphion)**|
210 | |**2025-02-05**|**Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech**|Jixun Yao et.al.|[2502.02950](http://arxiv.org/abs/2502.02950)|null|
211 | |**2025-02-04**|**Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet**|Shenran Wang et.al.|[2502.02703](http://arxiv.org/abs/2502.02703)|null|
212 | |**2025-02-04**|**Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation**|Peidong Wang et.al.|[2502.02683](http://arxiv.org/abs/2502.02683)|null|
213 | |**2025-02-02**|**EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis**|Junuk Cha et.al.|[2502.00654](http://arxiv.org/abs/2502.00654)|null|
214 | |**2025-01-31**|**VisualSpeech: Enhance Prosody with Visual Context in TTS**|Shumin Que et.al.|[2501.19258](http://arxiv.org/abs/2501.19258)|null|
215 | |**2025-01-29**|**BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights**|Chan-Jan Hsu et.al.|[2501.17790](http://arxiv.org/abs/2501.17790)|null|
216 | |**2025-01-28**|**Compact Neural TTS Voices for Accessibility**|Kunal Jain et.al.|[2501.17332](http://arxiv.org/abs/2501.17332)|null|
217 | |**2025-01-26**|**Overview of the Amphion Toolkit (v0.2)**|Jiaqi Li et.al.|[2501.15442](http://arxiv.org/abs/2501.15442)|**[link](https://github.com/open-mmlab/amphion)**|
218 | |**2025-01-24**|**Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models**|Tianrui Wang et.al.|[2501.14273](http://arxiv.org/abs/2501.14273)|null|
219 | |**2025-01-24**|**Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation**|Wen Huang et.al.|[2501.14240](http://arxiv.org/abs/2501.14240)|null|
220 | |**2025-01-24**|**LoCoML: A Framework for Real-World ML Inference Pipelines**|Kritin Maddireddy et.al.|[2501.14165](http://arxiv.org/abs/2501.14165)|null|
221 | |**2025-01-23**|**Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement**|Jae-Sung Bae et.al.|[2501.13372](http://arxiv.org/abs/2501.13372)|null|
222 | |**2025-01-21**|**A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data**|Minh Tran et.al.|[2501.12501](http://arxiv.org/abs/2501.12501)|null|
223 | |**2025-01-15**|**Speech Synthesis along Perceptual Voice Quality Dimensions**|Frederik Rautenberg et.al.|[2501.08791](http://arxiv.org/abs/2501.08791)|null|
224 | |**2025-01-15**|**Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification**|Li Zhang et.al.|[2501.08691](http://arxiv.org/abs/2501.08691)|null|
225 | |**2025-01-15**|**Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement**|Qianniu Chen et.al.|[2501.08566](http://arxiv.org/abs/2501.08566)|null|
226 | |**2025-01-19**|**MathReader : Text-to-Speech for Mathematical Documents**|Sieun Hyeon et.al.|[2501.07088](http://arxiv.org/abs/2501.07088)|**[link](https://github.com/hyeonsieun/mathreader)**|
227 | |**2025-01-10**|**TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer**|Vladimir Bataev et.al.|[2501.06320](http://arxiv.org/abs/2501.06320)|null|
228 | |**2025-01-10**|**MinMo: A Multimodal Large Language Model for Seamless Voice Interaction**|Qian Chen et.al.|[2501.06282](http://arxiv.org/abs/2501.06282)|null|
229 | |**2025-01-10**|**PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control**|Shaozuo Zhang et.al.|[2501.06276](http://arxiv.org/abs/2501.06276)|null|
230 | |**2025-01-10**|**Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron**|Kishor Kayyar Lakshminarayana et.al.|[2501.05976](http://arxiv.org/abs/2501.05976)|null|
231 | |**2025-01-10**|**MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model**|Matthew Baas et.al.|[2501.05787](http://arxiv.org/abs/2501.05787)|null|
232 | |**2025-01-09**|**Probing Speaker-specific Features in Speaker Representations**|Aemon Yat Fei Chiu et.al.|[2501.05310](http://arxiv.org/abs/2501.05310)|null|
233 | |**2025-01-08**|**Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model**|Sanjana Sankar et.al.|[2501.04799](http://arxiv.org/abs/2501.04799)|null|
234 | |**2025-01-08**|**DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions**|Weidong Chen et.al.|[2501.04256](http://arxiv.org/abs/2501.04256)|null|
235 | |**2025-01-02**|**FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles**|Tian-Hao Zhang et.al.|[2501.03181](http://arxiv.org/abs/2501.03181)|null|
236 | |**2025-01-02**|**RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer**|Seongho Hong et.al.|[2501.01182](http://arxiv.org/abs/2501.01182)|**[link](https://github.com/seongho608/ringformer)**|
237 | |**2025-01-02**|**Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT**|Dongyang Dai et.al.|[2501.01102](http://arxiv.org/abs/2501.01102)|null|
238 | |**2025-01-06**|**Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study**|Mykola Maslych et.al.|[2501.00168](http://arxiv.org/abs/2501.00168)|null|
239 | |**2024-12-28**|**Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting**|Wooseok Han et.al.|[2412.20155](http://arxiv.org/abs/2412.20155)|null|
240 | |**2024-12-26**|**"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities**|Jiawei Yu et.al.|[2412.19102](http://arxiv.org/abs/2412.19102)|null|
241 | |**2024-12-26**|**Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID**|Ahmad Alfani Handoyo et.al.|[2412.19043](http://arxiv.org/abs/2412.19043)|null|
242 | |**2024-12-25**|**Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset**|Neil Shah et.al.|[2412.18839](http://arxiv.org/abs/2412.18839)|null|
243 | |**2024-12-24**|**GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing**|Wen Ku et.al.|[2412.18300](http://arxiv.org/abs/2412.18300)|null|
244 | |**2024-12-22**|**Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective**|Hankun Wang et.al.|[2412.17048](http://arxiv.org/abs/2412.17048)|null|
245 | |**2024-12-22**|**Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis**|Ye-Xin Lu et.al.|[2412.16977](http://arxiv.org/abs/2412.16977)|null|
246 | |**2024-12-22**|**Autoregressive Speech Synthesis with Next-Distribution Prediction**|Xinfa Zhu et.al.|[2412.16846](http://arxiv.org/abs/2412.16846)|null|
247 | |**2024-12-23**|**Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers**|Yifan Yang et.al.|[2412.16102](http://arxiv.org/abs/2412.16102)|null|
248 | |**2024-12-19**|**Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling**|Leying Zhang et.al.|[2412.14890](http://arxiv.org/abs/2412.14890)|null|
249 | |**2024-12-17**|**Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge**|Mahieyin Rahmun et.al.|[2412.13279](http://arxiv.org/abs/2412.13279)|**[link](https://github.com/AGenCyLab/SPCUP2022)**|
250 | |**2024-12-17**|**Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion**|Syed Zohaib Hassan et.al.|[2412.12710](http://arxiv.org/abs/2412.12710)|null|
251 | |**2024-12-17**|**Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes**|Kuiyuan Zhang et.al.|[2412.12619](http://arxiv.org/abs/2412.12619)|null|
252 | |**2024-12-17**|**Hierarchical Control of Emotion Rendering in Speech Synthesis**|Sho Inoue et.al.|[2412.12498](http://arxiv.org/abs/2412.12498)|**[link](https://github.com/shinshoji01/hed-project-page)**|
253 | |**2024-12-19**|**ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis**|Xiangheng He et.al.|[2412.11795](http://arxiv.org/abs/2412.11795)|null|
254 | |**2024-12-17**|**Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech**|Rui Liu et.al.|[2412.11409](http://arxiv.org/abs/2412.11409)|**[link](https://github.com/ai-s2-lab/m2se-vtts)**|
255 | |**2024-12-16**|**Efficient Generative Modeling with Residual Vector Quantization-Based Tokens**|Jaehyeon Kim et.al.|[2412.10208](http://arxiv.org/abs/2412.10208)|null|
256 | |**2024-12-13**|**AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation**|Xiyuan Gao et.al.|[2412.10103](http://arxiv.org/abs/2412.10103)|null|
257 | |**2024-12-13**|**CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder**|Jianwei Cui et.al.|[2412.08918](http://arxiv.org/abs/2412.08918)|null|
258 | |**2024-12-11**|**Multimodal Latent Language Modeling with Next-Token Diffusion**|Yutao Sun et.al.|[2412.08635](http://arxiv.org/abs/2412.08635)|**[link](https://github.com/microsoft/unilm/tree/master/LatentLM)**|
259 | |**2024-12-11**|**A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction**|Sowmya Cheripally et.al.|[2412.08312](http://arxiv.org/abs/2412.08312)|null|
260 | |**2024-12-11**|**A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings**|Anindita Mondal et.al.|[2412.08283](http://arxiv.org/abs/2412.08283)|null|
261 | |**2024-12-11**|**LatentSpeech: Latent Diffusion for Text-To-Speech Generation**|Haowei Lou et.al.|[2412.08117](http://arxiv.org/abs/2412.08117)|null|
262 | |**2024-12-11**|**Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration**|Haowei Lou et.al.|[2412.08112](http://arxiv.org/abs/2412.08112)|null|
263 | |**2024-12-09**|**Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey**|Tianxin Xie et.al.|[2412.06602](http://arxiv.org/abs/2412.06602)|**[link](https://github.com/imxtx/awesome-controllabe-speech-synthesis)**|
264 | |**2024-12-12**|**EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations**|Weizhen Bian et.al.|[2412.06581](http://arxiv.org/abs/2412.06581)|null|
265 | |**2024-12-01**|**Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor**|Ashwin Baluja et.al.|[2412.05315](http://arxiv.org/abs/2412.05315)|null|
266 | |**2024-12-04**|**DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles**|Jiaxuan Liu et.al.|[2412.03388](http://arxiv.org/abs/2412.03388)|null|
267 | |**2024-12-03**|**GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot**|Aohan Zeng et.al.|[2412.02612](http://arxiv.org/abs/2412.02612)|**[link](https://github.com/thudm/glm-4-voice)**|
268 | |**2024-11-19**|**A Context-Based Numerical Format Prediction for a Text-To-Speech System**|Yaser Darwesh et.al.|[2412.00028](http://arxiv.org/abs/2412.00028)|null|
269 | |**2024-11-27**|**Continual Learning in Machine Speech Chain Using Gradient Episodic Memory**|Geoffrey Tyndall et.al.|[2411.18320](http://arxiv.org/abs/2411.18320)|null|
270 | |**2024-11-27**|**SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation**|Wenyi Yu et.al.|[2411.18138](http://arxiv.org/abs/2411.18138)|null|
271 | |**2024-11-26**|**WavChat: A Survey of Spoken Dialogue Models**|Shengpeng Ji et.al.|[2411.13577](http://arxiv.org/abs/2411.13577)|**[link](https://github.com/jishengpeng/wavchat)**|
272 | |**2024-12-02**|**I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception**|Jiawei Zhang et.al.|[2411.13314](http://arxiv.org/abs/2411.13314)|null|
273 | |**2024-11-20**|**Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM**|Jiawei Yu et.al.|[2411.13159](http://arxiv.org/abs/2411.13159)|null|
274 | |**2024-11-19**|**Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation**|Praveen Srinivasa Varadhan et.al.|[2411.12719](http://arxiv.org/abs/2411.12719)|null|
275 | |**2024-11-19**|**Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D**|Adithya TG et.al.|[2411.12619](http://arxiv.org/abs/2411.12619)|null|
276 | |**2024-11-18**|**ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram**|Xiao-Hang Jiang et.al.|[2411.11258](http://arxiv.org/abs/2411.11258)|null|
277 | |**2024-11-12**|**Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models**|Dongrui Han et.al.|[2411.07563](http://arxiv.org/abs/2411.07563)|null|
278 | |**2024-11-11**|**Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities**|Snehasish Paul Shivali Chauhan et.al.|[2411.06970](http://arxiv.org/abs/2411.06970)|null|
279 | |**2024-11-10**|**Debatts: Zero-Shot Debating Text-to-Speech Synthesis**|Yiqiao Huang et.al.|[2411.06540](http://arxiv.org/abs/2411.06540)|null|
280 | |**2024-11-07**|**CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR**|Kadir Burak Buldu et.al.|[2411.04671](http://arxiv.org/abs/2411.04671)|null|
281 | |**2024-11-04**|**EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector**|Deok-Hyeon Cho et.al.|[2411.02625](http://arxiv.org/abs/2411.02625)|**[link](https://github.com/Choddeok/EmoSpherepp)**|
282 | |**2024-11-09**|**Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis**|Shijia Liao et.al.|[2411.01156](http://arxiv.org/abs/2411.01156)|**[link](https://github.com/fishaudio/fish-speech)**|
283 | |**2024-10-31**|**Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?**|Ioannis Tsiamas et.al.|[2410.24019](http://arxiv.org/abs/2410.24019)|null|
284 | |**2024-10-30**|**Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis**|Théodor Lemerle et.al.|[2410.23320](http://arxiv.org/abs/2410.23320)|**[link](https://github.com/theodorblackbird/lina-speech)**|
285 | |**2024-10-29**|**Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech**|Eric Battenberg et.al.|[2410.22179](http://arxiv.org/abs/2410.22179)|null|
286 | |**2024-10-29**|**Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding**|Bohan Li et.al.|[2410.21951](http://arxiv.org/abs/2410.21951)|null|
287 | |**2024-10-29**|**RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis**|Kehan Sui et.al.|[2410.21641](http://arxiv.org/abs/2410.21641)|null|
288 | |**2024-10-28**|**Asynchronous Tool Usage for Real-Time Agents**|Antonio A. Ginart et.al.|[2410.21620](http://arxiv.org/abs/2410.21620)|null|
289 | |**2024-10-28**|**Enhancing TTS Stability in Hebrew using Discrete Semantic Units**|Ella Zeldes et.al.|[2410.21502](http://arxiv.org/abs/2410.21502)|null|
290 | |**2024-10-28**|**Mitigating Unauthorized Speech Synthesis for Voice Protection**|Zhisheng Zhang et.al.|[2410.20742](http://arxiv.org/abs/2410.20742)|**[link](https://github.com/wxzyd123/pivotal_objective_perturbation)**|
291 | |**2024-10-27**|**Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation**|Maohao Shen et.al.|[2410.20336](http://arxiv.org/abs/2410.20336)|null|
292 | |**2024-10-24**|**Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis**|Suparna De et.al.|[2410.19199](http://arxiv.org/abs/2410.19199)|null|
293 | |**2024-10-24**|**STTATTS: Unified Speech-To-Text And Text-To-Speech Model**|Hawau Olamide Toyin et.al.|[2410.18607](http://arxiv.org/abs/2410.18607)|**[link](https://github.com/mbzuai-nlp/sttatts)**|
294 | |**2024-10-24**|**Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts**|ChaeHun Park et.al.|[2410.18444](http://arxiv.org/abs/2410.18444)|null|
295 | |**2024-10-23**|**ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams**|Srija Anand et.al.|[2410.17901](http://arxiv.org/abs/2410.17901)|null|
296 | |**2024-10-22**|**Continuous Speech Tokenizer in Text To Speech**|Yixing Li et.al.|[2410.17081](http://arxiv.org/abs/2410.17081)|null|
297 | |**2024-10-22**|**Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap**|Guanrou Yang et.al.|[2410.16726](http://arxiv.org/abs/2410.16726)|null|
298 | |**2024-10-21**|**Continuous Speech Synthesis using per-token Latent Diffusion**|Arnon Turetzky et.al.|[2410.16048](http://arxiv.org/abs/2410.16048)|null|
299 | |**2024-10-18**|**A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages**|Sujitha Sathiyamoorthy et.al.|[2410.14197](http://arxiv.org/abs/2410.14197)|null|
300 | |**2024-10-18**|**Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech**|Shuwei He et.al.|[2410.14101](http://arxiv.org/abs/2410.14101)|**[link](https://github.com/ms2ku-vtts/ms2ku-vtts)**|
301 | |**2024-10-17**|**Enhancing Crowdsourced Audio for Text-to-Speech Models**|José Giraldo et.al.|[2410.13357](http://arxiv.org/abs/2410.13357)|null|
302 | |**2024-10-17**|**DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech**|Jan Melechovsky et.al.|[2410.13342](http://arxiv.org/abs/2410.13342)|null|
303 | |**2024-10-17**|**DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2410.13288](http://arxiv.org/abs/2410.13288)|null|
304 | |**2024-10-17**|**Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation**|Sreyan Ghosh et.al.|[2410.13198](http://arxiv.org/abs/2410.13198)|null|
305 | |**2024-10-16**|**ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs**|Rui-Chen Zheng et.al.|[2410.12359](http://arxiv.org/abs/2410.12359)|null|
306 | |**2024-10-14**|**IsoChronoMeter: A simple and effective isochronic translation evaluation metric**|Nikolai Rozanov et.al.|[2410.11127](http://arxiv.org/abs/2410.11127)|null|
307 | |**2024-10-14**|**DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization**|Yinghao Aaron Li et.al.|[2410.11097](http://arxiv.org/abs/2410.11097)|null|
308 | |**2024-10-12**|**Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling**|Rui Liu et.al.|[2410.09524](http://arxiv.org/abs/2410.09524)|null|
309 | |**2024-10-10**|**Unsupervised Data Validation Methods for Efficient Model Training**|Yurii Paniv et.al.|[2410.07880](http://arxiv.org/abs/2410.07880)|null|
310 | |**2024-10-15**|**F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching**|Yushen Chen et.al.|[2410.06885](http://arxiv.org/abs/2410.06885)|**[link](https://github.com/SWivid/F5-TTS)**|
311 | |**2024-10-09**|**Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch**|Teodora Răgman et.al.|[2410.06787](http://arxiv.org/abs/2410.06787)|null|
312 | |**2024-10-09**|**Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS**|Onkar Kishor Susladkar et.al.|[2410.06608](http://arxiv.org/abs/2410.06608)|null|
313 | |**2024-10-09**|**Can DeepFake Speech be Reliably Detected?**|Hongbin Liu et.al.|[2410.06572](http://arxiv.org/abs/2410.06572)|null|
314 | |**2024-10-07**|**SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech**|Minchan Kim et.al.|[2410.04690](http://arxiv.org/abs/2410.04690)|null|
315 | |**2024-10-06**|**HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis**|Yuto Nishimura et.al.|[2410.04380](http://arxiv.org/abs/2410.04380)|null|
316 | |**2024-10-10**|**SONAR: A Synthetic AI-Audio Detection Framework and Benchmark**|Xiang Li et.al.|[2410.04324](http://arxiv.org/abs/2410.04324)|**[link](https://github.com/jessegator/sonar)**|
317 | |**2024-10-05**|**Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System**|Ze Li et.al.|[2410.04017](http://arxiv.org/abs/2410.04017)|null|
318 | |**2024-10-01**|**Recent Advances in Speech Language Models: A Survey**|Wenqian Cui et.al.|[2410.03751](http://arxiv.org/abs/2410.03751)|null|
319 | |**2024-10-04**|**Generative Semantic Communication for Text-to-Speech Synthesis**|Jiahao Zheng et.al.|[2410.03459](http://arxiv.org/abs/2410.03459)|null|
320 | |**2024-10-04**|**Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens**|Jinzheng Zhao et.al.|[2410.03298](http://arxiv.org/abs/2410.03298)|null|
321 | |**2024-10-04**|**Narrative Player: Reviving Data Narratives with Visuals**|Zekai Shao et.al.|[2410.03268](http://arxiv.org/abs/2410.03268)|null|
322 | |**2024-10-04**|**MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech**|Taejun Bak et.al.|[2410.03192](http://arxiv.org/abs/2410.03192)|null|
323 | |**2024-10-01**|**Augmentation through Laundering Attacks for Audio Spoof Detection**|Hashim Ali et.al.|[2410.01108](http://arxiv.org/abs/2410.01108)|null|
324 | |**2024-10-01**|**Zero-Shot Text-to-Speech from Continuous Text Streams**|Trung Dang et.al.|[2410.00767](http://arxiv.org/abs/2410.00767)|null|
325 | |**2024-10-01**|**EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control**|Haozhe Chen et.al.|[2410.00316](http://arxiv.org/abs/2410.00316)|**[link](https://github.com/tonychenxyz/emoknob)**|
326 | |**2024-09-30**|**Word-wise intonation model for cross-language TTS systems**|Tomilov A. A. et.al.|[2409.20374](http://arxiv.org/abs/2409.20374)|null|
327 | |**2024-09-27**|**Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech**|Youngjae Kim et.al.|[2409.18622](http://arxiv.org/abs/2409.18622)|null|
328 | |**2024-09-26**|**Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control**|Ryuichi Yamamoto et.al.|[2409.17452](http://arxiv.org/abs/2409.17452)|null|
329 | |**2024-09-25**|**Exploring synthetic data for cross-speaker style transfer in style representation based TTS**|Lucas H. Ueda et.al.|[2409.17364](http://arxiv.org/abs/2409.17364)|null|
330 | |**2024-09-25**|**Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions**|Kun Zhou et.al.|[2409.16681](http://arxiv.org/abs/2409.16681)|null|
331 | |**2024-09-25**|**Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation**|Siyin Wang et.al.|[2409.16644](http://arxiv.org/abs/2409.16644)|null|
332 | |**2024-09-24**|**FastTalker: Jointly Generating Speech and Conversational Gestures from Text**|Zixin Guo et.al.|[2409.16404](http://arxiv.org/abs/2409.16404)|null|
333 | |**2024-09-24**|**Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling**|Ville Heilala et.al.|[2409.16376](http://arxiv.org/abs/2409.16376)|null|
334 | |**2024-09-24**|**Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech**|Yunji Chu et.al.|[2409.16203](http://arxiv.org/abs/2409.16203)|null|
335 | |**2024-09-24**|**NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers**|Nohil Park et.al.|[2409.15760](http://arxiv.org/abs/2409.15760)|null|
336 | |**2024-09-24**|**VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance**|Jiheum Yeom et.al.|[2409.15759](http://arxiv.org/abs/2409.15759)|null|
337 | |**2024-09-24**|**StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis**|Zhiyong Chen et.al.|[2409.15741](http://arxiv.org/abs/2409.15741)|null|
338 | |**2024-09-23**|**A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**|Lam Pham et.al.|[2409.15180](http://arxiv.org/abs/2409.15180)|null|
339 | |**2024-09-23**|**LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation**|Hieu-Thi Luong et.al.|[2409.14743](http://arxiv.org/abs/2409.14743)|null|
340 | |**2024-09-20**|**Zero-shot Cross-lingual Voice Transfer for TTS**|Fadi Biadsy et.al.|[2409.13910](http://arxiv.org/abs/2409.13910)|null|
341 | |**2024-09-20**|**On the Feasibility of Fully AI-automated Vishing Attacks**|João Figueiredo et.al.|[2409.13793](http://arxiv.org/abs/2409.13793)|null|
342 | |**2024-09-19**|**Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space**|Sebastião Quintas et.al.|[2409.12745](http://arxiv.org/abs/2409.12745)|null|
343 | |**2024-09-19**|**Preference Alignment Improves Language Model-Based TTS**|Jinchuan Tian et.al.|[2409.12403](http://arxiv.org/abs/2409.12403)|null|
344 | |**2024-09-18**|**Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference**|Edresson Casanova et.al.|[2409.12117](http://arxiv.org/abs/2409.12117)|null|
345 | |**2024-09-18**|**Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems**|Anusha Prakash et.al.|[2409.11915](http://arxiv.org/abs/2409.11915)|null|
346 | |**2024-09-18**|**DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech**|Xin Qi et.al.|[2409.11835](http://arxiv.org/abs/2409.11835)|null|
347 | |**2024-09-18**|**Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation**|Haohan Guo et.al.|[2409.11630](http://arxiv.org/abs/2409.11630)|null|
348 | |**2024-09-17**|**SpMis: An Investigation of Synthetic Spoken Misinformation Detection**|Peizhuo Liu et.al.|[2409.11308](http://arxiv.org/abs/2409.11308)|null|
349 | |**2024-09-19**|**The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives**|Samee Arif et.al.|[2409.11261](http://arxiv.org/abs/2409.11261)|**[link](https://github.com/ulrs0/The-Art-of-Story-Telling)**|
350 | |**2024-09-17**|**Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora**|Francesco Nespoli et.al.|[2409.11107](http://arxiv.org/abs/2409.11107)|null|
351 | |**2024-09-16**|**Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization**|Xiaoxue Gao et.al.|[2409.10157](http://arxiv.org/abs/2409.10157)|null|
352 | |**2024-09-16**|**StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion**|Yinghao Aaron Li et.al.|[2409.10058](http://arxiv.org/abs/2409.10058)|null|
353 | |**2024-09-15**|**Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning**|Siqi Sun et.al.|[2409.09891](http://arxiv.org/abs/2409.09891)|null|
354 | |**2024-09-14**|**E1 TTS: Simple and Fast Non-Autoregressive TTS**|Zhijun Liu et.al.|[2409.09351](http://arxiv.org/abs/2409.09351)|null|
355 | |**2024-09-14**|**Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation**|Changjin Han et.al.|[2409.09311](http://arxiv.org/abs/2409.09311)|null|
356 | |**2024-09-14**|**SafeEar: Content Privacy-Preserving Audio Deepfake Detection**|Xinfeng Li et.al.|[2409.09272](http://arxiv.org/abs/2409.09272)|**[link](https://github.com/LetterLiGo/SafeEar)**|
357 | |**2024-09-13**|**AccentBox: Towards High-Fidelity Zero-Shot Accent Generation**|Jinzuomu Zhong et.al.|[2409.09098](http://arxiv.org/abs/2409.09098)|null|
358 | |**2024-09-17**|**HLTCOE JHU Submission to the Voice Privacy Challenge 2024**|Henry Li Xinyuan et.al.|[2409.08913](http://arxiv.org/abs/2409.08913)|null|
359 | |**2024-09-13**|**Text-To-Speech Synthesis In The Wild**|Jee-weon Jung et.al.|[2409.08711](http://arxiv.org/abs/2409.08711)|null|
360 | |**2024-09-14**|**Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions**|Amila Indika et.al.|[2409.07945](http://arxiv.org/abs/2409.07945)|null|
361 | |**2024-09-12**|**Full-text Error Correction for Chinese Speech Recognition with Large Language Model**|Zhiyuan Tang et.al.|[2409.07790](http://arxiv.org/abs/2409.07790)|null|
362 | |**2024-09-11**|**SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis**|Helin Wang et.al.|[2409.07556](http://arxiv.org/abs/2409.07556)|**[link](https://github.com/WangHelin1997/SSR-Speech)**|
363 | |**2024-09-11**|**D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack**|Hong-Hanh Nguyen-Le et.al.|[2409.07390](http://arxiv.org/abs/2409.07390)|null|
364 | |**2024-09-11**|**Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT**|Kazuki Yamauchi et.al.|[2409.07265](http://arxiv.org/abs/2409.07265)|null|
365 | |**2024-09-11**|**Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment**|Tien-Hong Lo et.al.|[2409.07151](http://arxiv.org/abs/2409.07151)|null|
366 | |**2024-09-10**|**Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models**|Xin Jing et.al.|[2409.06451](http://arxiv.org/abs/2409.06451)|null|
367 | |**2024-09-10**|**What happens to diffusion model likelihood when your model is conditional?**|Mattias Cross et.al.|[2409.06364](http://arxiv.org/abs/2409.06364)|null|
368 | |**2024-09-10**|**VoiceWukong: Benchmarking Deepfake Voice Detection**|Ziwei Yan et.al.|[2409.06348](http://arxiv.org/abs/2409.06348)|null|
369 | |**2024-09-09**|**AS-Speech: Adaptive Style For Speech Synthesis**|Zhipeng Li et.al.|[2409.05730](http://arxiv.org/abs/2409.05730)|null|
370 | |**2024-09-09**|**IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS**|Ashwin Sankar et.al.|[2409.05356](http://arxiv.org/abs/2409.05356)|**[link](https://github.com/ai4bharat/indicvoices-r)**|
371 | |**2024-09-10**|**Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion**|Zhengyang Chen et.al.|[2409.05004](http://arxiv.org/abs/2409.05004)|null|
372 | |**2024-09-01**|**Sample-Efficient Diffusion for Text-To-Speech Synthesis**|Justin Lovelace et.al.|[2409.03717](http://arxiv.org/abs/2409.03717)|**[link](https://github.com/justinlovelace/sesd)**|
373 | |**2024-09-10**|**LAST: Language Model Aware Speech Tokenization**|Arnon Turetzky et.al.|[2409.03701](http://arxiv.org/abs/2409.03701)|null|
374 | |**2024-09-05**|**FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications**|Hao-Han Guo et.al.|[2409.03283](http://arxiv.org/abs/2409.03283)|null|
375 | |**2024-09-04**|**Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems**|Jeongmin Liu et.al.|[2409.02517](http://arxiv.org/abs/2409.02517)|null|
376 | |**2024-09-03**|**VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka**|Li-Wei Chen et.al.|[2409.01548](http://arxiv.org/abs/2409.01548)|null|
377 | |**2024-09-02**|**A multilingual training strategy for low resource Text to Speech**|Asma Amalas et.al.|[2409.01217](http://arxiv.org/abs/2409.01217)|null|
378 | |**2024-09-02**|**A Framework for Synthetic Audio Conversations Generation using Large Language Models**|Kaung Myat Kyaw et.al.|[2409.00946](http://arxiv.org/abs/2409.00946)|null|
379 | |**2024-09-02**|**SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis**|Haohan Guo et.al.|[2409.00933](http://arxiv.org/abs/2409.00933)|**[link](https://github.com/hhguo/socodec)**|
380 | |**2024-09-01**|**MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer**|Yuancheng Wang et.al.|[2409.00750](http://arxiv.org/abs/2409.00750)|null|
381 | |**2024-08-30**|**SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection**|Ismail Rasim Ulgen et.al.|[2408.17432](http://arxiv.org/abs/2408.17432)|null|
382 | |**2024-08-30**|**AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge**|Kirill Borodin et.al.|[2408.17352](http://arxiv.org/abs/2408.17352)|null|
383 | |**2024-08-30**|**Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model**|Zhen Ye et.al.|[2408.17175](http://arxiv.org/abs/2408.17175)|**[link](https://github.com/zhenye234/xcodec)**|
384 | |**2024-08-30**|**Utilizing Speaker Profiles for Impersonation Audio Detection**|Hao Gu et.al.|[2408.17009](http://arxiv.org/abs/2408.17009)|null|
385 | |**2024-08-29**|**Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis**|Zehai Tu et.al.|[2408.16373](http://arxiv.org/abs/2408.16373)|null|
386 | |**2024-08-28**|**Multi-modal Adversarial Training for Zero-Shot Voice Cloning**|John Janiczek et.al.|[2408.15916](http://arxiv.org/abs/2408.15916)|null|
387 | |**2024-08-29**|**Easy, Interpretable, Effective: openSMILE for voice deepfake detection**|Octavian Pascu et.al.|[2408.15775](http://arxiv.org/abs/2408.15775)|null|
388 | |**2024-08-28**|**VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling**|Yixuan Zhou et.al.|[2408.15676](http://arxiv.org/abs/2408.15676)|**[link](https://github.com/thuhcsi/voxinstruct)**|
389 | |**2024-08-28**|**VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech**|Heeseung Kim et.al.|[2408.14739](http://arxiv.org/abs/2408.14739)|null|
390 | |**2024-08-27**|**StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech**|Haowei Lou et.al.|[2408.14713](http://arxiv.org/abs/2408.14713)|null|
391 | |**2024-08-27**|**DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance**|Jinhyeok Yang et.al.|[2408.14423](http://arxiv.org/abs/2408.14423)|null|
392 | |**2024-08-26**|**Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard**|Wonjune Kang et.al.|[2408.13970](http://arxiv.org/abs/2408.13970)|null|
393 | |**2024-08-28**|**SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2408.13893](http://arxiv.org/abs/2408.13893)|null|
394 | |**2024-08-22**|**Positional Description for Numerical Normalization**|Deepanshu Gupta et.al.|[2408.12430](http://arxiv.org/abs/2408.12430)|null|
395 | |**2024-08-22**|**VoiceX: A Text-To-Speech Framework for Custom Voices**|Silvan Mertes et.al.|[2408.12170](http://arxiv.org/abs/2408.12170)|null|
396 | |**2024-08-13**|**Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation**|Yinghao Aaron Li et.al.|[2408.11849](http://arxiv.org/abs/2408.11849)|null|
397 | |**2024-08-20**|**EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech**|Xin Qi et.al.|[2408.10852](http://arxiv.org/abs/2408.10852)|null|
398 | |**2024-08-20**|**SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS**|Karl El Hajal et.al.|[2408.10771](http://arxiv.org/abs/2408.10771)|null|
399 | |**2024-08-20**|**Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting**|Hyun Jin Park et.al.|[2408.10463](http://arxiv.org/abs/2408.10463)|null|
400 | |**2024-08-17**|**Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition**|Samuele Cornell et.al.|[2408.09215](http://arxiv.org/abs/2408.09215)|**[link](https://github.com/popcornell/ASRLightningFT)**|
401 | |**2024-08-14**|**PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation**|Sang-Hoon Lee et.al.|[2408.07547](http://arxiv.org/abs/2408.07547)|**[link](https://github.com/sh-lee-prml/periodwave)**|
402 | |**2024-08-13**|**SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis**|Osamu Take et.al.|[2408.06858](http://arxiv.org/abs/2408.06858)|**[link](https://github.com/sarulab-speech/saslaw)**|
403 | |**2024-08-13**|**PRESENT: Zero-Shot Text-to-Prosody Control**|Perry Lam et.al.|[2408.06827](http://arxiv.org/abs/2408.06827)|**[link](https://github.com/iamanigeeit/present)**|
404 | |**2024-08-12**|**FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks**|Min Ma et.al.|[2408.06227](http://arxiv.org/abs/2408.06227)|null|
405 | |**2024-08-11**|**VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing**|Chunyu Qiang et.al.|[2408.05758](http://arxiv.org/abs/2408.05758)|null|
406 | |**2024-08-06**|**Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training**|Hawraz A. Ahmad et.al.|[2408.03887](http://arxiv.org/abs/2408.03887)|null|
407 | |**2024-08-03**|**ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features**|Peng Cheng et.al.|[2408.01808](http://arxiv.org/abs/2408.01808)|**[link](https://github.com/TASER2023/TASER)**|
408 | |**2024-08-01**|**Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation**|Xinhan Di et.al.|[2408.00284](http://arxiv.org/abs/2408.00284)|null|
409 | |**2024-07-18**|**Handling Numeric Expressions in Automatic Speech Recognition**|Christian Huber et.al.|[2408.00004](http://arxiv.org/abs/2408.00004)|null|
410 | |**2024-07-31**|**On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition**|Nick Rossenbach et.al.|[2407.21476](http://arxiv.org/abs/2407.21476)|null|
411 | |**2024-07-29**|**Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks**|Mahmoud Salhab et.al.|[2407.18571](http://arxiv.org/abs/2407.18571)|null|
412 | |**2024-07-25**|**On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures**|Nick Rossenbach et.al.|[2407.17997](http://arxiv.org/abs/2407.17997)|null|
413 | |**2024-07-24**|**Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model**|Jan Lehečka et.al.|[2407.17167](http://arxiv.org/abs/2407.17167)|null|
414 | |**2024-07-23**|**Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments**|Pai Zhu et.al.|[2407.16840](http://arxiv.org/abs/2407.16840)|null|
415 | |**2024-07-19**|**Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2**|Chun Xu et.al.|[2407.14212](http://arxiv.org/abs/2407.14212)|null|
416 | |**2024-07-18**|**Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models**|Weiqin Li et.al.|[2407.13509](http://arxiv.org/abs/2407.13509)|null|
417 | |**2024-07-22**|**TTSDS -- Text-to-Speech Distribution Score**|Christoph Minixhofer et.al.|[2407.12707](http://arxiv.org/abs/2407.12707)|**[link](https://github.com/ttsds/ttsds)**|
418 | |**2024-07-17**|**Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech**|Haibin Wu et.al.|[2407.12229](http://arxiv.org/abs/2407.12229)|**[link](https://github.com/hbwu-ntu/emoctrltts-eval)**|
419 | |**2024-07-16**|**A Language Modeling Approach to Diacritic-Free Hebrew TTS**|Amit Roth et.al.|[2407.12206](http://arxiv.org/abs/2407.12206)|null|
420 | |**2024-07-17**|**Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding**|Chuanhao Sun et.al.|[2407.09370](http://arxiv.org/abs/2407.09370)|**[link](https://github.com/zhyuan11/SPE)**|
421 | |**2024-07-11**|**Autoregressive Speech Synthesis without Vector Quantization**|Lingwei Meng et.al.|[2407.08551](http://arxiv.org/abs/2407.08551)|null|
422 | |**2024-07-10**|**Source Tracing of Audio Deepfake Systems**|Nicholas Klein et.al.|[2407.08016](http://arxiv.org/abs/2407.08016)|null|
423 | |**2024-07-07**|**ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation**|Ruibo Fu et.al.|[2407.05421](http://arxiv.org/abs/2407.05421)|null|
424 | |**2024-07-09**|**CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens**|Zhihao Du et.al.|[2407.05407](http://arxiv.org/abs/2407.05407)|null|
425 | |**2024-07-04**|**Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis**|Cong-Thanh Do et.al.|[2407.04047](http://arxiv.org/abs/2407.04047)|null|
426 | |**2024-07-04**|**Optimizing a-DCF for Spoofing-Robust Speaker Verification**|Oğuzhan Kurnaz et.al.|[2407.04034](http://arxiv.org/abs/2407.04034)|null|
427 | |**2024-07-04**|**On the Effectiveness of Acoustic BPE in Decoder-Only TTS**|Bohan Li et.al.|[2407.03892](http://arxiv.org/abs/2407.03892)|null|
428 | |**2024-07-14**|**CATT: Character-based Arabic Tashkeel Transformer**|Faris Alasmary et.al.|[2407.03236](http://arxiv.org/abs/2407.03236)|**[link](https://github.com/abjadai/catt)**|
429 | |**2024-07-02**|**Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization**|Yuchen Hu et.al.|[2407.02243](http://arxiv.org/abs/2407.02243)|null|
430 | |**2024-07-02**|**TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations**|Xiaoxue Gao et.al.|[2407.01927](http://arxiv.org/abs/2407.01927)|null|
431 | |**2024-07-01**|**Lightweight Zero-shot Text-to-Speech with Mixture of Adapters**|Kenichi Fujita et.al.|[2407.01291](http://arxiv.org/abs/2407.01291)|null|
432 | |**2024-06-30**|**NAIST Simultaneous Speech Translation System for IWSLT 2024**|Yuka Ko et.al.|[2407.00826](http://arxiv.org/abs/2407.00826)|null|
433 | |**2024-06-30**|**An Attribute Interpolation Method in Speech Synthesis by Model Merging**|Masato Murata et.al.|[2407.00766](http://arxiv.org/abs/2407.00766)|null|
434 | |**2024-06-30**|**FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis**|Yinlin Guo et.al.|[2407.00753](http://arxiv.org/abs/2407.00753)|null|
435 | |**2024-07-02**|**Open-Source Conversational AI with SpeechBrain 1.0**|Mirco Ravanelli et.al.|[2407.00463](http://arxiv.org/abs/2407.00463)|null|
436 | |**2024-06-27**|**Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models**|Borodin Kirill Nikolayevich et.al.|[2406.19243](http://arxiv.org/abs/2406.19243)|null|
437 | |**2024-06-27**|**DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability**|Hyun Joon Park et.al.|[2406.19135](http://arxiv.org/abs/2406.19135)|**[link](https://github.com/winddori2002/dex-tts)**|
438 | |**2024-06-26**|**Automatic Speech Recognition for Hindi**|Anish Saha et.al.|[2406.18135](http://arxiv.org/abs/2406.18135)|null|
439 | |**2024-06-26**|**A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons**|Tzu-Yun Hung et.al.|[2406.18089](http://arxiv.org/abs/2406.18089)|null|
440 | |**2024-06-29**|**LLM-Driven Multimodal Opinion Expression Identification**|Bonian Jia et.al.|[2406.18088](http://arxiv.org/abs/2406.18088)|null|
441 | |**2024-06-26**|**E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS**|Sefik Emre Eskimez et.al.|[2406.18009](http://arxiv.org/abs/2406.18009)|**[link](https://github.com/microsoft/e2tts-test-suite)**|
442 | |**2024-06-25**|**Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment**|Paarth Neekhara et.al.|[2406.17957](http://arxiv.org/abs/2406.17957)|null|
443 | |**2024-06-22**|**A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge**|Xiaopeng Wang et.al.|[2406.17801](http://arxiv.org/abs/2406.17801)|null|
444 | |**2024-06-25**|**High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model**|Joun Yeop Lee et.al.|[2406.17310](http://arxiv.org/abs/2406.17310)|null|
445 | |**2024-06-25**|**Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation**|Yingting Li et.al.|[2406.17257](http://arxiv.org/abs/2406.17257)|null|
446 | |**2024-06-24**|**Exploring the Capability of Mamba in Speech Applications**|Koichi Miyazaki et.al.|[2406.16808](http://arxiv.org/abs/2406.16808)|null|
447 | |**2024-06-25**|**Towards Zero-Shot Text-To-Speech for Arabic Dialects**|Khai Duy Doan et.al.|[2406.16751](http://arxiv.org/abs/2406.16751)|null|
448 | |**2024-06-22**|**TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers**|Yakun Song et.al.|[2406.15752](http://arxiv.org/abs/2406.15752)|**[link](https://github.com/Ereboas/TacoLM)**|
449 | |**2024-06-21**|**InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions**|Yu Nakagome et.al.|[2406.14890](http://arxiv.org/abs/2406.14890)|null|
450 | |**2024-06-21**|**GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech**|Wenbin Wang et.al.|[2406.14875](http://arxiv.org/abs/2406.14875)|null|
451 | |**2024-06-21**|**DASB - Discrete Audio and Speech Benchmark**|Pooneh Mousavi et.al.|[2406.14294](http://arxiv.org/abs/2406.14294)|null|
452 | |**2024-06-18**|**Instruction Data Generation and Unsupervised Adaptation for Speech Language Models**|Vahid Noroozi et.al.|[2406.12946](http://arxiv.org/abs/2406.12946)|null|
453 | |**2024-06-17**|**DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer**|Keon Lee et.al.|[2406.11427](http://arxiv.org/abs/2406.11427)|null|
454 | |**2024-06-16**|**NAST: Noise Aware Speech Tokenization for Speech Language Models**|Shoval Messica et.al.|[2406.11037](http://arxiv.org/abs/2406.11037)|**[link](https://github.com/ShovalMessica/NAST)**|
455 | |**2024-06-16**|**Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis**|Xuehao Zhou et.al.|[2406.10844](http://arxiv.org/abs/2406.10844)|null|
456 | |**2024-06-14**|**Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice**|Shubham Gupta et.al.|[2406.10422](http://arxiv.org/abs/2406.10422)|null|
457 | |**2024-06-14**|**UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner**|Dongchao Yang et.al.|[2406.10056](http://arxiv.org/abs/2406.10056)|**[link](https://github.com/yangdongchao/llm-codec)**|
458 | |**2024-06-14**|**MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model**|Jiatong Shi et.al.|[2406.09869](http://arxiv.org/abs/2406.09869)|null|
459 | |**2024-06-13**|**DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage**|Kyra Wang et.al.|[2406.08820](http://arxiv.org/abs/2406.08820)|null|
460 | |**2024-06-13**|**Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems**|Zhengyang Chen et.al.|[2406.08812](http://arxiv.org/abs/2406.08812)|null|
461 | |**2024-06-13**|**DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing**|Neha Sahipjohn et.al.|[2406.08802](http://arxiv.org/abs/2406.08802)|null|
462 | |**2024-06-12**|**Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis**|Wing-Zin Leung et.al.|[2406.08568](http://arxiv.org/abs/2406.08568)|**[link](https://github.com/WingZLeung/TTDS)**|
463 | |**2024-06-12**|**Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data**|Yuma Shirahata et.al.|[2406.08111](http://arxiv.org/abs/2406.08111)|null|
464 | |**2024-06-12**|**VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech**|Ashishkumar Gudmalwar et.al.|[2406.08076](http://arxiv.org/abs/2406.08076)|null|
465 | |**2024-06-12**|**LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning**|Masaya Kawamura et.al.|[2406.07969](http://arxiv.org/abs/2406.07969)|**[link](https://github.com/line/libritts-p)**|
466 | |**2024-06-12**|**VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment**|Bing Han et.al.|[2406.07855](http://arxiv.org/abs/2406.07855)|null|
467 | |**2024-06-12**|**EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech**|Deok-Hyeon Cho et.al.|[2406.07803](http://arxiv.org/abs/2406.07803)|**[link](https://github.com/Choddeok/EmoSphere-TTS)**|
468 | |**2024-06-11**|**The Interspeech 2024 Challenge on Speech Processing Using Discrete Units**|Xuankai Chang et.al.|[2406.07725](http://arxiv.org/abs/2406.07725)|null|
469 | |**2024-06-11**|**Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?**|Qingkai Fang et.al.|[2406.07289](http://arxiv.org/abs/2406.07289)|null|
470 | |**2024-06-11**|**AudioMarkBench: Benchmarking Robustness of Audio Watermarking**|Hongbin Liu et.al.|[2406.06979](http://arxiv.org/abs/2406.06979)|**[link](https://github.com/moyangkuo/audiomarkbench)**|
471 | |**2024-06-11**|**Controlling Emotion in Text-to-Speech with Natural Language Prompts**|Thomas Bott et.al.|[2406.06406](http://arxiv.org/abs/2406.06406)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
472 | |**2024-06-10**|**Meta Learning Text-to-Speech Synthesis in over 7000 Languages**|Florian Lux et.al.|[2406.06403](http://arxiv.org/abs/2406.06403)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
473 | |**2024-06-10**|**MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance**|Semin Kim et.al.|[2406.05965](http://arxiv.org/abs/2406.05965)|null|
474 | |**2024-06-11**|**WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark**|Linhan Ma et.al.|[2406.05763](http://arxiv.org/abs/2406.05763)|**[link](https://github.com/dukGuo/valle-audiodec)**|
475 | |**2024-06-09**|**An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS**|Xiaofei Wang et.al.|[2406.05699](http://arxiv.org/abs/2406.05699)|null|
476 | |**2024-06-11**|**Text-aware and Context-aware Expressive Audiobook Speech Synthesis**|Dake Guo et.al.|[2406.05672](http://arxiv.org/abs/2406.05672)|null|
477 | |**2024-06-08**|**Autoregressive Diffusion Transformer for Text-to-Speech Synthesis**|Zhijun Liu et.al.|[2406.05551](http://arxiv.org/abs/2406.05551)|null|
478 | |**2024-06-08**|**VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers**|Sanyuan Chen et.al.|[2406.05370](http://arxiv.org/abs/2406.05370)|null|
479 | |**2024-06-07**|**Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis**|Ryan Langman et.al.|[2406.05298](http://arxiv.org/abs/2406.05298)|null|
480 | |**2024-06-07**|**XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model**|Edresson Casanova et.al.|[2406.04904](http://arxiv.org/abs/2406.04904)|**[link](https://github.com/Edresson/ZS-TTS-Evaluation)**|
481 | |**2024-06-07**|**TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking**|Junzuo Zhou et.al.|[2406.04840](http://arxiv.org/abs/2406.04840)|null|
482 | |**2024-06-07**|**Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study**|Chong Zhang et.al.|[2406.04633](http://arxiv.org/abs/2406.04633)|null|
483 | |**2024-06-06**|**Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis**|Théodor Lemerle et.al.|[2406.04467](http://arxiv.org/abs/2406.04467)|**[link](https://github.com/theodorblackbird/lina-speech)**|
484 | |**2024-06-06**|**Total-Duration-Aware Duration Modeling for Text-to-Speech Systems**|Sefik Emre Eskimez et.al.|[2406.04281](http://arxiv.org/abs/2406.04281)|null|
485 | |**2024-06-06**|**Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining**|Jinlong Xue et.al.|[2406.03714](http://arxiv.org/abs/2406.03714)|null|
486 | |**2024-06-06**|**Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model**|Jinlong Xue et.al.|[2406.03706](http://arxiv.org/abs/2406.03706)|null|
487 | |**2024-06-05**|**Style Mixture of Experts for Expressive Text-To-Speech Synthesis**|Ahad Jawaid et.al.|[2406.03637](http://arxiv.org/abs/2406.03637)|null|
488 | |**2024-06-07**|**Harder or Different? Understanding Generalization of Audio Deepfake Detection**|Nicolas M. Müller et.al.|[2406.03512](http://arxiv.org/abs/2406.03512)|null|
489 | |**2024-06-05**|**LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes**|Trung Dang et.al.|[2406.02897](http://arxiv.org/abs/2406.02897)|null|
490 | |**2024-06-04**|**Seed-TTS: A Family of High-Quality Versatile Speech Generation Models**|Philip Anastassiou et.al.|[2406.02430](http://arxiv.org/abs/2406.02430)|**[link](https://github.com/BytedanceSpeech/seed-tts-eval)**|
491 | |**2024-06-05**|**SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2406.02328](http://arxiv.org/abs/2406.02328)|null|
492 | |**2024-06-04**|**BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation**|Hui-Peng Du et.al.|[2406.02162](http://arxiv.org/abs/2406.02162)|null|
493 | |**2024-06-04**|**Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis**|Kun Zhou et.al.|[2406.02009](http://arxiv.org/abs/2406.02009)|null|
494 | |**2024-06-03**|**ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec**|Shengpeng Ji et.al.|[2406.01205](http://arxiv.org/abs/2406.01205)|**[link](https://github.com/jishengpeng/controlspeech)**|
495 | |**2024-06-03**|**Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training**|Jan Melechovsky et.al.|[2406.01018](http://arxiv.org/abs/2406.01018)|null|
496 | |**2024-06-02**|**Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback**|Chen Chen et.al.|[2406.00654](http://arxiv.org/abs/2406.00654)|null|
497 | |**2024-05-31**|**Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities**|Vicky Zayats et.al.|[2405.18669](http://arxiv.org/abs/2405.18669)|null|
498 | |**2024-05-28**|**TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation**|Chenyang Le et.al.|[2405.17809](http://arxiv.org/abs/2405.17809)|null|
499 | |**2024-05-27**|**RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis**|Haoxiang Shi et.al.|[2405.17028](http://arxiv.org/abs/2405.17028)|null|
500 | |**2024-05-24**|**Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition**|Zijin Gu et.al.|[2405.15216](http://arxiv.org/abs/2405.15216)|null|
501 | |**2024-05-23**|**Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models**|Jingyi Chen et.al.|[2405.14632](http://arxiv.org/abs/2405.14632)|null|
502 | |**2024-05-22**|**A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction**|Yue Li et.al.|[2405.13477](http://arxiv.org/abs/2405.13477)|null|
503 | |**2024-05-20**|**Multi-speaker Text-to-speech Training with Speaker Anonymized Data**|Wen-Chin Huang et.al.|[2405.11767](http://arxiv.org/abs/2405.11767)|null|
504 | |**2024-05-19**|**VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications**|Mikhail Konenkov et.al.|[2405.11537](http://arxiv.org/abs/2405.11537)|null|
505 | |**2024-05-18**|**Exploring speech style spaces with language models: Emotional TTS without emotion labels**|Shreeram Suresh Chandra et.al.|[2405.11413](http://arxiv.org/abs/2405.11413)|null|
506 | |**2024-05-16**|**Faces that Speak: Jointly Synthesising Talking Face and Speech from Text**|Youngjoon Jang et.al.|[2405.10272](http://arxiv.org/abs/2405.10272)|null|
507 | |**2024-05-16**|**Building a Luganda Text-to-Speech Model From Crowdsourced Data**|Sulaiman Kagumire et.al.|[2405.10211](http://arxiv.org/abs/2405.10211)|null|
508 | |**2024-05-16**|**Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model**|Siyang Wang et.al.|[2405.09768](http://arxiv.org/abs/2405.09768)|null|
509 | |**2024-05-15**|**Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer**|Weifei Jin et.al.|[2405.09470](http://arxiv.org/abs/2405.09470)|null|
510 | |**2024-05-15**|**Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2405.09171](http://arxiv.org/abs/2405.09171)|null|
511 | |**2024-05-14**|**PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset**|Yang Hou et.al.|[2405.08838](http://arxiv.org/abs/2405.08838)|**[link](https://github.com/tobuta/PolyGlotFake)**|
512 | |**2024-04-30**|**Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech**|Hankun Wang et.al.|[2404.19723](http://arxiv.org/abs/2404.19723)|null|
513 | |**2024-04-29**|**MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis**|Xiang Li et.al.|[2404.18398](http://arxiv.org/abs/2404.18398)|null|
514 | |**2024-04-28**|**USAT: A Universal Speaker-Adaptive Text-to-Speech Approach**|Wenbin Wang et.al.|[2404.18094](http://arxiv.org/abs/2404.18094)|**[link](https://github.com/mushanshanshan/esltts)**|
515 | |**2024-04-27**|**TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality**|Tiantian Feng et.al.|[2404.17983](http://arxiv.org/abs/2404.17983)|null|
516 | |**2024-04-26**|**An RFP dataset for Real, Fake, and Partially fake audio detection**|Abdulazeez AlAli et.al.|[2404.17721](http://arxiv.org/abs/2404.17721)|null|
517 | |**2024-04-23**|**StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations**|Sen Liu et.al.|[2404.14946](http://arxiv.org/abs/2404.14946)|null|
518 | |**2024-04-23**|**Retrieval-Augmented Audio Deepfake Detection**|Zuheng Kang et.al.|[2404.13892](http://arxiv.org/abs/2404.13892)|null|
519 | |**2024-04-14**|**Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling**|Quanxiu Wang et.al.|[2404.09192](http://arxiv.org/abs/2404.09192)|null|
520 | |**2024-04-11**|**Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network**|Mayura Manawadu et.al.|[2404.07807](http://arxiv.org/abs/2404.07807)|null|
521 | |**2024-04-18**|**Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness**|Xincan Feng et.al.|[2404.06714](http://arxiv.org/abs/2404.06714)|**[link](https://github.com/xincanfeng/vitsgpt)**|
522 | |**2024-04-10**|**CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations**|Leying Zhang et.al.|[2404.06690](http://arxiv.org/abs/2404.06690)|null|
523 | |**2024-04-10**|**The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge**|Yiwei Guo et.al.|[2404.06079](http://arxiv.org/abs/2404.06079)|null|
524 | |**2024-04-07**|**Cross-Domain Audio Deepfake Detection: Dataset and Analysis**|Yuang Li et.al.|[2404.04904](http://arxiv.org/abs/2404.04904)|null|
525 | |**2024-04-06**|**HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks**|Yingting Li et.al.|[2404.04645](http://arxiv.org/abs/2404.04645)|**[link](https://github.com/declare-lab/hypertts)**|
526 | |**2024-04-18**|**Open vocabulary keyword spotting through transfer learning from speech synthesis**|Kesavaraj V et.al.|[2404.03914](http://arxiv.org/abs/2404.03914)|null|
527 | |**2024-04-06**|**RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis**|Detai Xin et.al.|[2404.03204](http://arxiv.org/abs/2404.03204)|null|
528 | |**2024-04-03**|**CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech**|Jaehyeon Kim et.al.|[2404.02781](http://arxiv.org/abs/2404.02781)|null|
529 | |**2024-04-13**|**PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders**|Yu Pan et.al.|[2404.02702](http://arxiv.org/abs/2404.02702)|null|
530 | |**2024-03-31**|**Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation**|Rohan Chaudhury et.al.|[2404.01339](http://arxiv.org/abs/2404.01339)|**[link](https://github.com/rohan-chaudhury/humane-speech-synthesis-through-zero-shot-emotion-and-disfluency-generation)**|
531 | |**2024-03-28**|**A Review of Multi-Modal Large Language and Vision Models**|Kilian Carolan et.al.|[2404.01322](http://arxiv.org/abs/2404.01322)|null|
532 | |**2024-04-09**|**KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis**|Adal Abilbekov et.al.|[2404.01033](http://arxiv.org/abs/2404.01033)|**[link](https://github.com/is2ai/kazemotts)**|
533 | |**2024-03-31**|**CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models**|Xiang Li et.al.|[2404.00569](http://arxiv.org/abs/2404.00569)|**[link](https://github.com/xiangli2022/cm-tts)**|
534 | |**2024-03-25**|**VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild**|Puyuan Peng et.al.|[2403.16973](http://arxiv.org/abs/2403.16973)|**[link](https://github.com/jasonppy/voicecraft)**|
535 | |**2024-03-20**|**Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning**|Shivam Ratnakant Mhaskar et.al.|[2403.15469](http://arxiv.org/abs/2403.15469)|null|
536 | |**2024-03-20**|**UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge**|Wataru Nakata et.al.|[2403.13720](http://arxiv.org/abs/2403.13720)|null|
537 | |**2024-03-20**|**Building speech corpus with diverse voice characteristics for its prompt-based representation**|Aya Watanabe et.al.|[2403.13353](http://arxiv.org/abs/2403.13353)|null|
538 | |**2024-03-17**|**Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations**|Claudio Pinhanez et.al.|[2403.11209](http://arxiv.org/abs/2403.11209)|null|
539 | |**2024-03-17**|**EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech**|Ziqi Liang et.al.|[2403.08164](http://arxiv.org/abs/2403.08164)|null|
540 | |**2024-03-09**|**HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling**|Chunhui Wang et.al.|[2403.05989](http://arxiv.org/abs/2403.05989)|null|
541 | |**2024-03-05**|**AttentionStitch: How Attention Solves the Speech Editing Problem**|Antonios Alexos et.al.|[2403.04804](http://arxiv.org/abs/2403.04804)|null|
542 | |**2024-03-07**|**Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation**|Sai Akarsh et.al.|[2403.04178](http://arxiv.org/abs/2403.04178)|null|
543 | |**2024-03-27**|**NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models**|Zeqian Ju et.al.|[2403.03100](http://arxiv.org/abs/2403.03100)|null|
544 | |**2024-03-04**|**Brilla AI: AI Contestant for the National Science and Maths Quiz**|George Boateng et.al.|[2403.01699](http://arxiv.org/abs/2403.01699)|**[link](https://github.com/nsmq-ai/nsmqai)**|
545 | |**2024-03-02**|**Towards Accurate Lip-to-Speech Synthesis in-the-Wild**|Sindhu Hegde et.al.|[2403.01087](http://arxiv.org/abs/2403.01087)|null|
546 | |**2024-02-29**|**Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data**|Takaaki Saeki et.al.|[2402.18932](http://arxiv.org/abs/2402.18932)|null|
547 | |**2024-02-26**|**An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation**|Ahmet Gunduz et.al.|[2402.16380](http://arxiv.org/abs/2402.16380)|**[link](https://github.com/aixplain/tts-qa)**|
548 | |**2024-02-22**|**Efficient data selection employing Semantic Similarity-based Graph Structures for model training**|Roxana Petcu et.al.|[2402.14888](http://arxiv.org/abs/2402.14888)|null|
549 | |**2024-02-22**|**Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition**|Rendi Chevi et.al.|[2402.14523](http://arxiv.org/abs/2402.14523)|null|
550 | |**2024-02-19**|**On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models**|Miri Varshavsky-Hassid et.al.|[2402.12423](http://arxiv.org/abs/2402.12423)|null|
551 | |**2024-02-19**|**Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting**|Haolin Chen et.al.|[2402.12220](http://arxiv.org/abs/2402.12220)|**[link](https://github.com/idiap/bayesian-peft)**|
552 | |**2024-02-18**|**Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru**|Zining Wang et.al.|[2402.11571](http://arxiv.org/abs/2402.11571)|null|
553 | |**2024-02-14**|**MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech**|Shengpeng Ji et.al.|[2402.09378](http://arxiv.org/abs/2402.09378)|null|
554 | |**2024-02-15**|**BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data**|Mateusz Łajszczak et.al.|[2402.08093](http://arxiv.org/abs/2402.08093)|null|
555 | |**2024-03-04**|**Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like**|Naoyuki Kanda et.al.|[2402.07383](http://arxiv.org/abs/2402.07383)|null|
556 | |**2024-02-09**|**A New Approach to Voice Authenticity**|Nicolas M. Müller et.al.|[2402.06304](http://arxiv.org/abs/2402.06304)|null|
557 | |**2024-02-08**|**Unified Speech-Text Pretraining for Spoken Dialog Modeling**|Heeseung Kim et.al.|[2402.05706](http://arxiv.org/abs/2402.05706)|null|
558 | |**2024-02-05**|**Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations**|Álvaro Martín-Cortinas et.al.|[2402.03407](http://arxiv.org/abs/2402.03407)|null|
559 | |**2024-02-02**|**Natural language guidance of high-fidelity text-to-speech with synthetic annotations**|Dan Lyth et.al.|[2402.01912](http://arxiv.org/abs/2402.01912)|null|
560 | |**2024-01-23**|**Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization**|Wei-Ping Huang et.al.|[2402.01692](http://arxiv.org/abs/2402.01692)|null|
561 | |**2024-02-01**|**Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech**|Dong Yang et.al.|[2402.00288](http://arxiv.org/abs/2402.00288)|null|
562 | |**2024-02-01**|**PAM: Prompting Audio-Language Models for Audio Quality Assessment**|Soham Deshmukh et.al.|[2402.00282](http://arxiv.org/abs/2402.00282)|**[link](https://github.com/soham97/pam)**|
563 | |**2024-01-31**|**Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2**|Jiatong Shi et.al.|[2401.17619](http://arxiv.org/abs/2401.17619)|**[link](https://github.com/espnet/espnet)**|
564 | |**2024-01-28**|**MunTTS: A Text-to-Speech System for Mundari**|Varun Gumma et.al.|[2401.15579](http://arxiv.org/abs/2401.15579)|null|
565 | |**2024-01-30**|**VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech**|Chenpeng Du et.al.|[2401.14321](http://arxiv.org/abs/2401.14321)|null|
566 | |**2024-01-25**|**Text to speech synthesis**|Harini s et.al.|[2401.13891](http://arxiv.org/abs/2401.13891)|null|
567 | |**2024-01-25**|**SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation**|Dong Zhang et.al.|[2401.13527](http://arxiv.org/abs/2401.13527)|**[link](https://github.com/0nutation/speechgpt)**|
568 | |**2024-01-22**|**Benchmarking Large Multimodal Models against Common Corruptions**|Jiawei Zhang et.al.|[2401.11943](http://arxiv.org/abs/2401.11943)|**[link](https://github.com/sail-sg/mmcbench)**|
569 | |**2024-01-22**|**Adversarial speech for voice privacy protection from Personalized Speech generation**|Shihao Chen et.al.|[2401.11857](http://arxiv.org/abs/2401.11857)|null|
570 | |**2024-02-16**|**Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis**|Vinotha R et.al.|[2401.11771](http://arxiv.org/abs/2401.11771)|null|
571 | |**2024-01-19**|**Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech**|Abhinav Garg et.al.|[2401.10465](http://arxiv.org/abs/2401.10465)|null|
572 | |**2024-02-28**|**MLAAD: The Multi-Language Audio Anti-Spoofing Dataset**|Nicolas M. Müller et.al.|[2401.09512](http://arxiv.org/abs/2401.09512)|null|
573 | |**2024-01-15**|**MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory**|Robert G. Kimelman et.al.|[2401.07967](http://arxiv.org/abs/2401.07967)|null|
574 | |**2024-01-14**|**ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering**|Yakun Song et.al.|[2401.07333](http://arxiv.org/abs/2401.07333)|null|
575 | |**2024-01-12**|**Multi-Task Learning for Front-End Text Processing in TTS**|Wonjune Kang et.al.|[2401.06321](http://arxiv.org/abs/2401.06321)|**[link](https://github.com/facebookresearch/llama-hd-dataset)**|
576 | |**2024-01-11**|**End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2**|Aniket Tathe et.al.|[2401.06183](http://arxiv.org/abs/2401.06183)|null|
577 | |**2024-01-11**|**Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection**|Lian Huang et.al.|[2401.05614](http://arxiv.org/abs/2401.05614)|null|
578 | |**2024-01-10**|**Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters**|Kenichi Fujita et.al.|[2401.05111](http://arxiv.org/abs/2401.05111)|null|
579 | |**2024-01-07**|**Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments**|Zhonghao Shi et.al.|[2401.03581](http://arxiv.org/abs/2401.03581)|null|
580 | |**2024-01-07**|**Transfer the linguistic representations from TTS to accent conversion with non-parallel data**|Xi Chen et.al.|[2401.03538](http://arxiv.org/abs/2401.03538)|null|
581 | |**2024-01-03**|**Incremental FastPitch: Chunk-based High Quality Text to Speech**|Muyang Du et.al.|[2401.01755](http://arxiv.org/abs/2401.01755)|null|
582 | |**2024-01-03**|**Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction**|Minchan Kim et.al.|[2401.01498](http://arxiv.org/abs/2401.01498)|null|
583 | |**2023-12-18**|**Assisting Blind People Using Object Detection with Vocal Feedback**|Heba Najm et.al.|[2401.01362](http://arxiv.org/abs/2401.01362)|null|
584 | |**2023-12-30**|**Boosting Large Language Model for Speech Synthesis: An Empirical Study**|Hongkun Hao et.al.|[2401.00246](http://arxiv.org/abs/2401.00246)|null|
585 | |**2024-01-01**|**Normalization of Lithuanian Text Using Regular Expressions**|Pijus Kasparaitis et.al.|[2312.17660](http://arxiv.org/abs/2312.17660)|null|
586 | |**2023-12-27**|**AE-Flow: AutoEncoder Normalizing Flow**|Jakub Mosiński et.al.|[2312.16552](http://arxiv.org/abs/2312.16552)|null|
587 | |**2023-12-22**|**Creating New Voices using Normalizing Flows**|Piotr Bilinski et.al.|[2312.14569](http://arxiv.org/abs/2312.14569)|null|
588 | |**2023-12-22**|**ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations**|Cheng Gong et.al.|[2312.14398](http://arxiv.org/abs/2312.14398)|null|
589 | |**2023-12-19**|**External Knowledge Augmented Polyphone Disambiguation Using Large Language Model**|Chen Li et.al.|[2312.11920](http://arxiv.org/abs/2312.11920)|null|
590 | |**2023-12-17**|**A review-based study on different Text-to-Speech technologies**|Md. Jalal Uddin Chowdhury et.al.|[2312.11563](http://arxiv.org/abs/2312.11563)|null|
591 | |**2024-01-31**|**MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis**|Wenhao Guan et.al.|[2312.10687](http://arxiv.org/abs/2312.10687)|null|
592 | |**2024-02-22**|**Amphion: An Open-Source Audio, Music and Speech Generation Toolkit**|Xueyao Zhang et.al.|[2312.09911](http://arxiv.org/abs/2312.09911)|**[link](https://github.com/open-mmlab/amphion)**|
593 | |**2023-12-11**|**Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism**|Georgios Milis et.al.|[2312.06613](http://arxiv.org/abs/2312.06613)|**[link](https://github.com/g-milis/NEUTART)**|
594 | |**2023-12-08**|**An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis**|Via Nielson et.al.|[2312.05415](http://arxiv.org/abs/2312.05415)|null|
595 | |**2023-12-06**|**Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis**|Zehua Chen et.al.|[2312.03491](http://arxiv.org/abs/2312.03491)|null|
596 | |**2023-12-02**|**Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning**|Raviraj Joshi et.al.|[2312.01107](http://arxiv.org/abs/2312.01107)|null|
597 | |**2023-12-02**|**Code-Mixed Text to Speech Synthesis under Low-Resource Constraints**|Raviraj Joshi et.al.|[2312.01103](http://arxiv.org/abs/2312.01103)|null|
598 | |**2023-11-29**|**Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes**|Pavel Korshunov et.al.|[2311.17655](http://arxiv.org/abs/2311.17655)|null|
599 | |**2024-02-06**|**Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech**|Enting Zhou et.al.|[2311.14816](http://arxiv.org/abs/2311.14816)|**[link](https://github.com/ETZET/SpeechEmotionAVLearning)**|
600 | |**2023-12-07**|**Guided Flows for Generative Modeling and Decision Making**|Qinqing Zheng et.al.|[2311.13443](http://arxiv.org/abs/2311.13443)|null|
601 | |**2023-11-27**|**HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis**|Sang-Hoon Lee et.al.|[2311.12454](http://arxiv.org/abs/2311.12454)|**[link](https://github.com/sh-lee-prml/hierspeechpp)**|
602 | |**2023-11-18**|**Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots**|Farideh Majidi et.al.|[2311.11116](http://arxiv.org/abs/2311.11116)|null|
603 | |**2023-11-18**|**Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys**|Gabriel Cosache et.al.|[2311.11030](http://arxiv.org/abs/2311.11030)|null|
604 | |**2023-11-17**|**A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness**|Mathias Vogel et.al.|[2311.10804](http://arxiv.org/abs/2311.10804)|null|
605 | |**2023-11-16**|**Improving fairness for spoken language understanding in atypical speech with Text-to-Speech**|Helin Wang et.al.|[2311.10149](http://arxiv.org/abs/2311.10149)|**[link](https://github.com/wanghelin1997/aty-tts)**|
606 | |**2024-02-02**|**DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation**|Jianzong Wang et.al.|[2311.07965](http://arxiv.org/abs/2311.07965)|null|
607 | |**2023-11-12**|**ChatAnything: Facetime Chat with LLM-Enhanced Personas**|Yilin Zhao et.al.|[2311.06772](http://arxiv.org/abs/2311.06772)|null|
608 | |**2023-11-11**|**NewsGPT: ChatGPT Integration for Robot-Reporter**|Abdelhadi Hireche et.al.|[2311.06640](http://arxiv.org/abs/2311.06640)|**[link](https://github.com/aeh1707/NewsGPT_Pepper)**|
609 | |**2023-11-08**|**Synthetic Speaking Children -- Why We Need Them and How to Make Them**|Muhammad Ali Farooq et.al.|[2311.06307](http://arxiv.org/abs/2311.06307)|null|
610 | |**2023-09-25**|**Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image**|Minki Kang et.al.|[2311.05844](http://arxiv.org/abs/2311.05844)|null|
611 | |**2023-11-07**|**Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning**|Rishabh Jain et.al.|[2311.04313](http://arxiv.org/abs/2311.04313)|**[link](https://github.com/c3imaging/child_tts_fastpitch)**|
612 | |**2023-11-07**|**Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment**|Jakir Hasan et.al.|[2311.03792](http://arxiv.org/abs/2311.03792)|null|
613 | |**2023-11-08**|**Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction**|Minchan Kim et.al.|[2311.02898](http://arxiv.org/abs/2311.02898)|null|
614 | |**2023-11-02**|**Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations**|Hanglei Zhang et.al.|[2311.01260](http://arxiv.org/abs/2311.01260)|null|
615 | |**2023-11-02**|**E3 TTS: Easy End-to-End Diffusion-based Text to Speech**|Yuan Gao et.al.|[2311.00945](http://arxiv.org/abs/2311.00945)|null|
616 | |**2023-10-31**|**An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation**|Yingjie Zhou et.al.|[2310.20251](http://arxiv.org/abs/2310.20251)|**[link](https://github.com/zyj-2000/cumt_2d_photospeaker)**|
617 | |**2023-10-27**|**Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN**|Neeraj Kumar et.al.|[2310.18169](http://arxiv.org/abs/2310.18169)|null|
618 | |**2023-10-25**|**ArTST: Arabic Text and Speech Transformer**|Hawau Olamide Toyin et.al.|[2310.16621](http://arxiv.org/abs/2310.16621)|**[link](https://github.com/mbzuai-nlp/artst)**|
619 | |**2023-10-25**|**Generative Pre-training for Speech with Flow Matching**|Alexander H. Liu et.al.|[2310.16338](http://arxiv.org/abs/2310.16338)|null|
620 | |**2023-10-23**|**DPP-TTS: Diversifying prosodic features of speech via determinantal point processes**|Seongho Joo et.al.|[2310.14663](http://arxiv.org/abs/2310.14663)|null|
621 | |**2023-10-22**|**An overview of text-to-speech systems and media applications**|Mohammad Reza Hasanabadi et.al.|[2310.14301](http://arxiv.org/abs/2310.14301)|null|
622 | |**2023-10-14**|**Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling**|Tiberiu Boros et.al.|[2310.09636](http://arxiv.org/abs/2310.09636)|**[link](https://github.com/tiberiu44/TTS-Cube)**|
623 | |**2023-10-14**|**Attentive Multi-Layer Perceptron for Non-autoregressive Generation**|Shuyang Jiang et.al.|[2310.09512](http://arxiv.org/abs/2310.09512)|**[link](https://github.com/shark-nlp/attentivemlp)**|
624 | |**2023-12-22**|**Crowdsourced and Automatic Speech Prominence Estimation**|Max Morrison et.al.|[2310.08464](http://arxiv.org/abs/2310.08464)|**[link](https://github.com/reseval/reseval)**|
625 | |**2023-10-12**|**On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition**|Nick Rossenbach et.al.|[2310.08132](http://arxiv.org/abs/2310.08132)|null|
626 | |**2023-10-12**|**Vec-Tok Speech: speech vectorization and tokenization for neural speech generation**|Xinfa Zhu et.al.|[2310.07246](http://arxiv.org/abs/2310.07246)|**[link](https://github.com/bakerbunker/vectok)**|
627 | |**2023-10-10**|**Prosody Analysis of Audiobooks**|Charuta Pethe et.al.|[2310.06930](http://arxiv.org/abs/2310.06930)|null|
628 | |**2023-10-09**|**JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions**|Detai Xin et.al.|[2310.06072](http://arxiv.org/abs/2310.06072)|null|
629 | |**2024-01-09**|**Unified speech and gesture synthesis using flow matching**|Shivam Mehta et.al.|[2310.05181](http://arxiv.org/abs/2310.05181)|null|
630 | |**2023-10-08**|**Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset**|Ze Liu et.al.|[2310.04982](http://arxiv.org/abs/2310.04982)|null|
631 | |**2023-10-11**|**LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT**|Jiaming Wang et.al.|[2310.04673](http://arxiv.org/abs/2310.04673)|null|
632 | |**2024-01-22**|**Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis**|Jae-Sung Bae et.al.|[2310.03538](http://arxiv.org/abs/2310.03538)|null|
633 | |**2023-10-07**|**The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains**|Erica Cooper et.al.|[2310.02640](http://arxiv.org/abs/2310.02640)|null|
634 | |**2023-10-02**|**Towards human-like spoken dialogue generation between AI agents from written dialogue**|Kentaro Mitsui et.al.|[2310.01088](http://arxiv.org/abs/2310.01088)|null|
635 | |**2023-10-01**|**Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech**|Dareen Alharthi et.al.|[2310.00706](http://arxiv.org/abs/2310.00706)|null|
636 | |**2024-03-11**|**Fewer-token Neural Speech Codec with Time-invariant Codes**|Yong Ren et.al.|[2310.00014](http://arxiv.org/abs/2310.00014)|**[link](https://github.com/y-ren16/ticodec)**|
637 | |**2024-01-31**|**ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech**|Wenhao Guan et.al.|[2309.17056](http://arxiv.org/abs/2309.17056)|null|
638 | |**2023-09-29**|**Low-Resource Self-Supervised Learning with SSL-Enhanced TTS**|Po-chun Hsu et.al.|[2309.17020](http://arxiv.org/abs/2309.17020)|null|
639 | |**2023-09-29**|**Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features**|Yuxiang Zhang et.al.|[2309.16954](http://arxiv.org/abs/2309.16954)|null|
640 | |**2023-12-18**|**High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models**|Chunyu Qiang et.al.|[2309.15512](http://arxiv.org/abs/2309.15512)|null|
641 | |**2024-01-09**|**BiSinger: Bilingual Singing Voice Synthesis**|Huali Zhou et.al.|[2309.14089](http://arxiv.org/abs/2309.14089)|**[link](https://github.com/BiSinger-SVS/BiSinger)**|
642 | |**2023-10-07**|**HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS**|Dake Guo et.al.|[2309.13907](http://arxiv.org/abs/2309.13907)|null|
643 | |**2023-09-24**|**VoiceLDM: Text-to-Speech with Environmental Context**|Yeonghyeon Lee et.al.|[2309.13664](http://arxiv.org/abs/2309.13664)|null|
644 | |**2023-09-24**|**Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control**|Aya Watanabe et.al.|[2309.13509](http://arxiv.org/abs/2309.13509)|null|
645 | |**2023-09-22**|**DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2309.12792](http://arxiv.org/abs/2309.12792)|null|
646 | |**2023-09-22**|**Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts**|Shun Lei et.al.|[2309.11977](http://arxiv.org/abs/2309.11977)|null|
647 | |**2023-09-21**|**The Impact of Silence on Speech Anti-Spoofing**|Yuxiang Zhang et.al.|[2309.11827](http://arxiv.org/abs/2309.11827)|null|
648 | |**2023-09-21**|**Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech**|Rui Liu et.al.|[2309.11724](http://arxiv.org/abs/2309.11724)|**[link](https://github.com/ai-s2-lab/emopp)**|
649 | |**2023-09-20**|**Speak While You Think: Streaming Speech Synthesis During Text Generation**|Avihu Dekel et.al.|[2309.11210](http://arxiv.org/abs/2309.11210)|null|
650 | |**2023-09-20**|**Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model**|Xinyu Zhou et.al.|[2309.11000](http://arxiv.org/abs/2309.11000)|**[link](https://github.com/XinyuZhou2000/Spoken_Dialogue)**|
651 | |**2023-09-19**|**Exploring Speech Enhancement for Low-resource Speech Synthesis**|Zhaoheng Ni et.al.|[2309.10795](http://arxiv.org/abs/2309.10795)|null|
652 | |**2023-09-19**|**Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition**|Ziyang Ma et.al.|[2309.10294](http://arxiv.org/abs/2309.10294)|null|
653 | |**2023-09-17**|**Augmenting text for spoken language understanding with Large Language Models**|Roshan Sharma et.al.|[2309.09390](http://arxiv.org/abs/2309.09390)|null|
654 | |**2023-09-16**|**FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework**|Jianzong Wang et.al.|[2309.08837](http://arxiv.org/abs/2309.08837)|null|
655 | |**2023-09-15**|**Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech**|Dariusz Piotrowski et.al.|[2309.08255](http://arxiv.org/abs/2309.08255)|null|
656 | |**2023-09-15**|**HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods**|Hyun-seo Shin et.al.|[2309.08208](http://arxiv.org/abs/2309.08208)|**[link](https://github.com/talkingnow/HM-Conformer)**|
657 | |**2023-12-27**|**PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions**|Reo Shimizu et.al.|[2309.08140](http://arxiv.org/abs/2309.08140)|null|
658 | |**2023-09-15**|**Diversity-based core-set selection for text-to-speech with linguistic and acoustic features**|Kentaro Seki et.al.|[2309.08127](http://arxiv.org/abs/2309.08127)|null|
659 | |**2023-09-14**|**Direct Text to Speech Translation System using Acoustic Units**|Victoria Mingote et.al.|[2309.07478](http://arxiv.org/abs/2309.07478)|null|
660 | |**2023-10-07**|**FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec**|Zhihao Du et.al.|[2309.07405](http://arxiv.org/abs/2309.07405)|**[link](https://github.com/alibaba-damo-academy/funcodec)**|
661 | |**2023-09-13**|**DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation**|Zhichao Wu et.al.|[2309.06787](http://arxiv.org/abs/2309.06787)|null|
662 | |**2023-09-11**|**Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP**|Jinzuomu Zhong et.al.|[2309.05423](http://arxiv.org/abs/2309.05423)|**[link](https://github.com/jzmzhong/Automatic-Prosody-Annotator-with-SSWP-CLAP)**|
663 | |**2024-01-16**|**VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching**|Yiwei Guo et.al.|[2309.05027](http://arxiv.org/abs/2309.05027)|**[link](https://github.com/X-LANCE/VoiceFlow-TTS)**|
664 | |**2023-09-08**|**Cross-Utterance Conditioned VAE for Speech Generation**|Yang Li et.al.|[2309.04156](http://arxiv.org/abs/2309.04156)|null|
665 | |**2023-09-07**|**Large-Scale Automatic Audiobook Creation**|Brendan Walsh et.al.|[2309.03926](http://arxiv.org/abs/2309.03926)|null|
666 | |**2023-09-11**|**GRASS: Unified Generation Model for Speech-to-Semantic Tasks**|Aobo Xia et.al.|[2309.02780](http://arxiv.org/abs/2309.02780)|null|
667 | |**2023-09-12**|**MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023**|Zhihang Xu et.al.|[2309.02743](http://arxiv.org/abs/2309.02743)|null|
668 | |**2023-10-12**|**PromptTTS 2: Describing and Generating Voices with Text Prompt**|Yichong Leng et.al.|[2309.02285](http://arxiv.org/abs/2309.02285)|null|
669 | |**2023-09-04**|**A Comparative Analysis of Pretrained Language Models for Text-to-Speech**|Marcel Granero-Moya et.al.|[2309.01576](http://arxiv.org/abs/2309.01576)|null|
670 | |**2023-09-02**|**DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin**|Tao Li et.al.|[2309.00883](http://arxiv.org/abs/2309.00883)|null|
671 | |**2023-12-18**|**Learning Speech Representation From Contrastive Token-Acoustic Pretraining**|Chunyu Qiang et.al.|[2309.00424](http://arxiv.org/abs/2309.00424)|null|
672 | |**2023-09-01**|**The FruitShell French synthesis system at the Blizzard 2023 Challenge**|Xin Qi et.al.|[2309.00223](http://arxiv.org/abs/2309.00223)|null|
673 | |**2023-08-31**|**QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning**|Haohan Guo et.al.|[2309.00126](http://arxiv.org/abs/2309.00126)|null|
674 | |**2024-01-23**|**SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models**|Xin Zhang et.al.|[2308.16692](http://arxiv.org/abs/2308.16692)|**[link](https://github.com/0nutation/uslm)**|
675 | |**2023-08-31**|**Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis**|Weiqin Li et.al.|[2308.16593](http://arxiv.org/abs/2308.16593)|null|
676 | |**2023-08-31**|**Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information**|Jie Chen et.al.|[2308.16577](http://arxiv.org/abs/2308.16577)|null|
677 | |**2023-08-31**|**LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech**|Jie Chen et.al.|[2308.16569](http://arxiv.org/abs/2308.16569)|null|
678 | |**2023-08-30**|**CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis**|Yi Meng et.al.|[2308.16021](http://arxiv.org/abs/2308.16021)|null|
679 | |**2023-09-01**|**The DeepZen Speech Synthesis System for Blizzard Challenge 2023**|Christophe Veaux et.al.|[2308.15945](http://arxiv.org/abs/2308.15945)|null|
680 | |**2023-08-28**|**Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech**|Hyungchan Yoon et.al.|[2308.14909](http://arxiv.org/abs/2308.14909)|null|
681 | |**2023-09-04**|**Rep2wav: Noise Robust text-to-speech Using self-supervised representations**|Qiushi Zhu et.al.|[2308.14553](http://arxiv.org/abs/2308.14553)|null|
682 | |**2023-08-28**|**TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models**|Shengpeng Ji et.al.|[2308.14430](http://arxiv.org/abs/2308.14430)|**[link](https://github.com/jishengpeng/TextrolSpeech)**|
683 | |**2023-09-02**|**Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder**|Xuyuan Li et.al.|[2308.13365](http://arxiv.org/abs/2308.13365)|null|
684 | |**2023-08-24**|**Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations**|Wenbin Wang et.al.|[2308.13007](http://arxiv.org/abs/2308.13007)|null|
685 | |**2023-09-22**|**Sparks of Large Audio Models: A Survey and Outlook**|Siddique Latif et.al.|[2308.12792](http://arxiv.org/abs/2308.12792)|null|
686 | |**2023-10-25**|**SeamlessM4T: Massively Multilingual & Multimodal Machine Translation**|Seamless Communication et.al.|[2308.11596](http://arxiv.org/abs/2308.11596)|**[link](https://github.com/facebookresearch/seamless_communication)**|
687 | |**2023-08-31**|**Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models**|Heyang Xue et.al.|[2308.10428](http://arxiv.org/abs/2308.10428)|null|
688 | |**2023-08-16**|**AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis**|Hrishikesh Viswanath et.al.|[2308.08577](http://arxiv.org/abs/2308.08577)|null|
689 | |**2023-08-14**|**SpeechX: Neural Codec Language Model as a Versatile Speech Transformer**|Xiaofei Wang et.al.|[2308.06873](http://arxiv.org/abs/2308.06873)|null|
690 | |**2023-08-12**|**Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation**|Zhichao Wang et.al.|[2308.06457](http://arxiv.org/abs/2308.06457)|**[link](https://github.com/zhichaowang970201/text-to-video)**|
691 | |**2023-09-09**|**AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining**|Haohe Liu et.al.|[2308.05734](http://arxiv.org/abs/2308.05734)|**[link](https://github.com/haoheliu/AudioLDM2)**|
692 | |**2023-08-09**|**Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay**|Leixian Shen et.al.|[2308.04703](http://arxiv.org/abs/2308.04703)|null|
693 | |**2023-08-08**|**Towards an AI to Win Ghana's National Science and Maths Quiz**|George Boateng et.al.|[2308.04333](http://arxiv.org/abs/2308.04333)|**[link](https://github.com/nsmq-ai/nsmqai)**|
694 | |**2023-08-08**|**WonderFlow: Narration-Centric Design of Animated Data Videos**|Yun Wang et.al.|[2308.04040](http://arxiv.org/abs/2308.04040)|null|
695 | |**2023-08-04**|**Let's Give a Voice to Conversational Agents in Virtual Reality**|Michele Yin et.al.|[2308.02665](http://arxiv.org/abs/2308.02665)|**[link](https://github.com/sislab-unitn/Let-s-Give-a-Voice-to-Conversational-Agents-in-VR)**|
696 | |**2023-08-03**|**Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation**|Minsu Kim et.al.|[2308.01831](http://arxiv.org/abs/2308.01831)|**[link](https://github.com/choijeongsoo/utut)**|
697 | |**2023-08-02**|**SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis**|Ramanan Sivaguru et.al.|[2308.01018](http://arxiv.org/abs/2308.01018)|null|
698 | |**2023-07-07**|**Artificial Eye for the Blind**|Abhinav Benagi et.al.|[2308.00801](http://arxiv.org/abs/2308.00801)|null|
699 | |**2023-07-31**|**Multilingual context-based pronunciation learning for Text-to-Speech**|Giulia Comini et.al.|[2307.16709](http://arxiv.org/abs/2307.16709)|null|
700 | |**2023-07-31**|**Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech**|Guangyan Zhang et.al.|[2307.16679](http://arxiv.org/abs/2307.16679)|null|
701 | |**2023-07-31**|**Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings**|Manuel Sam Ribeiro et.al.|[2307.16643](http://arxiv.org/abs/2307.16643)|null|
702 | |**2023-07-31**|**DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training**|Hyung-Seok Oh et.al.|[2307.16549](http://arxiv.org/abs/2307.16549)|**[link](https://github.com/hsoh0306/diffprosody)**|
703 | |**2023-07-31**|**VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design**|Jungil Kong et.al.|[2307.16430](http://arxiv.org/abs/2307.16430)|null|
704 | |**2023-07-30**|**Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation**|Yuanhao Chen et.al.|[2307.16199](http://arxiv.org/abs/2307.16199)|**[link](https://github.com/edward-martyr/shanghainese-tts)**|
705 | |**2023-07-29**|**METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer**|Xinfa Zhu et.al.|[2307.15951](http://arxiv.org/abs/2307.15951)|null|
706 | |**2023-12-18**|**Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding**|Chunyu Qiang et.al.|[2307.15484](http://arxiv.org/abs/2307.15484)|null|
707 | |**2023-07-20**|**SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer**|Daegyeom Kim et.al.|[2307.10550](http://arxiv.org/abs/2307.10550)|**[link](https://github.com/0913ktg/sc_vall-e)**|
708 | |**2023-07-18**|**SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs**|Yinghao Aaron Li et.al.|[2307.09435](http://arxiv.org/abs/2307.09435)|null|
709 | |**2023-09-28**|**Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts**|Ziyue Jiang et.al.|[2307.07218](http://arxiv.org/abs/2307.07218)|null|
710 | |**2023-07-13**|**Controllable Emphasis with zero data for text-to-speech**|Arnaud Joly et.al.|[2307.07062](http://arxiv.org/abs/2307.07062)|null|
711 | |**2023-07-11**|**On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis**|Siyang Wang et.al.|[2307.05132](http://arxiv.org/abs/2307.05132)|null|
712 | |**2023-07-10**|**The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task**|Kun Song et.al.|[2307.04630](http://arxiv.org/abs/2307.04630)|null|
713 | |**2023-10-07**|**ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading**|Yujia Xiao et.al.|[2307.00782](http://arxiv.org/abs/2307.00782)|null|
714 | |**2023-06-28**|**EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech**|Daria Diatlova et.al.|[2307.00024](http://arxiv.org/abs/2307.00024)|**[link](https://github.com/deepvk/emospeech)**|
715 | |**2023-06-29**|**High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units**|Junchen Lu et.al.|[2306.17005](http://arxiv.org/abs/2306.17005)|null|
716 | |**2023-06-28**|**UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data**|Heeseung Kim et.al.|[2306.16083](http://arxiv.org/abs/2306.16083)|**[link](https://github.com/gmltmd789/UnitSpeech)**|
717 | |**2023-10-19**|**Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale**|Matthew Le et.al.|[2306.15687](http://arxiv.org/abs/2306.15687)|null|
718 | |**2023-06-27**|**GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech**|Yahuan Cong et.al.|[2306.15304](http://arxiv.org/abs/2306.15304)|null|
719 | |**2023-06-25**|**DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech**|Sen Liu et.al.|[2306.14145](http://arxiv.org/abs/2306.14145)|null|
720 | |**2023-06-21**|**Visual-Aware Text-to-Speech**|Mohan Zhou et.al.|[2306.12020](http://arxiv.org/abs/2306.12020)|null|
721 | |**2023-06-21**|**Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer**|Jakub Swiatkowski et.al.|[2306.11662](http://arxiv.org/abs/2306.11662)|null|
722 | |**2023-06-16**|**Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation**|Kishor Kayyar Lakshminarayana et.al.|[2306.10152](http://arxiv.org/abs/2306.10152)|null|
723 | |**2023-06-16**|**CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages**|Frederico S. Oliveira et.al.|[2306.10097](http://arxiv.org/abs/2306.10097)|null|
724 | |**2023-06-14**|**Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation**|Zheng Liang et.al.|[2306.08588](http://arxiv.org/abs/2306.08588)|null|
725 | |**2023-06-14**|**Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects**|Xinghua Qu et.al.|[2306.08219](http://arxiv.org/abs/2306.08219)|**[link](https://github.com/hyllll/vcrs)**|
726 | |**2023-11-20**|**StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models**|Yinghao Aaron Li et.al.|[2306.07691](http://arxiv.org/abs/2306.07691)|null|
727 | |**2024-01-18**|**UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding**|Chenpeng Du et.al.|[2306.07547](http://arxiv.org/abs/2306.07547)|null|
728 | |**2023-06-13**|**PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling**|Ji-Sang Hwang et.al.|[2306.07489](http://arxiv.org/abs/2306.07489)|null|
729 | |**2023-06-09**|**Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech**|Shijun Wang et.al.|[2306.05709](http://arxiv.org/abs/2306.05709)|null|
730 | |**2023-06-08**|**VIFS: An End-to-End Variational Inference for Foley Sound Synthesis**|Junhyeok Lee et.al.|[2306.05004](http://arxiv.org/abs/2306.05004)|**[link](https://github.com/junjun3518/vifs)**|
731 | |**2023-07-11**|**Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge**|Wenhao Guan et.al.|[2306.04301](http://arxiv.org/abs/2306.04301)|null|
732 | |**2023-06-06**|**Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias**|Ziyue Jiang et.al.|[2306.03509](http://arxiv.org/abs/2306.03509)|null|
733 | |**2023-08-02**|**Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis**|Zhenhui Ye et.al.|[2306.03504](http://arxiv.org/abs/2306.03504)|null|
734 | |**2023-06-05**|**Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis**|Dengfeng Ke et.al.|[2306.02593](http://arxiv.org/abs/2306.02593)|null|
735 | |**2023-06-05**|**Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model**|Hoyeon Lee et.al.|[2306.02579](http://arxiv.org/abs/2306.02579)|null|
736 | |**2023-06-05**|**Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming**|Xinlei Niu et.al.|[2306.02568](http://arxiv.org/abs/2306.02568)|**[link](https://github.com/Berthaniu/LatentOptimalPathsBayesianDP)**|
737 | |**2023-06-02**|**Towards Robust FastSpeech 2 by Modelling Residual Multimodality**|Fabian Kögel et.al.|[2306.01442](http://arxiv.org/abs/2306.01442)|**[link](https://github.com/sony/ai-research-code)**|
738 | |**2023-05-30**|**Towards Selection of Text-to-speech Data to Augment ASR Training**|Shuo Liu et.al.|[2306.00998](http://arxiv.org/abs/2306.00998)|null|
739 | |**2023-06-01**|**EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis**|Haobin Tang et.al.|[2306.00648](http://arxiv.org/abs/2306.00648)|null|
740 | |**2023-06-01**|**The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech**|Phat Do et.al.|[2306.00535](http://arxiv.org/abs/2306.00535)|null|
741 | |**2023-05-31**|**Text-to-Speech Pipeline for Swiss German -- A comparison**|Tobias Bollinger et.al.|[2305.19750](http://arxiv.org/abs/2305.19750)|null|
742 | |**2023-05-31**|**XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech**|Linh The Nguyen et.al.|[2305.19709](http://arxiv.org/abs/2305.19709)|**[link](https://github.com/vinairesearch/xphonebert)**|
743 | |**2023-06-01**|**PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions**|Guanghou Liu et.al.|[2305.19522](http://arxiv.org/abs/2305.19522)|null|
744 | |**2023-05-30**|**Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages**|Phat Do et.al.|[2305.19396](http://arxiv.org/abs/2305.19396)|null|
745 | |**2023-05-30**|**Make-A-Voice: Unified Voice Synthesis With Discrete Representation**|Rongjie Huang et.al.|[2305.19269](http://arxiv.org/abs/2305.19269)|null|
746 | |**2023-05-30**|**STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions**|Michel Plüss et.al.|[2305.18855](http://arxiv.org/abs/2305.18855)|null|
747 | |**2023-05-30**|**LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus**|Yuma Koizumi et.al.|[2305.18802](http://arxiv.org/abs/2305.18802)|null|
748 | |**2023-10-09**|**An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization**|Fei Kong et.al.|[2305.18355](http://arxiv.org/abs/2305.18355)|**[link](https://github.com/kong13661/pia)**|
749 | |**2023-05-29**|**ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation**|Ambuj Mehrish et.al.|[2305.18028](http://arxiv.org/abs/2305.18028)|**[link](https://github.com/declare-lab/adapter-mix)**|
750 | |**2023-05-29**|**Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis**|Erik Ekstedt et.al.|[2305.17971](http://arxiv.org/abs/2305.17971)|null|
751 | |**2023-07-25**|**StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation**|Kun Song et.al.|[2305.17732](http://arxiv.org/abs/2305.17732)|null|
752 | |**2023-05-28**|**Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS**|Sewade Ogun et.al.|[2305.17724](http://arxiv.org/abs/2305.17724)|**[link](https://github.com/ogunlao/glowtts_stdp)**|
753 | |**2023-07-19**|**Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing**|Julia Kaiwen Lau et.al.|[2305.17445](http://arxiv.org/abs/2305.17445)|**[link](https://github.com/julianyonghao/fainasrtest)**|
754 | |**2023-05-26**|**DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction**|Vineet Bhat et.al.|[2305.16957](http://arxiv.org/abs/2305.16957)|null|
755 | |**2023-05-25**|**Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion**|Rui Liu et.al.|[2305.16353](http://arxiv.org/abs/2305.16353)|**[link](https://github.com/ai-s2-lab/m2s-add)**|
756 | |**2023-05-22**|**Text Generation with Speech Synthesis for ASR Data Augmentation**|Zhuangqun Huang et.al.|[2305.16333](http://arxiv.org/abs/2305.16333)|null|
757 | |**2023-05-25**|**VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation**|Tianrui Wang et.al.|[2305.16107](http://arxiv.org/abs/2305.16107)|null|
758 | |**2023-05-25**|**Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration**|Rustem Yeshpanov et.al.|[2305.15749](http://arxiv.org/abs/2305.15749)|**[link](https://github.com/is2ai/turkictts)**|
759 | |**2024-02-05**|**LAraBench: Benchmarking Arabic AI with Large Language Models**|Ahmed Abdelali et.al.|[2305.14982](http://arxiv.org/abs/2305.14982)|null|
760 | |**2023-05-23**|**EfficientSpeech: An On-Device Text to Speech Model**|Rowel Atienza et.al.|[2305.13905](http://arxiv.org/abs/2305.13905)|**[link](https://github.com/roatienza/efficientspeech)**|
761 | |**2023-05-23**|**ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models**|Minki Kang et.al.|[2305.13831](http://arxiv.org/abs/2305.13831)|null|
762 | |**2023-05-22**|**U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech**|Xin Jing et.al.|[2305.13195](http://arxiv.org/abs/2305.13195)|null|
763 | |**2023-05-25**|**EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels**|Kari Ali Noriy et.al.|[2305.13137](http://arxiv.org/abs/2305.13137)|**[link](https://github.com/knoriy/emns-dct)**|
764 | |**2023-05-22**|**ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer**|Huadai Liu et.al.|[2305.12708](http://arxiv.org/abs/2305.12708)|null|
765 | |**2023-05-21**|**VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages**|Shivam Mhaskar et.al.|[2305.12518](http://arxiv.org/abs/2305.12518)|null|
766 | |**2023-05-26**|**Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus**|Detai Xin et.al.|[2305.12442](http://arxiv.org/abs/2305.12442)|**[link](https://github.com/aria-k-alethia/laughter-synthesis)**|
767 | |**2023-05-20**|**ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios**|Yuyue Wang et.al.|[2305.12200](http://arxiv.org/abs/2305.12200)|null|
768 | |**2023-05-19**|**MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting**|Neil Shah et.al.|[2305.11926](http://arxiv.org/abs/2305.11926)|null|
769 | |**2024-02-20**|**Data Redaction from Conditional Generative Models**|Zhifeng Kong et.al.|[2305.11351](http://arxiv.org/abs/2305.11351)|null|
770 | |**2023-05-18**|**Parameter-Efficient Learning for Text-to-Speech Accent Adaptation**|Li-Jen Yang et.al.|[2305.11320](http://arxiv.org/abs/2305.11320)|**[link](https://github.com/TTS-Research/PEL-TTS)**|
771 | |**2023-05-19**|**Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation**|Martijn Bartelds et.al.|[2305.10951](http://arxiv.org/abs/2305.10951)|**[link](https://github.com/bartelds/asr-augmentation)**|
772 | |**2023-09-30**|**Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data**|Yusheng Tian et.al.|[2305.10891](http://arxiv.org/abs/2305.10891)|**[link](https://github.com/dmse4tts/dmse4tts)**|
773 | |**2023-05-18**|**FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs**|Won Jang et.al.|[2305.10823](http://arxiv.org/abs/2305.10823)|null|
774 | |**2023-05-18**|**CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training**|Zhenhui Ye et.al.|[2305.10763](http://arxiv.org/abs/2305.10763)|null|
775 | |**2023-08-29**|**A unified front-end framework for English text-to-speech synthesis**|Zelin Ying et.al.|[2305.10666](http://arxiv.org/abs/2305.10666)|null|
776 | |**2023-09-19**|**Controllable Speaking Styles Using a Large Language Model**|Atli Thor Sigurgeirsson et.al.|[2305.10321](http://arxiv.org/abs/2305.10321)|null|
777 | |**2023-05-23**|**Better speech synthesis through scaling**|James Betker et.al.|[2305.07243](http://arxiv.org/abs/2305.07243)|**[link](https://github.com/neonbjb/tortoise-tts)**|
778 | |**2023-10-29**|**CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model**|Zhen Ye et.al.|[2305.06908](http://arxiv.org/abs/2305.06908)|**[link](https://github.com/zhenye234/CoMoSpeech)**|
779 | |**2023-05-08**|**Accented Text-to-Speech Synthesis with Limited Data**|Xuehao Zhou et.al.|[2305.04816](http://arxiv.org/abs/2305.04816)|null|
780 | |**2023-05-03**|**M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis**|Jinlong Xue et.al.|[2305.02269](http://arxiv.org/abs/2305.02269)|null|
781 | |**2023-05-30**|**A Review of Deep Learning Techniques for Speech Processing**|Ambuj Mehrish et.al.|[2305.00359](http://arxiv.org/abs/2305.00359)|null|
782 | |**2023-04-26**|**Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis**|Ye-Xin Lu et.al.|[2304.13270](http://arxiv.org/abs/2304.13270)|null|
783 | |**2023-04-25**|**Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge**|Chenpeng Du et.al.|[2304.13121](http://arxiv.org/abs/2304.13121)|null|
784 | |**2023-04-24**|**Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model**|Kenichi Fujita et.al.|[2304.11976](http://arxiv.org/abs/2304.11976)|null|
785 | |**2023-04-23**|**DiffVoice: Text-to-Speech with Latent Diffusion**|Zhijun Liu et.al.|[2304.11750](http://arxiv.org/abs/2304.11750)|null|
786 | |**2023-04-23**|**SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model**|Jianzong Wang et.al.|[2304.11547](http://arxiv.org/abs/2304.11547)|null|
787 | |**2023-05-30**|**NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers**|Kai Shen et.al.|[2304.09116](http://arxiv.org/abs/2304.09116)|null|
788 | |**2023-04-16**|**A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers**|Juan Zuluaga-Gomez et.al.|[2304.07842](http://arxiv.org/abs/2304.07842)|null|
789 | |**2023-04-13**|**Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis**|Shun Lei et.al.|[2304.06359](http://arxiv.org/abs/2304.06359)|null|
790 | |**2023-04-10**|**Enhancing Speech-to-Speech Translation with Multiple TTS Targets**|Jiatong Shi et.al.|[2304.04618](http://arxiv.org/abs/2304.04618)|null|
791 | |**2023-04-07**|**ArmanTTS single-speaker Persian dataset**|Mohammd Hasan Shamgholi et.al.|[2304.03585](http://arxiv.org/abs/2304.03585)|null|
792 | |**2023-04-03**|**Ensemble prosody prediction for expressive speech synthesis**|Tian Huey Teh et.al.|[2304.00714](http://arxiv.org/abs/2304.00714)|null|
793 | |**2023-03-29**|**AraSpot: Arabic Spoken Command Spotting**|Mahmoud Salhab et.al.|[2303.16621](http://arxiv.org/abs/2303.16621)|**[link](https://github.com/msalhab96/araspot)**|
794 | |**2023-03-28**|**Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages**|Seongyeon Park et.al.|[2303.15669](http://arxiv.org/abs/2303.15669)|**[link](https://github.com/cnaigithub/SpeechDewarping)**|
795 | |**2023-03-27**|**Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis**|Karren Yang et.al.|[2303.14885](http://arxiv.org/abs/2303.14885)|null|
796 | |**2023-03-24**|**Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis**|Takuhiro Kaneko et.al.|[2303.13909](http://arxiv.org/abs/2303.13909)|null|
797 | |**2023-04-02**|**A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI**|Chenshuang Zhang et.al.|[2303.13336](http://arxiv.org/abs/2303.13336)|null|
798 | |**2023-03-20**|**Code-Switching Text Generation and Injection in Mandarin-English ASR**|Haibin Yu et.al.|[2303.10949](http://arxiv.org/abs/2303.10949)|null|
799 | |**2023-03-14**|**Controlling High-Dimensional Data With Sparse Input**|Dan Andrei Iliescu et.al.|[2303.09446](http://arxiv.org/abs/2303.09446)|null|
800 | |**2023-03-09**|**Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports**|Hyunseung Chung et.al.|[2303.09395](http://arxiv.org/abs/2303.09395)|**[link](https://github.com/tclife/text_to_ecg)**|
801 | |**2023-03-15**|**Cross-speaker Emotion Transfer by Manipulating Speech Style Latents**|Suhee Jo et.al.|[2303.08329](http://arxiv.org/abs/2303.08329)|null|
802 | |**2023-03-14**|**QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis**|Haobin Tang et.al.|[2303.07682](http://arxiv.org/abs/2303.07682)|null|
803 | |**2023-03-10**|**An End-to-End Neural Network for Image-to-Audio Transformation**|Liu Chen et.al.|[2303.06078](http://arxiv.org/abs/2303.06078)|null|
804 | |**2023-03-09**|**Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation**|Qi Chen et.al.|[2303.05322](http://arxiv.org/abs/2303.05322)|**[link](https://github.com/moon0316/t2a)**|
805 | |**2023-03-07**|**Do Prosody Transfer Models Transfer Prosody?**|Atli Thor Sigurgeirsson et.al.|[2303.04289](http://arxiv.org/abs/2303.04289)|null|
806 | |**2023-03-07**|**Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling**|Ziqiang Zhang et.al.|[2303.03926](http://arxiv.org/abs/2303.03926)|null|
807 | |**2023-03-02**|**Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding**|Yingting Li et.al.|[2303.03267](http://arxiv.org/abs/2303.03267)|**[link](https://github.com/declare-lab/speech-adapters)**|
808 | |**2023-03-08**|**FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model**|Ruiqing Xue et.al.|[2303.02939](http://arxiv.org/abs/2303.02939)|null|
809 | |**2023-08-14**|**Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations**|Yuma Koizumi et.al.|[2303.01664](http://arxiv.org/abs/2303.01664)|null|
810 | |**2023-03-11**|**Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities**|Shijun Wang et.al.|[2303.01508](http://arxiv.org/abs/2303.01508)|null|
811 | |**2023-12-17**|**ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations**|Neil Shah et.al.|[2303.01261](http://arxiv.org/abs/2303.01261)|null|
812 | |**2023-03-02**|**LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion**|Chunfeng Wang et.al.|[2303.01086](http://arxiv.org/abs/2303.01086)|null|
813 | |**2023-03-02**|**Leveraging Large Text Corpora for End-to-End Speech Summarization**|Kohei Matsuura et.al.|[2303.00978](http://arxiv.org/abs/2303.00978)|null|
814 | |**2023-03-01**|**DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction**|Raviteja Anantha et.al.|[2303.00171](http://arxiv.org/abs/2303.00171)|null|
815 | |**2023-02-28**|**ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus**|Ajinkya Kulkarni et.al.|[2303.00069](http://arxiv.org/abs/2303.00069)|null|
816 | |**2023-02-28**|**Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners**|Jocelyn Huang et.al.|[2302.14523](http://arxiv.org/abs/2302.14523)|null|
817 | |**2023-06-12**|**CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis**|Ji-Hoon Kim et.al.|[2302.14370](http://arxiv.org/abs/2302.14370)|null|
818 | |**2023-05-19**|**UniFLG: Unified Facial Landmark Generator from Text or Speech**|Kentaro Mitsui et.al.|[2302.14337](http://arxiv.org/abs/2302.14337)|null|
819 | |**2023-02-27**|**Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech**|Jiyoung Lee et.al.|[2302.13700](http://arxiv.org/abs/2302.13700)|**[link](https://github.com/naver-ai/facetts)**|
820 | |**2023-02-27**|**Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech**|Dong Yang et.al.|[2302.13652](http://arxiv.org/abs/2302.13652)|null|
821 | |**2023-02-27**|**Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow**|Yoonhyung Lee et.al.|[2302.13458](http://arxiv.org/abs/2302.13458)|null|
822 | |**2023-06-06**|**PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS**|Junhyeok Lee et.al.|[2302.12391](http://arxiv.org/abs/2302.12391)|**[link](https://github.com/anonymous-pits/pits)**|
823 | |**2023-02-21**|**Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition**|Leyuan Qu et.al.|[2302.09723](http://arxiv.org/abs/2302.09723)|null|
824 | |**2023-02-23**|**QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion**|Houjian Guo et.al.|[2302.08296](http://arxiv.org/abs/2302.08296)|**[link](https://github.com/quickvc/quickvoice-conversion)**|
825 | |**2023-02-13**|**Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages**|Sudhanshu Srivastava et.al.|[2302.06227](http://arxiv.org/abs/2302.06227)|null|
826 | |**2023-02-08**|**A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech**|Li-Wei Chen et.al.|[2302.04215](http://arxiv.org/abs/2302.04215)|**[link](https://github.com/b04901014/mqtts)**|
827 | |**2023-02-07**|**Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision**|Eugene Kharitonov et.al.|[2302.03540](http://arxiv.org/abs/2302.03540)|null|
828 | |**2023-02-15**|**MAC: A unified framework boosting low resource automatic speech recognition**|Zeping Min et.al.|[2302.03498](http://arxiv.org/abs/2302.03498)|null|
829 | |**2023-06-25**|**InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt**|Dongchao Yang et.al.|[2301.13662](http://arxiv.org/abs/2301.13662)|**[link](https://github.com/yangdongchao/academicodec)**|
830 | |**2023-03-01**|**UzbekTagger: The rule-based POS tagger for Uzbek language**|Maksud Sharipov et.al.|[2301.12711](http://arxiv.org/abs/2301.12711)|null|
831 | |**2023-05-27**|**Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining**|Takaaki Saeki et.al.|[2301.12596](http://arxiv.org/abs/2301.12596)|**[link](https://github.com/takaaki-saeki/zm-text-tts)**|
832 | |**2023-01-31**|**Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker**|Navjot Kaur et.al.|[2301.12331](http://arxiv.org/abs/2301.12331)|**[link](https://github.com/chocobearz/speech_timing)**|
833 | |**2023-01-26**|**On granularity of prosodic representations in expressive text-to-speech**|Mikolaj Babianski et.al.|[2301.11446](http://arxiv.org/abs/2301.11446)|null|
834 | |**2023-01-26**|**Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study**|Massa Baali et.al.|[2301.09099](http://arxiv.org/abs/2301.09099)|**[link](https://github.com/espnet/espnet/tree/master/egs2/qasr_tts/tts1)**|
835 | |**2023-01-20**|**Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions**|Yinghao Aaron Li et.al.|[2301.08810](http://arxiv.org/abs/2301.08810)|null|
836 | |**2023-01-11**|**Modelling low-resource accents without accent-specific TTS frontend**|Georgi Tinchev et.al.|[2301.04606](http://arxiv.org/abs/2301.04606)|null|
837 | |**2022-12-11**|**BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm**|Yu-Wen Chen et.al.|[2301.04120](http://arxiv.org/abs/2301.04120)|**[link](https://github.com/yuwchen/baspro)**|
838 | |**2023-01-10**|**UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion**|Haogeng Liu et.al.|[2301.03801](http://arxiv.org/abs/2301.03801)|null|
839 | |**2023-01-10**|**Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation**|Abdullah Shahid et.al.|[2301.03751](http://arxiv.org/abs/2301.03751)|null|
840 | |**2023-09-19**|**Applying Automated Machine Translation to Educational Video Courses**|Linden Wang et.al.|[2301.03141](http://arxiv.org/abs/2301.03141)|null|
841 | |**2023-01-06**|**Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition**|David M. Chan et.al.|[2301.02736](http://arxiv.org/abs/2301.02736)|null|
842 | |**2023-01-05**|**Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers**|Chengyi Wang et.al.|[2301.02111](http://arxiv.org/abs/2301.02111)|**[link](https://github.com/microsoft/unilm)**|
843 | |**2022-12-11**|**MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset**|Kailin Liang et.al.|[2301.00657](http://arxiv.org/abs/2301.00657)|**[link](https://github.com/ssmlkl/mntts2)**|
844 | |**2022-12-30**|**ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech**|Zehua Chen et.al.|[2212.14518](http://arxiv.org/abs/2212.14518)|null|
845 | |**2022-12-29**|**StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models**|Yinghao Aaron Li et.al.|[2212.14227](http://arxiv.org/abs/2212.14227)|**[link](https://github.com/yl4579/StyleTTS-VC)**|
846 | |**2022-12-22**|**HMM-based data augmentation for E2E systems for building conversational speech synthesis systems**|Ishika Gupta et.al.|[2212.11982](http://arxiv.org/abs/2212.11982)|null|
847 | |**2022-12-21**|**ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement**|Wei-Ning Hsu et.al.|[2212.11377](http://arxiv.org/abs/2212.11377)|null|
848 | |**2022-12-20**|**TTS-Guided Training for Accent Conversion Without Parallel Data**|Yi Zhou et.al.|[2212.10204](http://arxiv.org/abs/2212.10204)|null|
849 | |**2023-06-28**|**Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling**|Tuomo Raitio et.al.|[2212.10075](http://arxiv.org/abs/2212.10075)|null|
850 | |**2022-12-16**|**Speech Aware Dialog System Technology Challenge (DSTC11)**|Hagen Soltau et.al.|[2212.08704](http://arxiv.org/abs/2212.08704)|null|
851 | |**2022-12-16**|**Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder**|Yusuke Yasuda et.al.|[2212.08329](http://arxiv.org/abs/2212.08329)|null|
852 | |**2022-12-16**|**Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language**|Yusuke Yasuda et.al.|[2212.08321](http://arxiv.org/abs/2212.08321)|null|
853 | |**2022-12-15**|**RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis**|Shinhyeok Oh et.al.|[2212.07939](http://arxiv.org/abs/2212.07939)|**[link](https://github.com/shinhyeokoh/rwen)**|
854 | |**2022-12-14**|**Probing Deep Speaker Embeddings for Speaker-related Tasks**|Zifeng Zhao et.al.|[2212.07068](http://arxiv.org/abs/2212.07068)|null|
855 | |**2022-12-08**|**SpeechLMScore: Evaluating speech generation using speech language model**|Soumi Maiti et.al.|[2212.04559](http://arxiv.org/abs/2212.04559)|**[link](https://github.com/espnet/espnet)**|
856 | |**2023-04-04**|**Learning to Dub Movies via Hierarchical Prosody Models**|Gaoxiang Cong et.al.|[2212.04054](http://arxiv.org/abs/2212.04054)|**[link](https://github.com/galaxycong/hpmdubbing)**|
857 | |**2022-12-07**|**Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning**|Ankur Debnath et.al.|[2212.03558](http://arxiv.org/abs/2212.03558)|null|
858 | |**2022-12-07**|**Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue**|Daxin Tan et.al.|[2212.03398](http://arxiv.org/abs/2212.03398)|null|
859 | |**2022-12-06**|**UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis**|Yi Lei et.al.|[2212.01546](http://arxiv.org/abs/2212.01546)|null|
860 | |**2022-11-30**|**SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech**|Byoung Jin Choi et.al.|[2211.16866](http://arxiv.org/abs/2211.16866)|null|
861 | |**2022-11-29**|**Controllable speech synthesis by learning discrete phoneme-level prosodic representations**|Nikolaos Ellinas et.al.|[2211.16307](http://arxiv.org/abs/2211.16307)|null|
862 | |**2023-05-25**|**Evaluating and reducing the distance between synthetic and real speech distributions**|Christoph Minixhofer et.al.|[2211.16049](http://arxiv.org/abs/2211.16049)|null|
863 | |**2022-11-26**|**Contextual Expressive Text-to-Speech**|Jianhong Tu et.al.|[2211.14548](http://arxiv.org/abs/2211.14548)|null|
864 | |**2022-12-05**|**Efficient Incremental Text-to-Speech on GPUs**|Muyang Du et.al.|[2211.13939](http://arxiv.org/abs/2211.13939)|null|
865 | |**2023-03-21**|**Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?**|Xuan Shi et.al.|[2211.13868](http://arxiv.org/abs/2211.13868)|**[link](https://github.com/nii-yamagishilab/midi-to-audio)**|
866 | |**2022-11-23**|**IMaSC -- ICFOSS Malayalam Speech Corpus**|Deepa P Gopinath et.al.|[2211.12796](http://arxiv.org/abs/2211.12796)|null|
867 | |**2022-11-22**|**PromptTTS: Controllable Text-to-Speech with Text Descriptions**|Zhifang Guo et.al.|[2211.12171](http://arxiv.org/abs/2211.12171)|null|
868 | |**2022-11-04**|**Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech**|Xin Zhang et.al.|[2211.09731](http://arxiv.org/abs/2211.09731)|null|
869 | |**2023-02-17**|**Towards Building Text-To-Speech Systems for the Next Billion Users**|Gokul Karthik Kumar et.al.|[2211.09536](http://arxiv.org/abs/2211.09536)|**[link](https://github.com/gokulkarthik/text2speech)**|
870 | |**2023-02-16**|**EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance**|Yiwei Guo et.al.|[2211.09496](http://arxiv.org/abs/2211.09496)|null|
871 | |**2022-11-17**|**Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation**|Chunyu Qiang et.al.|[2211.09495](http://arxiv.org/abs/2211.09495)|null|
872 | |**2022-11-17**|**NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis**|Hyeong-Seok Choi et.al.|[2211.09407](http://arxiv.org/abs/2211.09407)|null|
873 | |**2023-03-14**|**Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models**|Minki Kang et.al.|[2211.09383](http://arxiv.org/abs/2211.09383)|null|
874 | |**2023-01-04**|**Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation**|Xin Yuan et.al.|[2211.09365](http://arxiv.org/abs/2211.09365)|null|
875 | |**2022-11-14**|**SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech**|Perry Lam et.al.|[2211.07283](http://arxiv.org/abs/2211.07283)|null|
876 | |**2023-05-24**|**Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing**|Jacob J Webber et.al.|[2211.06989](http://arxiv.org/abs/2211.06989)|null|
877 | |**2023-05-29**|**OverFlow: Putting flows on top of neural transducers for better TTS**|Shivam Mehta et.al.|[2211.06892](http://arxiv.org/abs/2211.06892)|**[link](https://github.com/coqui-ai/TTS)**|
878 | |**2023-05-29**|**Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations**|Yoori Oh et.al.|[2211.06160](http://arxiv.org/abs/2211.06160)|null|
879 | |**2022-12-04**|**ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech**|Xiaoran Fan et.al.|[2211.03545](http://arxiv.org/abs/2211.03545)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**|
880 | |**2022-11-07**|**Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder**|Jan Melechovsky et.al.|[2211.03316](http://arxiv.org/abs/2211.03316)|**[link](https://github.com/dapwner/cvae-tacotron)**|
881 | |**2022-11-06**|**Parallel Attention Forcing for Machine Translation**|Qingyun Dou et.al.|[2211.03237](http://arxiv.org/abs/2211.03237)|null|
882 | |**2022-11-06**|**An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space**|Jihwan Lee et.al.|[2211.03078](http://arxiv.org/abs/2211.03078)|null|
883 | |**2022-11-04**|**NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS**|Dongchao Yang et.al.|[2211.02448](http://arxiv.org/abs/2211.02448)|null|
884 | |**2022-11-04**|**Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts**|Detai Xin et.al.|[2211.02336](http://arxiv.org/abs/2211.02336)|null|
885 | |**2023-04-16**|**Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS**|Ziqi Liang et.al.|[2211.01948](http://arxiv.org/abs/2211.01948)|null|
886 | |**2022-11-01**|**Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages**|Anusha Prakash et.al.|[2211.01338](http://arxiv.org/abs/2211.01338)|null|
887 | |**2023-05-28**|**DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP**|Kun Song et.al.|[2211.01087](http://arxiv.org/abs/2211.01087)|null|
888 | |**2022-11-22**|**Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement**|Wei Song et.al.|[2211.00967](http://arxiv.org/abs/2211.00967)|null|
889 | |**2022-11-01**|**Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers**|Cheng-Ping Hsieh et.al.|[2211.00585](http://arxiv.org/abs/2211.00585)|**[link](https://github.com/NVIDIA/NeMo)**|
890 | |**2023-06-11**|**Generating Multilingual Gender-Ambiguous Text-to-Speech Voices**|Konstantinos Markopoulos et.al.|[2211.00375](http://arxiv.org/abs/2211.00375)|null|
891 | |**2023-05-07**|**Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features**|Alexandra Vioni et.al.|[2211.00342](http://arxiv.org/abs/2211.00342)|null|
892 | |**2022-11-02**|**Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS**|Kun Song et.al.|[2210.17349](http://arxiv.org/abs/2210.17349)|null|
893 | |**2024-02-27**|**Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation**|Nikolaos Ellinas et.al.|[2210.17264](http://arxiv.org/abs/2210.17264)|null|
894 | |**2022-10-31**|**Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection**|Luigi Attorresi et.al.|[2210.17222](http://arxiv.org/abs/2210.17222)|null|
895 | |**2022-10-31**|**Structured State Space Decoder for Speech Recognition and Synthesis**|Koichi Miyazaki et.al.|[2210.17098](http://arxiv.org/abs/2210.17098)|null|
896 | |**2022-10-28**|**Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders**|Jason Fong et.al.|[2210.16045](http://arxiv.org/abs/2210.16045)|null|
897 | |**2023-02-21**|**Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform**|Masaya Kawamura et.al.|[2210.15975](http://arxiv.org/abs/2210.15975)|**[link](https://github.com/masayakawamura/mb-istft-vits)**|
898 | |**2023-02-22**|**Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis**|Yuma Shirahata et.al.|[2210.15964](http://arxiv.org/abs/2210.15964)|null|
899 | |**2022-10-28**|**Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation**|Nobuyuki Morioka et.al.|[2210.15868](http://arxiv.org/abs/2210.15868)|null|
900 | |**2023-03-15**|**Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech**|Takaaki Saeki et.al.|[2210.15447](http://arxiv.org/abs/2210.15447)|null|
901 | |**2022-10-27**|**Explicit Intensity Control for Accented Text-to-speech**|Rui Liu et.al.|[2210.15364](http://arxiv.org/abs/2210.15364)|null|
902 | |**2022-10-27**|**FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis**|Yifan Hu et.al.|[2210.15360](http://arxiv.org/abs/2210.15360)|**[link](https://github.com/walker-hyf/fctalker)**|
903 | |**2022-10-26**|**Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection**|Kentaro Seki et.al.|[2210.14850](http://arxiv.org/abs/2210.14850)|null|
904 | |**2022-10-25**|**Semi-Supervised Learning Based on Reference Model for Low-resource TTS**|Xulong Zhang et.al.|[2210.14723](http://arxiv.org/abs/2210.14723)|null|
905 | |**2022-10-26**|**Cover Reproducible Steganography via Deep Generative Models**|Kejiang Chen et.al.|[2210.14632](http://arxiv.org/abs/2210.14632)|null|
906 | |**2022-10-26**|**Improving Speech-to-Speech Translation Through Unlabeled Text**|Xuan-Phi Nguyen et.al.|[2210.14514](http://arxiv.org/abs/2210.14514)|null|
907 | |**2022-10-26**|**The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge**|Yuhao Liang et.al.|[2210.14448](http://arxiv.org/abs/2210.14448)|null|
908 | |**2022-10-25**|**Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data**|Xulong Zhang et.al.|[2210.13803](http://arxiv.org/abs/2210.13803)|null|
909 | |**2023-09-17**|**HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**|Chunhui Wang et.al.|[2210.12740](http://arxiv.org/abs/2210.12740)|null|
910 | |**2022-10-21**|**Low-Resource Multilingual and Zero-Shot Multispeaker TTS**|Florian Lux et.al.|[2210.12223](http://arxiv.org/abs/2210.12223)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
911 | |**2022-10-21**|**Adaptive re-calibration of channel-wise features for Adversarial Audio Classification**|Vardhan Dongre et.al.|[2210.11722](http://arxiv.org/abs/2210.11722)|null|
912 | |**2022-10-20**|**Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS**|Chunyu Qiang et.al.|[2210.11429](http://arxiv.org/abs/2210.11429)|null|
913 | |**2022-10-17**|**Towards Relation Extraction From Speech**|Tongtong Wu et.al.|[2210.08759](http://arxiv.org/abs/2210.08759)|**[link](https://github.com/wutong8023/speechre)**|
914 | |**2023-02-08**|**Generating Synthetic Speech from SpokenVocab for Speech Translation**|Jinming Zhao et.al.|[2210.08174](http://arxiv.org/abs/2210.08174)|**[link](https://github.com/mingzi151/spokenvocab)**|
915 | |**2022-10-17**|**LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge**|Yan Jia et.al.|[2210.07749](http://arxiv.org/abs/2210.07749)|null|
916 | |**2022-10-20**|**Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy**|Sarina Meyer et.al.|[2210.07002](http://arxiv.org/abs/2210.07002)|**[link](https://github.com/digitalphonetics/speaker-anonymization)**|
917 | |**2022-10-13**|**Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar**|Aolan Sun et.al.|[2210.06877](http://arxiv.org/abs/2210.06877)|null|
918 | |**2022-10-12**|**Can we use Common Voice to train a Multi-Speaker TTS system?**|Sewade Ogun et.al.|[2210.06370](http://arxiv.org/abs/2210.06370)|null|
919 | |**2023-06-01**|**SQuId: Measuring Speech Naturalness in Many Languages**|Thibault Sellam et.al.|[2210.06324](http://arxiv.org/abs/2210.06324)|null|
920 | |**2022-11-22**|**Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech**|Byoung Jin Choi et.al.|[2210.05979](http://arxiv.org/abs/2210.05979)|null|
921 | |**2022-10-06**|**An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era**|Andreas Triantafyllopoulos et.al.|[2210.03538](http://arxiv.org/abs/2210.03538)|null|
922 | |**2022-09-29**|**Facial Landmark Predictions with Applications to Metaverse**|Qiao Han et.al.|[2209.14698](http://arxiv.org/abs/2209.14698)|**[link](https://github.com/sweatybridge/text-to-anime)**|
923 | |**2022-09-26**|**Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech**|Yusuke Nakai et.al.|[2209.12549](http://arxiv.org/abs/2209.12549)|null|
924 | |**2022-09-22**|**EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models**|Perry Lam et.al.|[2209.10890](http://arxiv.org/abs/2209.10890)|null|
925 | |**2022-09-22**|**MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline**|Yifan Hu et.al.|[2209.10848](http://arxiv.org/abs/2209.10848)|**[link](https://github.com/walker-hyf/mntts)**|
926 | |**2022-09-22**|**Controllable Accented Text-to-Speech Synthesis**|Rui Liu et.al.|[2209.10804](http://arxiv.org/abs/2209.10804)|null|
927 | |**2022-09-16**|**TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection**|Davide Salvi et.al.|[2209.08000](http://arxiv.org/abs/2209.08000)|null|
928 | |**2022-09-14**|**Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset**|Michael Chinen et.al.|[2209.06358](http://arxiv.org/abs/2209.06358)|null|
929 | |**2022-09-08**|**SANIP: Shopping Assistant and Navigation for the visually impaired**|Shubham Deshmukh et.al.|[2209.03570](http://arxiv.org/abs/2209.03570)|null|
930 | |**2022-09-07**|**Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech**|Huu-Tien Dang et.al.|[2209.02971](http://arxiv.org/abs/2209.02971)|null|
931 | |**2022-09-02**|**Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model**|Jennifer Drexler Fox et.al.|[2209.01250](http://arxiv.org/abs/2209.01250)|null|
932 | |**2022-08-28**|**Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks**|Lev Finkelstein et.al.|[2208.13183](http://arxiv.org/abs/2208.13183)|null|
933 | |**2022-10-04**|**Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale**|Aditya Agarwal et.al.|[2208.09796](http://arxiv.org/abs/2208.09796)|null|
934 | |**2022-08-21**|**Visualising Model Training via Vowel Space for Text-To-Speech Systems**|Binu Abeysinghe et.al.|[2208.09775](http://arxiv.org/abs/2208.09775)|**[link](https://github.com/babe269/performant)**|
935 | |**2022-08-15**|**Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0**|Mohammed Salah Al-Radhi et.al.|[2208.07122](http://arxiv.org/abs/2208.07122)|null|
936 | |**2022-12-28**|**Speech Synthesis with Mixed Emotions**|Kun Zhou et.al.|[2208.05890](http://arxiv.org/abs/2208.05890)|null|
937 | |**2022-08-03**|**A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis**|Qibing Bai et.al.|[2208.02189](http://arxiv.org/abs/2208.02189)|null|
938 | |**2022-07-29**|**Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation**|Giulia Comini et.al.|[2207.14607](http://arxiv.org/abs/2207.14607)|null|
939 | |**2022-07-25**|**Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis**|Raul Fernandez et.al.|[2207.12262](http://arxiv.org/abs/2207.12262)|null|
940 | |**2022-07-01**|**A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese**|Song Zhang et.al.|[2207.12089](http://arxiv.org/abs/2207.12089)|null|
941 | |**2022-07-20**|**When Is TTS Augmentation Through a Pivot Language Useful?**|Nathaniel Robinson et.al.|[2207.09889](http://arxiv.org/abs/2207.09889)|**[link](https://github.com/n8rob/multilingual_tts_augmentation)**|
942 | |**2022-07-11**|**LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech**|Harshvardhan Anand et.al.|[2207.07118](http://arxiv.org/abs/2207.07118)|null|
943 | |**2022-07-13**|**ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech**|Rongjie Huang et.al.|[2207.06389](http://arxiv.org/abs/2207.06389)|**[link](https://github.com/Rongjiehuang/ProDiff)**|
944 | |**2022-07-13**|**Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech**|Zhengxi Liu et.al.|[2207.06088](http://arxiv.org/abs/2207.06088)|null|
945 | |**2022-07-13**|**SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate**|Nabarun Goswami et.al.|[2207.06011](http://arxiv.org/abs/2207.06011)|null|
946 | |**2022-07-13**|**Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS**|Yookyung Shin et.al.|[2207.06000](http://arxiv.org/abs/2207.06000)|null|
947 | |**2022-07-13**|**A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System**|Yi-Chiao Wu et.al.|[2207.05913](http://arxiv.org/abs/2207.05913)|null|
948 | |**2022-07-12**|**Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition**|Rodolfo Zevallos et.al.|[2207.05498](http://arxiv.org/abs/2207.05498)|null|
949 | |**2022-07-12**|**End-to-end speech recognition modeling from de-identified data**|Martin Flechl et.al.|[2207.05469](http://arxiv.org/abs/2207.05469)|null|
950 | |**2022-07-11**|**Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data**|Naoki Makishima et.al.|[2207.04659](http://arxiv.org/abs/2207.04659)|null|
951 | |**2022-07-11**|**DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders**|Yanqing Liu et.al.|[2207.04646](http://arxiv.org/abs/2207.04646)|null|
952 | |**2023-01-02**|**Dreamento: an open-source dream engineering toolbox for sleep EEG wearables**|Mahdad Jafarzadeh Esfahani et.al.|[2207.03977](http://arxiv.org/abs/2207.03977)|**[link](https://github.com/dreamento/dreamento)**|
953 | |**2022-07-07**|**BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus**|Josh Meyer et.al.|[2207.03546](http://arxiv.org/abs/2207.03546)|**[link](https://github.com/alpoktem/bible2speechdb)**|
954 | |**2022-07-05**|**Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion**|Yi Lei et.al.|[2207.01832](http://arxiv.org/abs/2207.01832)|null|
955 | |**2022-07-04**|**BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model**|Brooke Stephenson et.al.|[2207.01718](http://arxiv.org/abs/2207.01718)|null|
956 | |**2022-07-04**|**Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)**|Ariadna Sanchez et.al.|[2207.01547](http://arxiv.org/abs/2207.01547)|null|
957 | |**2022-07-04**|**Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)**|Ziyao Zhang et.al.|[2207.01507](http://arxiv.org/abs/2207.01507)|null|
958 | |**2023-03-13**|**DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech**|Keon Lee et.al.|[2207.01063](http://arxiv.org/abs/2207.01063)|**[link](https://github.com/keonlee9420/DailyTalk)**|
959 | |**2022-07-02**|**Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need**|Daniel Korzekwa et.al.|[2207.00774](http://arxiv.org/abs/2207.00774)|null|
960 | |**2022-07-01**|**Building African Voices**|Perez Ogayo et.al.|[2207.00688](http://arxiv.org/abs/2207.00688)|**[link](https://github.com/neulab/africanvoices)**|
961 | |**2022-07-01**|**Automatic Evaluation of Speaker Similarity**|Deja Kamil et.al.|[2207.00344](http://arxiv.org/abs/2207.00344)|null|
962 | |**2022-08-03**|**Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding**|Wei-Ping Huang et.al.|[2206.15427](http://arxiv.org/abs/2206.15427)|null|
963 | |**2022-06-30**|**R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS**|Kyle Kastner et.al.|[2206.15276](http://arxiv.org/abs/2206.15276)|null|
964 | |**2022-07-01**|**Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems**|Hyun-Wook Yoon et.al.|[2206.15067](http://arxiv.org/abs/2206.15067)|null|
965 | |**2022-06-30**|**TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder**|Eunwoo Song et.al.|[2206.14984](http://arxiv.org/abs/2206.14984)|null|
966 | |**2022-06-29**|**Improving Deliberation by Text-Only and Semi-Supervised Training**|Ke Hu et.al.|[2206.14716](http://arxiv.org/abs/2206.14716)|null|
967 | |**2022-06-29**|**Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody**|Peter Makarov et.al.|[2206.14643](http://arxiv.org/abs/2206.14643)|null|
968 | |**2022-06-28**|**Expressive, Variable, and Controllable Duration Modelling in TTS**|Ammar Abbas et.al.|[2206.14165](http://arxiv.org/abs/2206.14165)|null|
969 | |**2022-06-28**|**Comparison of Speech Representations for the MOS Prediction System**|Aki Kunikoshi et.al.|[2206.13817](http://arxiv.org/abs/2206.13817)|null|
970 | |**2022-06-22**|**A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data**|Raviraj Joshi et.al.|[2206.13240](http://arxiv.org/abs/2206.13240)|null|
971 | |**2022-06-25**|**Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations**|Chin-Cheng Hsu et.al.|[2206.12662](http://arxiv.org/abs/2206.12662)|null|
972 | |**2022-10-21**|**Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech**|Florian Lux et.al.|[2206.12229](http://arxiv.org/abs/2206.12229)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
973 | |**2022-06-24**|**SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech**|Hyunjae Cho et.al.|[2206.12132](http://arxiv.org/abs/2206.12132)|null|
974 | |**2022-06-24**|**End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue**|Kentaro Mitsui et.al.|[2206.12040](http://arxiv.org/abs/2206.12040)|null|
975 | |**2022-05-29**|**Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning**|Sameea Naeem et.al.|[2206.11860](http://arxiv.org/abs/2206.11860)|null|
976 | |**2022-06-21**|**Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS**|Kenta Udagawa et.al.|[2206.10256](http://arxiv.org/abs/2206.10256)|null|
977 | |**2022-06-24**|**Towards Optimizing OCR for Accessibility**|Peya Mowar et.al.|[2206.10254](http://arxiv.org/abs/2206.10254)|null|
978 | |**2022-06-16**|**Automatic Prosody Annotation with Pre-Trained Text-Speech Model**|Ziqian Dai et.al.|[2206.07956](http://arxiv.org/abs/2206.07956)|**[link](https://github.com/daisyqk/automatic-prosody-annotation)**|
979 | |**2022-11-16**|**NatiQ: An End-to-end Text-to-Speech System for Arabic**|Ahmed Abdelali et.al.|[2206.07373](http://arxiv.org/abs/2206.07373)|null|
980 | |**2022-06-15**|**Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning**|Rui Liu et.al.|[2206.07229](http://arxiv.org/abs/2206.07229)|**[link](https://github.com/ttslr/strengthnet)**|
981 | |**2022-12-12**|**A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation**|Junhui Zhang et.al.|[2206.04922](http://arxiv.org/abs/2206.04922)|null|
982 | |**2022-06-09**|**Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos**|Alexander Waibel et.al.|[2206.04523](http://arxiv.org/abs/2206.04523)|null|
983 | |**2022-06-07**|**FlexLip: A Controllable Text-to-Lip System**|Dan Oneata et.al.|[2206.03206](http://arxiv.org/abs/2206.03206)|null|
984 | |**2022-10-11**|**UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder**|Jiachen Lian et.al.|[2206.02512](http://arxiv.org/abs/2206.02512)|null|
985 | |**2023-10-19**|**Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech**|Ziyue Jiang et.al.|[2206.02147](http://arxiv.org/abs/2206.02147)|**[link](https://github.com/zain-jiang/dict-tts)**|
986 | |**2022-11-02**|**AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation**|Kun Song et.al.|[2206.00208](http://arxiv.org/abs/2206.00208)|null|
987 | |**2022-05-31**|**Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish**|Alp Öktem et.al.|[2205.15599](http://arxiv.org/abs/2205.15599)|**[link](https://github.com/collectivat-dev/ladino_tts)**|
988 | |**2023-11-20**|**StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis**|Yinghao Aaron Li et.al.|[2205.15439](http://arxiv.org/abs/2205.15439)|**[link](https://github.com/yl4579/StyleTTS)**|
989 | |**2022-05-30**|**Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data**|Sungwon Kim et.al.|[2205.15370](http://arxiv.org/abs/2205.15370)|null|
990 | |**2022-05-26**|**QSpeech: Low-Qubit Quantum Speech Application Toolkit**|Zhenhou Hong et.al.|[2205.13221](http://arxiv.org/abs/2205.13221)|**[link](https://github.com/zhenhouhong/qspeech)**|
991 | |**2022-11-10**|**T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation**|Paul-Ambroise Duquenne et.al.|[2205.12216](http://arxiv.org/abs/2205.12216)|null|
992 | |**2022-05-20**|**PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit**|Hui Zhang et.al.|[2205.12007](http://arxiv.org/abs/2205.12007)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**|
993 | |**2022-05-24**|**TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS**|Xulong Zhang et.al.|[2205.11824](http://arxiv.org/abs/2205.11824)|null|
994 | |**2022-10-12**|**GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech**|Rongjie Huang et.al.|[2205.07211](http://arxiv.org/abs/2205.07211)|**[link](https://github.com/Rongjiehuang/GenerSpeech)**|
995 | |**2022-05-13**|**Talking Face Generation with Multilingual TTS**|Hyoung-Kyu Song et.al.|[2205.06421](http://arxiv.org/abs/2205.06421)|null|
996 | |**2022-05-10**|**NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality**|Xu Tan et.al.|[2205.04421](http://arxiv.org/abs/2205.04421)|**[link](https://github.com/microsoft/NeuralSpeech)**|
997 | |**2022-05-09**|**Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech**|Yang Li et.al.|[2205.04120](http://arxiv.org/abs/2205.04120)|**[link](https://github.com/neurowave-ai/cucvae-tts)**|
998 | |**2022-05-09**|**ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence**|Sangshin Oh et.al.|[2205.04104](http://arxiv.org/abs/2205.04104)|null|
999 | |**2022-07-14**|**Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss**|Efthymios Georgiou et.al.|[2204.13437](http://arxiv.org/abs/2204.13437)|null|
1000 | |**2022-04-25**|**SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech**|Zhenhui Ye et.al.|[2204.11792](http://arxiv.org/abs/2204.11792)|null|
1001 | |**2022-04-22**|**LibriS2S: A German-English Speech-to-Speech Translation Corpus**|Pedro Jeuris et.al.|[2204.10593](http://arxiv.org/abs/2204.10593)|**[link](https://github.com/pedrodke/libris2s)**|
1002 | |**2022-07-05**|**Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation**|Ryo Terashima et.al.|[2204.10020](http://arxiv.org/abs/2204.10020)|null|
1003 | |**2022-04-21**|**FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis**|Rongjie Huang et.al.|[2204.09934](http://arxiv.org/abs/2204.09934)|**[link](https://github.com/Rongjiehuang/FastDiff)**|
1004 | |**2022-04-20**|**Audio Deep Fake Detection System with Neural Stitching for ADD 2022**|Rui Yan et.al.|[2204.08720](http://arxiv.org/abs/2204.08720)|null|
1005 | |**2022-04-14**|**Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech**|Cong Zhang et.al.|[2204.07228](http://arxiv.org/abs/2204.07228)|null|
1006 | |**2022-12-09**|**Study of Indian English Pronunciation Variabilities relative to Received Pronunciation**|Priyanshi Pal et.al.|[2204.06502](http://arxiv.org/abs/2204.06502)|null|
1007 | |**2022-04-12**|**Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch**|Hanbin Bae et.al.|[2204.05753](http://arxiv.org/abs/2204.05753)|null|
1008 | |**2023-01-30**|**The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance**|Lin Zhang et.al.|[2204.05177](http://arxiv.org/abs/2204.05177)|null|
1009 | |**2022-10-27**|**Fine-grained Noise Control for Multispeaker Speech Synthesis**|Karolos Nikitaras et.al.|[2204.05070](http://arxiv.org/abs/2204.05070)|null|
1010 | |**2022-08-31**|**Karaoker: Alignment-free singing voice synthesis with speech training data**|Panos Kakoulidis et.al.|[2204.04127](http://arxiv.org/abs/2204.04127)|null|
1011 | |**2022-08-15**|**Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech**|Jae-Sung Bae et.al.|[2204.04004](http://arxiv.org/abs/2204.04004)|null|
1012 | |**2022-04-07**|**Arabic Text-To-Speech (TTS) Data Preparation**|Hala Al Masri et.al.|[2204.03255](http://arxiv.org/abs/2204.03255)|null|
1013 | |**2022-04-07**|**Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis**|Yutian Wang et.al.|[2204.03238](http://arxiv.org/abs/2204.03238)|null|
1014 | |**2022-08-24**|**SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis**|Georgia Maniati et.al.|[2204.03040](http://arxiv.org/abs/2204.03040)|null|
1015 | |**2022-09-13**|**Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation**|Sravya Popuri et.al.|[2204.02967](http://arxiv.org/abs/2204.02967)|null|
1016 | |**2022-07-02**|**Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification**|Jin Woo Lee et.al.|[2204.02639](http://arxiv.org/abs/2204.02639)|null|
1017 | |**2023-08-28**|**Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech**|Hyungchan Yoon et.al.|[2204.02172](http://arxiv.org/abs/2204.02172)|null|
1018 | |**2022-09-07**|**Deliberation Model for On-Device Spoken Language Understanding**|Duc Le et.al.|[2204.01893](http://arxiv.org/abs/2204.01893)|null|
1019 | |**2022-12-14**|**Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck**|Youngsik Eom et.al.|[2204.01387](http://arxiv.org/abs/2204.01387)|null|
1020 | |**2022-11-11**|**Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis**|Yixuan Zhou et.al.|[2204.00990](http://arxiv.org/abs/2204.00990)|null|
1021 | |**2022-06-30**|**VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature**|Chenpeng Du et.al.|[2204.00768](http://arxiv.org/abs/2204.00768)|null|
1022 | |**2022-04-01**|**AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios**|Yihan Wu et.al.|[2204.00436](http://arxiv.org/abs/2204.00436)|null|
1023 | |**2022-04-01**|**Text-To-Speech Data Augmentation for Low Resource Speech Recognition**|Rodolfo Zevallos et.al.|[2204.00291](http://arxiv.org/abs/2204.00291)|null|
1024 | |**2022-07-19**|**Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech**|Guangyan Zhang et.al.|[2203.17190](http://arxiv.org/abs/2203.17190)|null|
1025 | |**2022-03-31**|**An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer**|Wenlin Dai et.al.|[2203.16954](http://arxiv.org/abs/2203.16954)|**[link](https://github.com/thuhcsi/flattn)**|
1026 | |**2022-07-11**|**WavThruVec: Latent speech representation as intermediate features for neural speech synthesis**|Hubert Siuzdak et.al.|[2203.16930](http://arxiv.org/abs/2203.16930)|null|
1027 | |**2022-03-31**|**A Character-level Span-based Model for Mandarin Prosodic Structure Prediction**|Xueyuan Chen et.al.|[2203.16922](http://arxiv.org/abs/2203.16922)|**[link](https://github.com/thuhcsi/spanpsp)**|
1028 | |**2022-07-01**|**JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech**|Dan Lim et.al.|[2203.16852](http://arxiv.org/abs/2203.16852)|**[link](https://github.com/imdanboy/jets)**|
1029 | |**2022-03-31**|**Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset**|Zehui Yang et.al.|[2203.16844](http://arxiv.org/abs/2203.16844)|null|
1030 | |**2022-03-31**|**NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism**|Jingbei Li et.al.|[2203.16838](http://arxiv.org/abs/2203.16838)|**[link](https://github.com/thuhcsi/neufa)**|
1031 | |**2022-03-31**|**Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition**|Anirudh Gupta et.al.|[2203.16823](http://arxiv.org/abs/2203.16823)|null|
1032 | |**2022-04-21**|**Does Audio Deepfake Detection Generalize?**|Nicolas M. Müller et.al.|[2203.16263](http://arxiv.org/abs/2203.16263)|null|
1033 | |**2022-03-30**|**End to End Lip Synchronization with a Temporal AutoEncoder**|Yoav Shalev et.al.|[2203.16224](http://arxiv.org/abs/2203.16224)|**[link](https://github.com/itsyoavshalev/end-to-end-lip-synchronization-with-a-temporal-autoencoder)**|
1034 | |**2022-08-15**|**Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition**|Junrui Ni et.al.|[2203.15796](http://arxiv.org/abs/2203.15796)|**[link](https://github.com/lwang114/unsuptts)**|
1035 | |**2022-06-29**|**DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning**|Takaaki Saeki et.al.|[2203.15683](http://arxiv.org/abs/2203.15683)|null|
1036 | |**2022-11-05**|**Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation**|Rendi Chevi et.al.|[2203.15643](http://arxiv.org/abs/2203.15643)|**[link](https://github.com/rendchevi/nix-tts)**|
1037 | |**2022-10-06**|**Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus**|Minchan Kim et.al.|[2203.15447](http://arxiv.org/abs/2203.15447)|null|
1038 | |**2022-07-11**|**VoiceMe: Personalized voice generation in TTS**|Pol van Rijn et.al.|[2203.15379](http://arxiv.org/abs/2203.15379)|**[link](https://github.com/polvanrijn/voiceme)**|
1039 |
1040 | [contributors-shield]: https://img.shields.io/github/contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1041 | [contributors-url]: https://github.com/Vincentqyw/cv-arxiv-daily/graphs/contributors
1042 | [forks-shield]: https://img.shields.io/github/forks/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1043 | [forks-url]: https://github.com/Vincentqyw/cv-arxiv-daily/network/members
1044 | [stars-shield]: https://img.shields.io/github/stars/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1045 | [stars-url]: https://github.com/Vincentqyw/cv-arxiv-daily/stargazers
1046 | [issues-shield]: https://img.shields.io/github/issues/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1047 | [issues-url]: https://github.com/Vincentqyw/cv-arxiv-daily/issues
1048 |
1049 |
--------------------------------------------------------------------------------