├── docs
    ├── tts-arxiv-daily-wechat.json
    ├── _config.yml
    └── index.md
├── requirements.txt
├── assets
    ├── 4-ga-2-1.png
    ├── 4-ga-3-1.png
    ├── 4-ga-5-1.png
    ├── 4-ga-7.png
    ├── 4-ga-8.png
    ├── 4-ga-9.png
    └── 5-pages-1.png
├── .github
    ├── ISSUE_TEMPLATE
    │   ├── config.yml
    │   ├── feature_request.md
    │   ├── question.md
    │   └── bug_report.md
    ├── PULL_REQUEST_TEMPLATE.md
    └── workflows
    │   ├── cv-arxiv-daily.yml
    │   └── update_paper_links.yml
├── .gitignore
├── config.yaml
├── CODE_OF_CONDUCT.md
├── LICENSE
└── daily_arxiv.py


/docs/tts-arxiv-daily-wechat.json:
--------------------------------------------------------------------------------
1 | {}


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | arxiv
3 | pyyaml


--------------------------------------------------------------------------------
/assets/4-ga-2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-2-1.png


--------------------------------------------------------------------------------
/assets/4-ga-3-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-3-1.png


--------------------------------------------------------------------------------
/assets/4-ga-5-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-5-1.png


--------------------------------------------------------------------------------
/assets/4-ga-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-7.png


--------------------------------------------------------------------------------
/assets/4-ga-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-8.png


--------------------------------------------------------------------------------
/assets/4-ga-9.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/4-ga-9.png


--------------------------------------------------------------------------------
/assets/5-pages-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/wetdog/TTS-arxiv-daily/master/assets/5-pages-1.png


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/config.yml:
--------------------------------------------------------------------------------
1 | # Configuration: https://help.github.com/en/github/building-a-strong-community/configuring-issue-templates-for-your-repository
2 | 
3 | blank_issues_enabled: false
4 | 


--------------------------------------------------------------------------------
/.github/PULL_REQUEST_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | 
3 | <!-- Add a more detailed description of the changes if needed. -->
4 | 
5 | ## Related Issue
6 | 
7 | <!-- If your PR refers to a related issue, link it here. -->
8 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | CMakeLists.txt.user
 2 | CMakeLists_modified.txt
 3 | 
 4 | .DS_Store
 5 | 
 6 | build/
 7 | 
 8 | lib/
 9 | bin/
10 | 
11 | cmake_modules/
12 | cmake-build-debug/
13 | .idea/
14 | .vscode/
15 | *.pyc
16 | 
17 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: 🚀 Feature request
 3 | about: Suggest an idea for this project 🏖
 4 | title: ""
 5 | labels: enhancement
 6 | assignees:
 7 | ---
 8 | 
 9 | ## 🚀 Feature Request
10 | 
11 | <!-- A clear and concise description of the feature proposal. -->
12 | 
13 | ## 📎 Additional context
14 | 
15 | <!-- Add any other context or screenshots about the feature request here. -->
16 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: ❓ Question
 3 | about: Ask a question about this project 🎓
 4 | title: ""
 5 | labels: question
 6 | assignees:
 7 | ---
 8 | 
 9 | ## Checklist
10 | 
11 | <!-- Mark with an `x` all the checkboxes that apply (like `[x]`) -->
12 | 
13 | - [ ] I've searched the project's [`issues`]
14 | 
15 | ## ❓ Question
16 | 
17 | <!-- What is your question -->
18 | 
19 | How can I [...]?
20 | 
21 | Is it possible to [...]?
22 | 
23 | ## 📎 Additional context
24 | 
25 | <!-- Add any other context or screenshots about the feature request here. -->
26 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: 🐛 Bug report
 3 | about: If something isn't working 🔧
 4 | title: ""
 5 | labels: bug
 6 | assignees:
 7 | ---
 8 | 
 9 | ## 🐛 Bug Report
10 | 
11 | <!-- A clear and concise description of what the bug is. -->
12 | 
13 | ## 🔬 How To Reproduce
14 | 
15 | Steps to reproduce the behavior:
16 | 
17 | 1. ...
18 | 
19 | ### Environment
20 | 
21 | - OS: [e.g. Linux / Windows / macOS]
22 | - Python version, get it with:
23 | 
24 | ```bash
25 | python --version
26 | ```
27 | 
28 | ## 📎 Additional context
29 | 
30 | <!-- Add any other context about the problem here. -->
31 | 


--------------------------------------------------------------------------------
/docs/_config.yml:
--------------------------------------------------------------------------------
 1 | title: CV Arxiv Daily
  2 | description: Automatically Update CV Papers Daily using Github Actions (Updates Every 12 Hours)
 3 | show_downloads: true
 4 | 
 5 | github:
 6 |   zip_url: https://github.com/Vincentqyw/cv-arxiv-daily
 7 |   another_url: https://github.com/Vincentqyw/cv-arxiv-daily
 8 | 
 9 | ## add remote theme
10 | remote_theme: jekyll/minima@v2.5.1
11 | 
12 | # minima:
13 | skin: dark
14 | social_links:
15 |   twitter: AlphaRealcat
16 |   github: vincentqyw
17 | 
18 | plugins:
19 |  - jekyll-remote-theme # add this line to the plugins list if you already have one
20 |  - jekyll-feed
21 |  - jekyll-seo-tag
22 | 


--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
 1 | # TODO: add papers by configuration file
 2 | base_url: "https://arxiv.paperswithcode.com/api/v0/papers/"
 3 | user_name: "wetdog"
  4 | repo_name: "TTS-arxiv-daily"
 5 | show_authors: True
 6 | show_links: True
 7 | show_badge: True
 8 | max_results: 10
 9 | 
10 | publish_readme: True
11 | publish_gitpage: True
12 | publish_wechat: False
13 | 
14 | # file paths
15 | json_readme_path: './docs/tts-arxiv-daily.json'
16 | json_gitpage_path: './docs/tts-arxiv-daily-web.json'
17 | json_wechat_path: './docs/tts-arxiv-daily-wechat.json'
18 | 
19 | md_readme_path: 'README.md'
20 | md_gitpage_path: './docs/index.md'
21 | md_wechat_path: './docs/wechat.md'
22 | 
23 | # keywords to search
24 | keywords:
25 |     "TTS":
26 |         filters: ["TTS", "Text to speech"]
27 | 
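
Note on the `keywords` section: each topic lists `filters`, which `daily_arxiv.py` joins into one search string. The sketch below shows the joining logic as it appears to be intended, assuming terms are separated by ` OR ` and multi-word terms are double-quoted. `build_query` is a hypothetical helper name used only for illustration. It is not a function from the repository.

```python
def build_query(filters):
    """Join filter terms with ' OR ', wrapping multi-word terms in double quotes."""
    terms = [f'"{term}"' if len(term.split()) > 1 else term for term in filters]
    return " OR ".join(terms)


# Same shape as the `keywords` mapping in config.yaml, written as a plain dict
# so the example does not depend on PyYAML.
keywords = {"TTS": {"filters": ["TTS", "Text to speech"]}}

queries = {topic: build_query(spec["filters"]) for topic, spec in keywords.items()}
print(queries)  # {'TTS': 'TTS OR "Text to speech"'}
```

Adding a topic would then only require a new key under `keywords` with its own `filters` list.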


--------------------------------------------------------------------------------
/.github/workflows/cv-arxiv-daily.yml:
--------------------------------------------------------------------------------
 1 | # This is a basic workflow to help you get started with Actions
 2 | 
 3 | name: Run Arxiv Papers Daily
 4 | 
 5 | # Controls when the workflow will run
 6 | on:
 7 |   # Allows you to run this workflow manually from the Actions tab
 8 |   workflow_dispatch:
 9 |   schedule:
10 |     - cron:  "0 0/12 * * *"  #'*/60 * * * *'
11 |   # Triggers the workflow on push or pull request events but only for the main branch
12 | #   push:
13 | #     branches:
14 | #     - main
15 | 
16 | env:
17 | 
18 |   GITHUB_USER_NAME: wetdog
19 |   GITHUB_USER_EMAIL: jose091@gmail.com
20 |   
21 |   
22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel
23 | jobs:
24 |   # This workflow contains a single job called "build"
25 |   build:
26 |     name: update
27 |     # The type of runner that the job will run on
28 |     runs-on: ubuntu-latest
29 |     
30 |     # Steps represent a sequence of tasks that will be executed as part of the job
31 |     steps:
32 |       - name: Checkout
33 |         uses: actions/checkout@v3
34 |         
35 |       - name: Set up Python Env
36 |         uses: actions/setup-python@v4
37 |         with:
38 |           python-version: '3.10'
39 |           #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
40 |       - name: Install dependencies
41 |         run: |
42 |           python -m pip install --upgrade pip
43 |           pip install arxiv
44 |           pip install requests
45 |           pip install pyyaml
46 |           
47 |       - name: Run daily arxiv 
48 |         run: |
49 |           python daily_arxiv.py
50 |           
51 |       - name: Push new cv-arxiv-daily.md
52 |         uses: github-actions-x/commit@v2.9
53 |         with:
54 |           github-token: ${{ secrets.GITHUB_TOKEN }}
55 |           commit-message: "Github Action Automatic Update TTS Arxiv Papers"
56 |           files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md
57 |           rebase: 'true'
58 |           name: ${{ env.GITHUB_USER_NAME }}
59 |           email: ${{ env.GITHUB_USER_EMAIL }}
60 | 
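
Note on the schedule: the hour field `0/12` in the cron expression above starts at hour 0 and steps by 12, so the job is scheduled twice a day (GitHub Actions evaluates schedules in UTC). The sketch below expands such an hour field into the hours it matches. `expand_hour_field` is a hypothetical helper that covers only the `N`, `*`, `*/S`, and `N/S` forms, not the full cron grammar.

```python
def expand_hour_field(field):
    """Expand a cron hour field such as '0/12', '*/8', '*', or '8' into a list of hours."""
    if "/" in field:
        start, step = field.split("/")
        first = 0 if start == "*" else int(start)
        return list(range(first, 24, int(step)))
    if field == "*":
        return list(range(24))
    return [int(field)]


print(expand_hour_field("0/12"))  # [0, 12]
print(expand_hour_field("*/8"))   # [0, 8, 16]
```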


--------------------------------------------------------------------------------
/.github/workflows/update_paper_links.yml:
--------------------------------------------------------------------------------
 1 | # This is a basic workflow to help you get started with Actions
 2 | 
 3 | name: Run Update Paper Links Weekly
 4 | 
 5 | # Controls when the workflow will run
 6 | on:
 7 |   # Allows you to run this workflow manually from the Actions tab
 8 |   workflow_dispatch:
 9 |   schedule:
10 |     - cron:  "0 8 * * 1"  #Run At 08:00 on Monday
11 |   # Triggers the workflow on push or pull request events but only for the main branch
12 | #   push:
13 | #     branches:
14 | #     - main
15 | 
16 | env:
17 | 
18 |   GITHUB_USER_NAME: wetdog
19 |   GITHUB_USER_EMAIL: jose091@gmail.com
20 |   
21 |   
22 | # A workflow run is made up of one or more jobs that can run sequentially or in parallel
23 | jobs:
24 |   # This workflow contains a single job called "build"
25 |   build:
26 |     name: update
27 |     # The type of runner that the job will run on
28 |     runs-on: ubuntu-latest
29 |     
30 |     # Steps represent a sequence of tasks that will be executed as part of the job
31 |     steps:
32 |       - name: Checkout
33 |         uses: actions/checkout@v3
34 |         
35 |       - name: Set up Python Env
36 |         uses: actions/setup-python@v4
37 |         with:
38 |           python-version: '3.10'
39 |           #architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
40 |       - name: Install dependencies
41 |         run: |
42 |           python -m pip install --upgrade pip
43 |           pip install arxiv
44 |           pip install requests
45 |           pip install pyyaml
46 |           
47 |       - name: Run daily arxiv 
48 |         run: |
49 |           python daily_arxiv.py --update_paper_links
50 |           
51 |       - name: Push new tts-arxiv-daily.md
52 |         uses: github-actions-x/commit@v2.9
53 |         with:
54 |           github-token: ${{ secrets.GITHUB_TOKEN }}
 55 |           commit-message: "Github Action Automatic Update TTS Arxiv Papers"
56 |           files: README.md docs/tts-arxiv-daily.json docs/tts-arxiv-daily-web.json docs/index.md docs/tts-arxiv-daily-wechat.json docs/wechat.md
57 |           rebase: 'true'
58 |           name: ${{ env.GITHUB_USER_NAME }}
59 |           email: ${{ env.GITHUB_USER_EMAIL }}
60 | 


--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
  1 | # Contributor Covenant Code of Conduct
  2 | 
  3 | ## Our Pledge
  4 | 
  5 | We as members, contributors, and leaders pledge to make participation in our
  6 | community a harassment-free experience for everyone, regardless of age, body
  7 | size, visible or invisible disability, ethnicity, sex characteristics, gender
  8 | identity and expression, level of experience, education, socio-economic status,
  9 | nationality, personal appearance, race, religion, or sexual identity
 10 | and orientation.
 11 | 
 12 | We pledge to act and interact in ways that contribute to an open, welcoming,
 13 | diverse, inclusive, and healthy community.
 14 | 
 15 | ## Our Standards
 16 | 
 17 | Examples of behavior that contributes to a positive environment for our
 18 | community include:
 19 | 
 20 | * Demonstrating empathy and kindness toward other people
 21 | * Being respectful of differing opinions, viewpoints, and experiences
 22 | * Giving and gracefully accepting constructive feedback
 23 | * Accepting responsibility and apologizing to those affected by our mistakes,
 24 |   and learning from the experience
 25 | * Focusing on what is best not just for us as individuals, but for the
 26 |   overall community
 27 | 
 28 | Examples of unacceptable behavior include:
 29 | 
 30 | * The use of sexualized language or imagery, and sexual attention or
 31 |   advances of any kind
 32 | * Trolling, insulting or derogatory comments, and personal or political attacks
 33 | * Public or private harassment
 34 | * Publishing others' private information, such as a physical or email
 35 |   address, without their explicit permission
 36 | * Other conduct which could reasonably be considered inappropriate in a
 37 |   professional setting
 38 | 
 39 | ## Enforcement Responsibilities
 40 | 
 41 | Community leaders are responsible for clarifying and enforcing our standards of
 42 | acceptable behavior and will take appropriate and fair corrective action in
 43 | response to any behavior that they deem inappropriate, threatening, offensive,
 44 | or harmful.
 45 | 
 46 | Community leaders have the right and responsibility to remove, edit, or reject
 47 | comments, commits, code, wiki edits, issues, and other contributions that are
 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation
 49 | decisions when appropriate.
 50 | 
 51 | ## Scope
 52 | 
 53 | This Code of Conduct applies within all community spaces, and also applies when
 54 | an individual is officially representing the community in public spaces.
 55 | Examples of representing our community include using an official e-mail address,
 56 | posting via an official social media account, or acting as an appointed
 57 | representative at an online or offline event.
 58 | 
 59 | ## Enforcement
 60 | 
 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
 62 | reported to the community leaders responsible for enforcement at
 63 | alpharealcat@gmail.com.
 64 | All complaints will be reviewed and investigated promptly and fairly.
 65 | 
 66 | All community leaders are obligated to respect the privacy and security of the
 67 | reporter of any incident.
 68 | 
 69 | ## Enforcement Guidelines
 70 | 
 71 | Community leaders will follow these Community Impact Guidelines in determining
 72 | the consequences for any action they deem in violation of this Code of Conduct:
 73 | 
 74 | ### 1. Correction
 75 | 
 76 | **Community Impact**: Use of inappropriate language or other behavior deemed
 77 | unprofessional or unwelcome in the community.
 78 | 
 79 | **Consequence**: A private, written warning from community leaders, providing
 80 | clarity around the nature of the violation and an explanation of why the
 81 | behavior was inappropriate. A public apology may be requested.
 82 | 
 83 | ### 2. Warning
 84 | 
 85 | **Community Impact**: A violation through a single incident or series
 86 | of actions.
 87 | 
 88 | **Consequence**: A warning with consequences for continued behavior. No
 89 | interaction with the people involved, including unsolicited interaction with
 90 | those enforcing the Code of Conduct, for a specified period of time. This
 91 | includes avoiding interactions in community spaces as well as external channels
 92 | like social media. Violating these terms may lead to a temporary or
 93 | permanent ban.
 94 | 
 95 | ### 3. Temporary Ban
 96 | 
 97 | **Community Impact**: A serious violation of community standards, including
 98 | sustained inappropriate behavior.
 99 | 
100 | **Consequence**: A temporary ban from any sort of interaction or public
101 | communication with the community for a specified period of time. No public or
102 | private interaction with the people involved, including unsolicited interaction
103 | with those enforcing the Code of Conduct, is allowed during this period.
104 | Violating these terms may lead to a permanent ban.
105 | 
106 | ### 4. Permanent Ban
107 | 
108 | **Community Impact**: Demonstrating a pattern of violation of community
109 | standards, including sustained inappropriate behavior, harassment of an
110 | individual, or aggression toward or disparagement of classes of individuals.
111 | 
112 | **Consequence**: A permanent ban from any sort of public interaction within
113 | the community.
114 | 
115 | ## Attribution
116 | 
117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage],
118 | version 2.0, available at
119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
120 | 
121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct
122 | enforcement ladder](https://github.com/mozilla/diversity).
123 | 
124 | [homepage]: https://www.contributor-covenant.org
125 | 
126 | For answers to common questions about this code of conduct, see the FAQ at
127 | https://www.contributor-covenant.org/faq. Translations are available at
128 | https://www.contributor-covenant.org/translations.
129 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/daily_arxiv.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | import json
  4 | import arxiv
  5 | import yaml
  6 | import logging
  7 | import argparse
  8 | import datetime
  9 | import requests
 10 | 
 11 | logging.basicConfig(format='[%(asctime)s %(levelname)s] %(message)s',
 12 |                     datefmt='%m/%d/%Y %H:%M:%S',
 13 |                     level=logging.INFO)
 14 | 
 15 | base_url = "https://arxiv.paperswithcode.com/api/v0/papers/"
 16 | github_url = "https://api.github.com/search/repositories"
 17 | arxiv_url = "http://arxiv.org/"
 18 | 
 19 | def load_config(config_file:str) -> dict:
 20 |     '''
 21 |     config_file: input config file path
 22 |     return: a dict of configuration
 23 |     '''
 24 |     # make filters pretty
 25 |     def pretty_filters(**config) -> dict:
 26 |         keywords = dict()
 27 |         ESCAPE = '\"'
 28 |         QUOTA = '' # NO-USE
 29 |         OR = ' OR ' # spaces keep adjacent filters from fusing into one term
 30 |         def parse_filters(filters:list):
 31 |             ret = ''
 32 |             for idx in range(0,len(filters)):
 33 |                 filter = filters[idx]
 34 |                 if len(filter.split()) > 1:
 35 |                     ret += (ESCAPE + filter + ESCAPE)
 36 |                 else:
 37 |                     ret += (QUOTA + filter + QUOTA)   
 38 |                 if idx != len(filters) - 1:
 39 |                     ret += OR
 40 |             return ret
 41 |         for k,v in config['keywords'].items():
 42 |             keywords[k] = parse_filters(v['filters'])
 43 |         return keywords
 44 |     with open(config_file,'r') as f:
 45 |         config = yaml.load(f,Loader=yaml.FullLoader) 
 46 |         config['kv'] = pretty_filters(**config)
 47 |         logging.info(f'config = {config}')
 48 |     return config 
 49 | 
 50 | def get_authors(authors, first_author = False):
 51 |     output = str()
 52 |     if not first_author:
 53 |         output = ", ".join(str(author) for author in authors)
 54 |     else:
 55 |         output = str(authors[0])
 56 |     return output
 57 | def sort_papers(papers):
 58 |     output = dict()
 59 |     keys = list(papers.keys())
 60 |     keys.sort(reverse=True)
 61 |     for key in keys:
 62 |         output[key] = papers[key]
 63 |     return output    
 64 | 
 65 | 
 66 | def get_code_link(qword:str) -> str:
 67 |     """
 68 |     This short function was auto-generated by ChatGPT. 
 69 |     I only renamed some params and added some comments.
 70 |     @param qword: query string, eg. arxiv ids and paper titles
 71 |     @return paper_code in github: string, if not found, return None
 72 |     """
 73 |     # query = f"arxiv:{arxiv_id}"
 74 |     query = f"{qword}"
 75 |     params = {
 76 |         "q": query,
 77 |         "sort": "stars",
 78 |         "order": "desc"
 79 |     }
 80 |     r = requests.get(github_url, params=params, timeout=30)
 81 |     results = r.json()
 82 |     code_link = None
 83 |     if results.get("total_count", 0) > 0:  # error responses carry no total_count
 84 |         code_link = results["items"][0]["html_url"]
 85 |     return code_link
 86 |   
 87 | def get_daily_papers(topic,query="slam", max_results=2):
 88 |     """
 89 |     @param topic: str
 90 |     @param query: str
 91 |     @return paper_with_code: dict
 92 |     """
 93 |     # output 
 94 |     content = dict() 
 95 |     content_to_web = dict()
 96 |     search_engine = arxiv.Search(
 97 |         query = query,
 98 |         max_results = max_results,
 99 |         sort_by = arxiv.SortCriterion.SubmittedDate
100 |     )
101 | 
102 |     for result in search_engine.results():
103 | 
104 |         paper_id            = result.get_short_id()
105 |         paper_title         = result.title
106 |         paper_url           = result.entry_id
107 |         code_url            = base_url + paper_id #TODO
108 |         paper_abstract      = result.summary.replace("\n"," ")
109 |         paper_authors       = get_authors(result.authors)
110 |         paper_first_author  = get_authors(result.authors,first_author = True)
111 |         primary_category    = result.primary_category
112 |         publish_time        = result.published.date()
113 |         update_time         = result.updated.date()
114 |         comments            = result.comment
115 | 
116 |         logging.info(f"Time = {update_time} title = {paper_title} author = {paper_first_author}")
117 | 
118 |         # eg: 2108.09112v1 -> 2108.09112
119 |         # Strip only a trailing version suffix such as 'v1'.
120 |         # paper_id.find('v') would also match the 'v' inside
121 |         # old-style ids like 'solv-int/9901001v1'.
122 |         paper_key = re.sub(r'v\d+$', '', paper_id)
123 | 
124 |         paper_url = arxiv_url + 'abs/' + paper_key
125 |         
126 |         try:
127 |             # source code link    
128 |             r = requests.get(code_url).json()
129 |             repo_url = None
130 |             if "official" in r and r["official"]:
131 |                 repo_url = r["official"]["url"]
132 |             # TODO: not found, two more chances  
133 |             # else: 
134 |             #    repo_url = get_code_link(paper_title)
135 |             #    if repo_url is None:
136 |             #        repo_url = get_code_link(paper_key)
137 |             if repo_url is not None:
138 |                 content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|**[link]({})**|\n".format(
139 |                        update_time,paper_title,paper_first_author,paper_key,paper_url,repo_url)
140 |                 content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({}), Code: **[{}]({})**".format(
141 |                        update_time,paper_title,paper_first_author,paper_url,paper_url,repo_url,repo_url)
142 | 
143 |             else:
144 |                 content[paper_key] = "|**{}**|**{}**|{} et.al.|[{}]({})|null|\n".format(
145 |                        update_time,paper_title,paper_first_author,paper_key,paper_url)
146 |                 content_to_web[paper_key] = "- {}, **{}**, {} et.al., Paper: [{}]({})".format(
147 |                        update_time,paper_title,paper_first_author,paper_url,paper_url)
148 | 
149 |             # TODO: select useful comments
150 |             comments = None
151 |             if comments != None:
152 |                 content_to_web[paper_key] += f", {comments}\n"
153 |             else:
154 |                 content_to_web[paper_key] += f"\n"
155 | 
156 |         except Exception as e:
157 |             logging.error(f"exception: {e} with id: {paper_key}")
158 | 
159 |     data = {topic:content}
160 |     data_web = {topic:content_to_web}
161 |     return data,data_web 
162 | 
163 | def update_paper_links(filename):
164 |     '''
165 |     weekly update paper links in json file 
166 |     '''
167 |     def parse_arxiv_string(s):
168 |         parts = s.split("|")
169 |         date = parts[1].strip()
170 |         title = parts[2].strip()
171 |         authors = parts[3].strip()
172 |         arxiv_id = parts[4].strip()
173 |         code = parts[5].strip()
174 |         arxiv_id = re.sub(r'v\d+', '', arxiv_id)
175 |         return date,title,authors,arxiv_id,code
176 | 
177 |     with open(filename,"r") as f:
178 |         content = f.read()
179 |         if not content:
180 |             m = {}
181 |         else:
182 |             m = json.loads(content)
183 |             
184 |         json_data = m.copy() 
185 | 
186 |         for keywords,v in json_data.items():
187 |             logging.info(f'keywords = {keywords}')
188 |             for paper_id,contents in v.items():
189 |                 contents = str(contents)
190 | 
191 |                 update_time, paper_title, paper_first_author, paper_url, code_url = parse_arxiv_string(contents)
192 | 
193 |                 contents = "|{}|{}|{}|{}|{}|\n".format(update_time,paper_title,paper_first_author,paper_url,code_url)
194 |                 json_data[keywords][paper_id] = str(contents)
195 |                 logging.info(f'paper_id = {paper_id}, contents = {contents}')
196 |                 
197 |                 valid_link = '|null|' not in contents
198 |                 if valid_link:
199 |                     continue
200 |                 try:
201 |                     code_url = base_url + paper_id #TODO
202 |                     r = requests.get(code_url).json()
203 |                     repo_url = None
204 |                     if "official" in r and r["official"]:
205 |                         repo_url = r["official"]["url"]
206 |                         if repo_url is not None:
207 |                             new_cont = contents.replace('|null|',f'|**[link]({repo_url})**|')
208 |                             logging.info(f'ID = {paper_id}, contents = {new_cont}')
209 |                             json_data[keywords][paper_id] = str(new_cont)
210 | 
211 |                 except Exception as e:
212 |                     logging.error(f"exception: {e} with id: {paper_id}")
213 |         # dump to json file
214 |         with open(filename,"w") as f:
215 |             json.dump(json_data,f)
216 | 
217 | def update_json_file(filename,data_dict):
218 |     '''
219 |     daily update json file using data_dict
220 |     '''
221 |     with open(filename,"r") as f:
222 |         content = f.read()
223 |         if not content:
224 |             m = {}
225 |         else:
226 |             m = json.loads(content)
227 |             
228 |     json_data = m.copy() 
229 |     
230 |     # update papers in each keywords         
231 |     for data in data_dict:
232 |         for keyword in data.keys():
233 |             papers = data[keyword]
234 | 
235 |             if keyword in json_data.keys():
236 |                 json_data[keyword].update(papers)
237 |             else:
238 |                 json_data[keyword] = papers
239 | 
240 |     with open(filename,"w") as f:
241 |         json.dump(json_data,f)
242 |     
243 | def json_to_md(filename,md_filename,
244 |                task = '',
245 |                to_web = False, 
246 |                use_title = True, 
247 |                use_tc = True,
248 |                show_badge = True,
249 |                use_b2t = True):
250 |     """
251 |     @param filename: str
252 |     @param md_filename: str
253 |     @return None
254 |     """
255 |     def pretty_math(s:str) -> str:
256 |         ret = ''
257 |         match = re.search(r"\$.*\$", s)
258 |         if match is None:
259 |             return s
260 |         math_start,math_end = match.span()
261 |         space_trail = space_leading = ''
262 |         if math_start > 0 and s[math_start-1] not in (' ', '*'): space_trail = ' '
263 |         if math_end < len(s) and s[math_end] not in (' ', '*'): space_leading = ' '
264 |         ret += s[:math_start] 
265 |         ret += f'{space_trail}${match.group()[1:-1].strip()}${space_leading}' 
266 |         ret += s[math_end:]
267 |         return ret
268 |   
269 |     DateNow = datetime.date.today()
270 |     DateNow = str(DateNow)
271 |     DateNow = DateNow.replace('-','.')
272 |     
273 |     with open(filename,"r") as f:
274 |         content = f.read()
275 |         if not content:
276 |             data = {}
277 |         else:
278 |             data = json.loads(content)
279 | 
280 |     # clean README.md if daily already exist else create it
281 |     with open(md_filename,"w+") as f:
282 |         pass
283 | 
284 |     # write data into README.md
285 |     with open(md_filename,"a+") as f:
286 | 
287 |         if (use_title == True) and (to_web == True):
288 |             f.write("---\n" + "layout: default\n" + "---\n\n")
289 |         
290 |         if show_badge == True:
291 |             f.write(f"[![Contributors][contributors-shield]][contributors-url]\n")
292 |             f.write(f"[![Forks][forks-shield]][forks-url]\n")
293 |             f.write(f"[![Stargazers][stars-shield]][stars-url]\n")
294 |             f.write(f"[![Issues][issues-shield]][issues-url]\n\n")    
295 |                 
296 |         if use_title == True:
297 |             #f.write(("<p align="center"><h1 align="center"><br><ins>CV-ARXIV-DAILY"
298 |             #         "</ins><br>Automatically Update CV Papers Daily</h1></p>\n"))
299 |             f.write("## Updated on " + DateNow + "\n")
300 |         else:
301 |             f.write("> Updated on " + DateNow + "\n")
302 | 
303 |         # TODO: add usage
304 |         f.write("> Usage instructions: [here](./docs/README.md#usage)\n\n")
305 |         f.write("> This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily)\n\n")
306 | 
307 |         #Add: table of contents
308 |         if use_tc == True:
309 |             f.write("<details>\n")
310 |             f.write("  <summary>Table of Contents</summary>\n")
311 |             f.write("  <ol>\n")
312 |             for keyword in data.keys():
313 |                 day_content = data[keyword]
314 |                 if not day_content:
315 |                     continue
316 |                 kw = keyword.replace(' ','-')      
317 |                 f.write(f"    <li><a href=#{kw.lower()}>{keyword}</a></li>\n")
318 |             f.write("  </ol>\n")
319 |             f.write("</details>\n\n")
320 |         
321 |         for keyword in data.keys():
322 |             day_content = data[keyword]
323 |             if not day_content:
324 |                 continue
325 |             # the head of each part
326 |             f.write(f"## {keyword}\n\n")
327 | 
328 |             if use_title == True :
329 |                 if to_web == False:
330 |                     f.write("|Publish Date|Title|Authors|PDF|Code|\n" + "|---|---|---|---|---|\n")
331 |                 else:
332 |                     f.write("| Publish Date | Title | Authors | PDF | Code |\n")
333 |                     f.write("|:---------|:-----------------------|:---------|:------|:------|\n")
334 | 
335 |             # sort papers by date
336 |             day_content = sort_papers(day_content)
337 |         
338 |             for _,v in day_content.items():
339 |                 if v is not None:
340 |                     f.write(pretty_math(v)) # make latex pretty
341 | 
342 |             f.write(f"\n")
343 |             
344 |             #Add: back to top
345 |             if use_b2t:
346 |                 top_info = f"#Updated on {DateNow}"
347 |                 top_info = top_info.replace(' ','-').replace('.','')
348 |                 f.write(f"<p align=right>(<a href={top_info.lower()}>back to top</a>)</p>\n\n")
349 |             
350 |         if show_badge == True:
351 |             # we don't like long string, break it!
352 |             f.write((f"[contributors-shield]: https://img.shields.io/github/"
353 |                      f"contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge\n"))
354 |             f.write((f"[contributors-url]: https://github.com/Vincentqyw/"
355 |                      f"cv-arxiv-daily/graphs/contributors\n"))
356 |             f.write((f"[forks-shield]: https://img.shields.io/github/forks/Vincentqyw/"
357 |                      f"cv-arxiv-daily.svg?style=for-the-badge\n"))
358 |             f.write((f"[forks-url]: https://github.com/Vincentqyw/"
359 |                      f"cv-arxiv-daily/network/members\n"))
360 |             f.write((f"[stars-shield]: https://img.shields.io/github/stars/Vincentqyw/"
361 |                      f"cv-arxiv-daily.svg?style=for-the-badge\n"))
362 |             f.write((f"[stars-url]: https://github.com/Vincentqyw/"
363 |                      f"cv-arxiv-daily/stargazers\n"))
364 |             f.write((f"[issues-shield]: https://img.shields.io/github/issues/Vincentqyw/"
365 |                      f"cv-arxiv-daily.svg?style=for-the-badge\n"))
366 |             f.write((f"[issues-url]: https://github.com/Vincentqyw/"
367 |                      f"cv-arxiv-daily/issues\n\n"))
368 |                 
369 |     logging.info(f"{task} finished")        
370 | 
371 | def demo(**config):
372 |     # TODO: use config
373 |     data_collector = []
374 |     data_collector_web= []
375 |     
376 |     keywords = config['kv']
377 |     max_results = config['max_results']
378 |     publish_readme = config['publish_readme']
379 |     publish_gitpage = config['publish_gitpage']
380 |     publish_wechat = config['publish_wechat']
381 |     show_badge = config['show_badge']
382 | 
383 |     b_update = config['update_paper_links']
384 |     logging.info(f'Update Paper Link = {b_update}')
385 |     if config['update_paper_links'] == False:
386 |         logging.info(f"GET daily papers begin")
387 |         for topic, keyword in keywords.items():
388 |             logging.info(f"Keyword: {topic}")
389 |             data, data_web = get_daily_papers(topic, query = keyword,
390 |                                             max_results = max_results)
391 |             data_collector.append(data)
392 |             data_collector_web.append(data_web)
393 |             print("\n")
394 |         logging.info(f"GET daily papers end")
395 | 
396 |     # 1. update README.md file
397 |     if publish_readme:
398 |         json_file = config['json_readme_path']
399 |         md_file   = config['md_readme_path']
400 |         # update paper links
401 |         if config['update_paper_links']:
402 |             update_paper_links(json_file)
403 |         else:    
404 |             # update json data
405 |             update_json_file(json_file,data_collector)
406 |         # json data to markdown
407 |         json_to_md(json_file,md_file, task ='Update Readme', \
408 |             show_badge = show_badge)
409 | 
410 |     # 2. update docs/index.md file (to gitpage)
411 |     if publish_gitpage:
412 |         json_file = config['json_gitpage_path']
413 |         md_file   = config['md_gitpage_path']
414 |         # TODO: duplicated update paper links!!!
415 |         if config['update_paper_links']:
416 |             update_paper_links(json_file)
417 |         else:    
418 |             update_json_file(json_file,data_collector)
419 |         json_to_md(json_file, md_file, task ='Update GitPage', \
420 |             to_web = True, show_badge = show_badge, \
421 |             use_tc=False, use_b2t=False)
422 | 
423 |     # 3. Update docs/wechat.md file
424 |     if publish_wechat:
425 |         json_file = config['json_wechat_path']
426 |         md_file   = config['md_wechat_path']
427 |         # TODO: duplicated update paper links!!!
428 |         if config['update_paper_links']:
429 |             update_paper_links(json_file)
430 |         else:    
431 |             update_json_file(json_file, data_collector_web)
432 |         json_to_md(json_file, md_file, task ='Update Wechat', \
433 |             to_web=False, use_title= False, show_badge = show_badge)
434 | 
435 | if __name__ == "__main__":
436 |     parser = argparse.ArgumentParser()
437 |     parser.add_argument('--config_path',type=str, default='config.yaml',
438 |                             help='configuration file path')
439 |     parser.add_argument('--update_paper_links', default=False,
440 |                         action="store_true",help='whether to update paper links etc.')                        
441 |     args = parser.parse_args()
442 |     config = load_config(args.config_path)
443 |     config = {**config, 'update_paper_links':args.update_paper_links}
444 |     demo(**config)
445 | 
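
A minimal sketch of two string steps that `daily_arxiv.py` performs: assembling the arXiv query from the configured filters, and reducing a versioned id such as `2108.09112v1` to its base id. The helper names `build_query` and `strip_version` are hypothetical and do not appear in the script. Joining terms with ` OR ` surrounded by spaces is an assumption about the arXiv query syntax, not something the repository confirms.

```python
import re


def build_query(filters):
    """Join keyword filters into a single search string.

    Multi-word filters are wrapped in double quotes so they are treated
    as one phrase; single words are passed through unchanged.
    """
    terms = [f'"{f}"' if len(f.split()) > 1 else f for f in filters]
    # Assumption: the search backend expects ' OR ' with surrounding spaces.
    return " OR ".join(terms)


def strip_version(paper_id):
    """Remove a trailing version suffix ('v' followed by digits), if any."""
    return re.sub(r"v\d+$", "", paper_id)


if __name__ == "__main__":
    print(build_query(["TTS", "speech synthesis"]))
    print(strip_version("2108.09112v1"))
```

Anchoring the pattern with `$` means a `v` elsewhere in the id is left alone, which a plain `find('v')` would not guarantee.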


--------------------------------------------------------------------------------
/docs/index.md:
--------------------------------------------------------------------------------
   1 | ---
   2 | layout: default
   3 | ---
   4 | 
   5 | [![Contributors][contributors-shield]][contributors-url]
   6 | [![Forks][forks-shield]][forks-url]
   7 | [![Stargazers][stars-shield]][stars-url]
   8 | [![Issues][issues-shield]][issues-url]
   9 | 
  10 | ## Updated on 2025.12.22
  11 | > Usage instructions: [here](./docs/README.md#usage)
  12 | 
  13 | > This page is modified from [here](https://github.com/Vincentqyw/cv-arxiv-daily)
  14 | 
  15 | ## TTS
  16 | 
  17 | | Publish Date | Title | Authors | PDF | Code |
  18 | |:---------|:-----------------------|:---------|:------|:------|
  19 | |**2025-07-23**|**BoSS: Beyond-Semantic Speech**|Qing Wang et.al.|[2507.17563](http://arxiv.org/abs/2507.17563)|null|
  20 | |**2025-07-21**|**A2TTS: TTS for Low Resource Indian Languages**|Ayush Singh Bhadoriya et.al.|[2507.15272](http://arxiv.org/abs/2507.15272)|null|
  21 | |**2025-07-22**|**Hear Your Code Fail, Voice-Assisted Debugging for Python**|Sayed Mahbub Hasan Amiri et.al.|[2507.15007](http://arxiv.org/abs/2507.15007)|null|
  22 | |**2025-07-20**|**DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis**|Yinghao Aaron Li et.al.|[2507.14988](http://arxiv.org/abs/2507.14988)|null|
  23 | |**2025-07-17**|**NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech**|Maksim Borisov et.al.|[2507.13155](http://arxiv.org/abs/2507.13155)|null|
  24 | |**2025-07-17**|**Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication**|Tianyu Song et.al.|[2507.13052](http://arxiv.org/abs/2507.13052)|null|
  25 | |**2025-07-17**|**Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes**|Zhou Feng et.al.|[2507.12932](http://arxiv.org/abs/2507.12932)|null|
  26 | |**2025-07-16**|**Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations**|Yichen Han et.al.|[2507.12197](http://arxiv.org/abs/2507.12197)|null|
  27 | |**2025-07-16**|**EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis**|Haoxun Li et.al.|[2507.12015](http://arxiv.org/abs/2507.12015)|null|
  28 | |**2025-07-15**|**Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection**|Ivan Viakhirev et.al.|[2507.11777](http://arxiv.org/abs/2507.11777)|null|
  29 | |**2025-07-15**|**P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge**|Marvin Sach et.al.|[2507.11306](http://arxiv.org/abs/2507.11306)|null|
  30 | |**2025-07-14**|**Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition**|Mengzhe Geng et.al.|[2507.10827](http://arxiv.org/abs/2507.10827)|null|
  31 | |**2025-07-14**|**An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments**|Mikko Korkiakoski et.al.|[2507.10469](http://arxiv.org/abs/2507.10469)|null|
  32 | |**2025-07-12**|**ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching**|Han Zhu et.al.|[2507.09318](http://arxiv.org/abs/2507.09318)|null|
  33 | |**2025-07-12**|**Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning**|Dominika Woszczyk et.al.|[2507.09310](http://arxiv.org/abs/2507.09310)|null|
  34 | |**2025-07-12**|**ClaritySpeech: Dementia Obfuscation in Speech**|Dominika Woszczyk et.al.|[2507.09282](http://arxiv.org/abs/2507.09282)|null|
  35 | |**2025-07-11**|**Exploiting Leaderboards for Large-Scale Distribution of Malicious Models**|Anshuman Suri et.al.|[2507.08983](http://arxiv.org/abs/2507.08983)|null|
  36 | |**2025-07-11**|**Unlocking Speech Instruction Data Potential with Query Rewriting**|Yonghua Hei et.al.|[2507.08603](http://arxiv.org/abs/2507.08603)|null|
  37 | |**2025-07-11**|**MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling**|Jingjing Tang et.al.|[2507.08530](http://arxiv.org/abs/2507.08530)|null|
  38 | |**2025-07-11**|**Active Learning for Text-to-Speech Synthesis with Informative Sample Collection**|Kentaro Seki et.al.|[2507.08319](http://arxiv.org/abs/2507.08319)|null|
  39 | |**2025-07-10**|**SecureSpeech: Prompt-based Speaker and Content Protection**|Belinda Soh Hui Hui et.al.|[2507.07799](http://arxiv.org/abs/2507.07799)|null|
  40 | |**2025-07-09**|**Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents**|Zackary Rackauckas et.al.|[2507.06483](http://arxiv.org/abs/2507.06483)|null|
  41 | |**2025-07-08**|**Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis**|Xintong Hu et.al.|[2507.06116](http://arxiv.org/abs/2507.06116)|null|
  42 | |**2025-07-08**|**Differentiable Reward Optimization for LLM based TTS system**|Changfeng Gao et.al.|[2507.05911](http://arxiv.org/abs/2507.05911)|null|
  43 | |**2025-07-08**|**OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model**|Chen Wang et.al.|[2507.05177](http://arxiv.org/abs/2507.05177)|null|
  44 | |**2025-07-07**|**Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2507.04598](http://arxiv.org/abs/2507.04598)|null|
  45 | |**2025-07-06**|**TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet**|Jaeseok Jeong et.al.|[2507.04349](http://arxiv.org/abs/2507.04349)|null|
  46 | |**2025-07-05**|**PresentAgent: Multimodal Agent for Presentation Video Generation**|Jingwei Shi et.al.|[2507.04036](http://arxiv.org/abs/2507.04036)|null|
  47 | |**2025-07-05**|**Prosody Labeling with Phoneme-BERT and Speech Foundation Models**|Tomoki Koriyama et.al.|[2507.03912](http://arxiv.org/abs/2507.03912)|null|
  48 | |**2025-07-05**|**Traceable TTS: Toward Watermark-Free TTS with Strong Traceability**|Yuxiang Zhao et.al.|[2507.03887](http://arxiv.org/abs/2507.03887)|null|
  49 | |**2025-07-03**|**Open-Source System for Multilingual Translation and Cloned Speech Synthesis**|Mateo Cámara et.al.|[2507.02530](http://arxiv.org/abs/2507.02530)|null|
  50 | |**2025-07-03**|**JoyTTS: LLM-based Spoken Chatbot With Voice Cloning**|Fangru Zhou et.al.|[2507.02380](http://arxiv.org/abs/2507.02380)|null|
  51 | |**2025-07-02**|**A Dataset for Automatic Assessment of TTS Quality in Spanish**|Alejandro Sosa Welford et.al.|[2507.01805](http://arxiv.org/abs/2507.01805)|null|
  52 | |**2025-07-02**|**SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech**|Cheng Zhuangfei et.al.|[2507.01348](http://arxiv.org/abs/2507.01348)|null|
  53 | |**2025-07-02**|**Multi-interaction TTS toward professional recording reproduction**|Hiroki Kanagawa et.al.|[2507.00808](http://arxiv.org/abs/2507.00808)|null|
  54 | |**2025-06-30**|**Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis**|Paul Mayer et.al.|[2507.00227](http://arxiv.org/abs/2507.00227)|null|
  55 | |**2025-06-30**|**JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching**|Mingi Kwon et.al.|[2506.23552](http://arxiv.org/abs/2506.23552)|null|
  56 | |**2025-06-29**|**You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties**|Paige Tuttösí et.al.|[2506.23367](http://arxiv.org/abs/2506.23367)|null|
  57 | |**2025-06-23**|**IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech**|Siyi Zhou et.al.|[2506.21619](http://arxiv.org/abs/2506.21619)|null|
  58 | |**2025-06-25**|**An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS**|Marie Kunešová et.al.|[2506.20190](http://arxiv.org/abs/2506.20190)|null|
  59 | |**2025-06-24**|**TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems**|Christoph Minixhofer et.al.|[2506.19441](http://arxiv.org/abs/2506.19441)|null|
  60 | |**2025-06-23**|**Selecting N-lowest scores for training MOS prediction models**|Yuto Kondo et.al.|[2506.18326](http://arxiv.org/abs/2506.18326)|null|
  61 | |**2025-06-23**|**Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting**|Yuto Kondo et.al.|[2506.18307](http://arxiv.org/abs/2506.18307)|null|
  62 | |**2025-06-23**|**JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles**|Yuto Kondo et.al.|[2506.18296](http://arxiv.org/abs/2506.18296)|null|
  63 | |**2025-06-20**|**RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching**|Hyun Joon Park et.al.|[2506.16741](http://arxiv.org/abs/2506.16741)|null|
  64 | |**2025-06-20**|**LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization**|Daejin Jo et.al.|[2506.16738](http://arxiv.org/abs/2506.16738)|null|
  65 | |**2025-06-19**|**Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement**|Tuan-Nam Nguyen et.al.|[2506.16580](http://arxiv.org/abs/2506.16580)|null|
  66 | |**2025-06-19**|**InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems**|Kexin Huang et.al.|[2506.16381](http://arxiv.org/abs/2506.16381)|**[link](https://github.com/kexinhuang19/instructttseval)**|
  67 | |**2025-06-19**|**Optimizing Multilingual Text-To-Speech with Accents & Emotions**|Pranav Pawar et.al.|[2506.16310](http://arxiv.org/abs/2506.16310)|null|
  68 | |**2025-06-18**|**TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data**|Kentaro Seki et.al.|[2506.15614](http://arxiv.org/abs/2506.15614)|null|
  69 | |**2025-06-18**|**PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction**|Shufan Li et.al.|[2506.15556](http://arxiv.org/abs/2506.15556)|null|
  70 | |**2025-06-18**|**EmojiVoice: Towards long-term controllable expressivity in robot speech**|Paige Tuttösí et.al.|[2506.15085](http://arxiv.org/abs/2506.15085)|null|
  71 | |**2025-06-17**|**Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification**|Yiyang Zhao et.al.|[2506.14226](http://arxiv.org/abs/2506.14226)|null|
  72 | |**2025-06-16**|**EmoNews: A Spoken Dialogue System for Expressive News Conversations**|Ryuki Matsuura et.al.|[2506.13894](http://arxiv.org/abs/2506.13894)|null|
  73 | |**2025-06-16**|**ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching**|Han Zhu et.al.|[2506.13053](http://arxiv.org/abs/2506.13053)|null|
  74 | |**2025-06-14**|**StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling**|Hui Wang et.al.|[2506.12570](http://arxiv.org/abs/2506.12570)|null|
  75 | |**2025-06-14**|**Phonikud: Hebrew Grapheme-to-Phoneme Conversion for Real-Time Text-to-Speech**|Yakov Kolani et.al.|[2506.12311](http://arxiv.org/abs/2506.12311)|null|
  76 | |**2025-06-11**|**S2ST-Omni: An Efficient and Scalable Multilingual Speech-to-Speech Translation Framework via Seamlessly Speech-Text Alignment and Streaming Speech Decoder**|Yu Pan et.al.|[2506.11160](http://arxiv.org/abs/2506.11160)|null|
  77 | |**2025-06-16**|**A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data**|Cheng-Kang Chou et.al.|[2506.11130](http://arxiv.org/abs/2506.11130)|null|
  78 | |**2025-06-10**|**GUIRoboTron-Speech: Towards Automated GUI Agents Based on Speech Instructions**|Wenkang Han et.al.|[2506.11127](http://arxiv.org/abs/2506.11127)|null|
  79 | |**2025-06-10**|**ASRJam: Human-Friendly AI Speech Jamming to Prevent Automated Phone Scams**|Freddie Grabovski et.al.|[2506.11125](http://arxiv.org/abs/2506.11125)|null|
  80 | |**2025-06-12**|**Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs**|Hayato Futami et.al.|[2506.10299](http://arxiv.org/abs/2506.10299)|null|
  81 | |**2025-06-11**|**UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching**|Neta Glazer et.al.|[2506.09874](http://arxiv.org/abs/2506.09874)|null|
  82 | |**2025-06-15**|**EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection**|Christoph Schuhmann et.al.|[2506.09827](http://arxiv.org/abs/2506.09827)|null|
  83 | |**2025-06-11**|**Ming-Omni: A Unified Multimodal Model for Perception and Generation**|Inclusion AI et.al.|[2506.09344](http://arxiv.org/abs/2506.09344)|**[link](https://github.com/inclusionai/ming)**|
  84 | |**2025-06-10**|**A Review on Score-based Generative Models for Audio Applications**|Ge Zhu et.al.|[2506.08457](http://arxiv.org/abs/2506.08457)|null|
  85 | |**2025-06-09**|**Seeing Voices: Generating A-Roll Video from Audio with Mirage**|Aditi Sundararaman et.al.|[2506.08279](http://arxiv.org/abs/2506.08279)|null|
  86 | |**2025-06-09**|**Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation**|Rui Hu et.al.|[2506.07646](http://arxiv.org/abs/2506.07646)|null|
  87 | |**2025-06-07**|**SynHate: Detecting Hate Speech in Synthetic Deepfake Audio**|Rishabh Ranjan et.al.|[2506.06772](http://arxiv.org/abs/2506.06772)|null|
  88 | |**2025-06-09**|**Voice Impression Control in Zero-Shot TTS**|Keinichi Fujita et.al.|[2506.05688](http://arxiv.org/abs/2506.05688)|null|
  89 | |**2025-06-05**|**Grapheme-Coherent Phonemic and Prosodic Annotation of Speech by Implicit and Explicit Grapheme Conditioning**|Hien Ohnaka et.al.|[2506.04527](http://arxiv.org/abs/2506.04527)|null|
  90 | |**2025-06-04**|**Can we reconstruct a dysarthric voice with the large speech model Parler TTS?**|Ariadna Sanchez et.al.|[2506.04397](http://arxiv.org/abs/2506.04397)|null|
  91 | |**2025-06-04**|**HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset**|Ryan Langman et.al.|[2506.04152](http://arxiv.org/abs/2506.04152)|null|
  92 | |**2025-06-04**|**UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation**|Jinting Wang et.al.|[2506.04134](http://arxiv.org/abs/2506.04134)|null|
  93 | |**2025-06-04**|**A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions**|Chung-Chun Wang et.al.|[2506.04077](http://arxiv.org/abs/2506.04077)|null|
  94 | |**2025-06-04**|**Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages**|Utkarsh Pathak et.al.|[2506.03884](http://arxiv.org/abs/2506.03884)|null|
  95 | |**2025-06-04**|**Mark My Words: A Robust Multilingual Model for Punctuation in Text and Speech Transcripts**|Sidharth Pulipaka et.al.|[2506.03793](http://arxiv.org/abs/2506.03793)|null|
  96 | |**2025-06-04**|**BitTTS: Highly Compact Text-to-Speech Using 1.58-bit Quantization and Weight Indexing**|Masaya Kawamura et.al.|[2506.03515](http://arxiv.org/abs/2506.03515)|null|
  97 | |**2025-06-03**|**Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation**|Yongqi Wang et.al.|[2506.02997](http://arxiv.org/abs/2506.02997)|null|
  98 | |**2025-06-03**|**Towards a Japanese Full-duplex Spoken Dialogue System**|Atsumoto Ohashi et.al.|[2506.02979](http://arxiv.org/abs/2506.02979)|null|
  99 | |**2025-06-03**|**CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech**|Helin Wang et.al.|[2506.02863](http://arxiv.org/abs/2506.02863)|null|
 100 | |**2025-06-03**|**Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions**|Xiaoxue Gao et.al.|[2506.02742](http://arxiv.org/abs/2506.02742)|null|
 101 | |**2025-06-02**|**SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction**|Saurabh Agrawal et.al.|[2506.02082](http://arxiv.org/abs/2506.02082)|null|
 102 | |**2025-06-02**|**Zero-Shot Text-to-Speech for Vietnamese**|Thi Vu et.al.|[2506.01322](http://arxiv.org/abs/2506.01322)|null|
 103 | |**2025-06-02**|**CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction**|Yudong Lu et.al.|[2506.01268](http://arxiv.org/abs/2506.01268)|null|
 104 | |**2025-06-02**|**WCTC-Biasing: Retraining-free Contextual Biasing ASR with Wildcard CTC-based Keyword Spotting and Inter-layer Biasing**|Yu Nakagome et.al.|[2506.01263](http://arxiv.org/abs/2506.01263)|null|
 105 | |**2025-06-01**|**DS-TTS: Zero-Shot Speaker Style Adaptation from Voice Clips via Dynamic Dual-Style Feature Modulation**|Ming Meng et.al.|[2506.01020](http://arxiv.org/abs/2506.01020)|null|
 106 | |**2025-06-01**|**Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models**|Kyowoon Lee et.al.|[2506.00832](http://arxiv.org/abs/2506.00832)|null|
 107 | |**2025-05-30**|**Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation**|Wenrui Liu et.al.|[2505.24496](http://arxiv.org/abs/2505.24496)|null|
 108 | |**2025-05-30**|**DS-Codec: Dual-Stage Training with Mirror-to-NonMirror Architecture Switching for Speech Codec**|Peijie Chen et.al.|[2505.24314](http://arxiv.org/abs/2505.24314)|null|
 109 | |**2025-05-29**|**Can Emotion Fool Anti-spoofing?**|Aurosweta Mahapatra et.al.|[2505.23962](http://arxiv.org/abs/2505.23962)|null|
 110 | |**2025-05-29**|**Few-Shot Speech Deepfake Detection Adaptation with Gaussian Processes**|Neta Glazer et.al.|[2505.23619](http://arxiv.org/abs/2505.23619)|null|
 111 | |**2025-05-29**|**EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge**|Ruskin Raj Manku et.al.|[2505.23009](http://arxiv.org/abs/2505.23009)|**[link](https://github.com/boson-ai/emergenttts-eval-public)**|
 112 | |**2025-05-29**|**LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting**|Pai Zhu et.al.|[2505.22995](http://arxiv.org/abs/2505.22995)|null|
 113 | |**2025-05-28**|**Tell me Habibi, is it Real or Fake?**|Kartik Kuckreja et.al.|[2505.22581](http://arxiv.org/abs/2505.22581)|null|
 114 | |**2025-05-28**|**A Linguistically Motivated Analysis of Intonational Phrasing in Text-to-Speech Systems: Revealing Gaps in Syntactic Sensitivity**|Charlotte Pouw et.al.|[2505.22236](http://arxiv.org/abs/2505.22236)|null|
 115 | |**2025-05-27**|**Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech**|Nam-Gyu Kim et.al.|[2505.20868](http://arxiv.org/abs/2505.20868)|null|
 116 | |**2025-05-26**|**Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling**|Qixi Zheng et.al.|[2505.19931](http://arxiv.org/abs/2505.19931)|null|
 117 | |**2025-05-26**|**DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech**|Deok-Hyeon Cho et.al.|[2505.19687](http://arxiv.org/abs/2505.19687)|null|
 118 | |**2025-05-26**|**KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization**|Zhaolin Li et.al.|[2505.19679](http://arxiv.org/abs/2505.19679)|null|
 119 | |**2025-05-26**|**Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling**|Haiyang Sun et.al.|[2505.19669](http://arxiv.org/abs/2505.19669)|null|
 120 | |**2025-05-26**|**Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment**|Jeongsoo Choi et.al.|[2505.19595](http://arxiv.org/abs/2505.19595)|**[link](https://github.com/zhikangniu/a-dma)**|
 121 | |**2025-05-25**|**SpeakStream: Streaming Text-to-Speech with Interleaved Data**|Richard He Bai et.al.|[2505.19206](http://arxiv.org/abs/2505.19206)|null|
 122 | |**2025-05-25**|**CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning**|Renyuan Li et.al.|[2505.19119](http://arxiv.org/abs/2505.19119)|null|
 123 | |**2025-05-25**|**Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis**|Minsu Kim et.al.|[2505.18972](http://arxiv.org/abs/2505.18972)|null|
 124 | |**2025-05-27**|**RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations**|Ashwin Sankar et.al.|[2505.18609](http://arxiv.org/abs/2505.18609)|null|
 125 | |**2025-05-24**|**MPE-TTS: Customized Emotion Zero-Shot Text-To-Speech Using Multi-Modal Prompt**|Zhichao Wu et.al.|[2505.18453](http://arxiv.org/abs/2505.18453)|null|
 126 | |**2025-05-23**|**What You Read Isn't What You Hear: Linguistic Sensitivity in Deepfake Speech Detection**|Binh Nguyen et.al.|[2505.17513](http://arxiv.org/abs/2505.17513)|null|
 127 | |**2025-05-23**|**UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information**|Rui Wang et.al.|[2505.17426](http://arxiv.org/abs/2505.17426)|null|
 128 | |**2025-05-23**|**Speechless: Speech Instruction Training Without Speech for Low Resource Languages**|Alan Dao et.al.|[2505.17417](http://arxiv.org/abs/2505.17417)|null|
 129 | |**2025-05-22**|**Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2**|Zackary Rackauckas et.al.|[2505.17320](http://arxiv.org/abs/2505.17320)|null|
 130 | |**2025-05-21**|**Voicing Personas: Rewriting Persona Descriptions into Style Prompts for Controllable Text-to-Speech**|Yejin Lee et.al.|[2505.17093](http://arxiv.org/abs/2505.17093)|null|
 131 | |**2025-05-22**|**From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition**|Tianduo Wang et.al.|[2505.16972](http://arxiv.org/abs/2505.16972)|**[link](https://github.com/tianduowang/speech-bt)**|
 132 | |**2025-05-21**|**MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling**|Yifan Cheng et.al.|[2505.15772](http://arxiv.org/abs/2505.15772)|null|
 133 | |**2025-05-21**|**Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information**|Nicholas Sanders et.al.|[2505.15667](http://arxiv.org/abs/2505.15667)|null|
 134 | |**2025-05-21**|**Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models**|Zirui Song et.al.|[2505.15406](http://arxiv.org/abs/2505.15406)|**[link](https://github.com/mbzuai-nlp/audiojailbreak)**|
 135 | |**2025-05-20**|**FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation**|Yutong Liu et.al.|[2505.14351](http://arxiv.org/abs/2505.14351)|null|
 136 | |**2025-05-21**|**AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models**|Guangke Chen et.al.|[2505.14103](http://arxiv.org/abs/2505.14103)|null|
 137 | |**2025-05-20**|**SeamlessEdit: Background Noise Aware Zero-Shot Speech Editing with in-Context Enhancement**|Kuan-Yu Chen et.al.|[2505.14066](http://arxiv.org/abs/2505.14066)|null|
 138 | |**2025-05-22**|**Improving Noise Robustness of LLM-based Zero-shot TTS via Discrete Acoustic Token Denoising**|Ye-Xin Lu et.al.|[2505.13830](http://arxiv.org/abs/2505.13830)|null|
 139 | |**2025-05-19**|**OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching**|Hieu-Nghia Huynh-Nguyen et.al.|[2505.12800](http://arxiv.org/abs/2505.12800)|null|
 140 | |**2025-05-18**|**Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis**|Dong Yang et.al.|[2505.12226](http://arxiv.org/abs/2505.12226)|null|
 141 | |**2025-05-16**|**Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese**|Xihuai Wang et.al.|[2505.11200](http://arxiv.org/abs/2505.11200)|null|
 142 | |**2025-05-16**|**BanglaFake: Constructing and Evaluating a Specialized Bengali Deepfake Audio Dataset**|Istiaq Ahmed Fahad et.al.|[2505.10885](http://arxiv.org/abs/2505.10885)|**[link](https://github.com/KamruzzamanAsif/BanglaFake)**|
 143 | |**2025-05-15**|**UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech**|Jiaxuan Liu et.al.|[2505.10599](http://arxiv.org/abs/2505.10599)|null|
 144 | |**2025-05-12**|**MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder**|Bowen Zhang et.al.|[2505.07916](http://arxiv.org/abs/2505.07916)|null|
 145 | |**2025-05-12**|**Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications**|Biel Tura Vecino et.al.|[2505.07701](http://arxiv.org/abs/2505.07701)|null|
 146 | |**2025-05-10**|**VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback**|Eason Chen et.al.|[2505.06676](http://arxiv.org/abs/2505.06676)|null|
 147 | |**2025-05-10**|**Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation**|Abbas Bertina et.al.|[2505.06599](http://arxiv.org/abs/2505.06599)|null|
 148 | |**2025-05-15**|**FlexSpeech: Towards Stable, Controllable and Expressive Text-to-Speech**|Linhan Ma et.al.|[2505.05159](http://arxiv.org/abs/2505.05159)|null|
 149 | |**2025-05-08**|**Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations**|Linrong Pan et.al.|[2505.05056](http://arxiv.org/abs/2505.05056)|null|
 150 | |**2025-05-08**|**A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration**|Shaja Arul Selvamani et.al.|[2505.04885](http://arxiv.org/abs/2505.04885)|null|
 151 | |**2025-05-07**|**Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment**|Xueyao Zhang et.al.|[2505.04113](http://arxiv.org/abs/2505.04113)|null|
 152 | |**2025-05-06**|**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**|Zuwei Long et.al.|[2505.03739](http://arxiv.org/abs/2505.03739)|**[link](https://github.com/vita-mllm/vita-audio)**|
 153 | |**2025-05-05**|**Generating Narrated Lecture Videos from Slides with Synchronized Highlights**|Alexander Holmberg et.al.|[2505.02966](http://arxiv.org/abs/2505.02966)|null|
 154 | |**2025-05-05**|**Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play**|Yemin Shi et.al.|[2505.02707](http://arxiv.org/abs/2505.02707)|**[link](https://github.com/maitrix-org/voila)**|
 155 | |**2025-04-30**|**Sadeed: Advancing Arabic Diacritization Through Small Language Model**|Zeina Aldallal et.al.|[2504.21635](http://arxiv.org/abs/2504.21635)|null|
 156 | |**2025-04-29**|**ClonEval: An Open Voice Cloning Benchmark**|Iwona Christop et.al.|[2504.20581](http://arxiv.org/abs/2504.20581)|null|
 157 | |**2025-05-02**|**Towards Flow-Matching-based TTS without Classifier-Free Guidance**|Yuzhe Liang et.al.|[2504.20334](http://arxiv.org/abs/2504.20334)|null|
 158 | |**2025-04-27**|**Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget**|Xin Li et.al.|[2504.19146](http://arxiv.org/abs/2504.19146)|**[link](https://github.com/MYZY-AI/Muyan-TTS)**|
 159 | |**2025-04-22**|**A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models**|Gengxian Cao et.al.|[2504.15552](http://arxiv.org/abs/2504.15552)|null|
 160 | |**2025-04-18**|**ChatNekoHacker: Real-Time Fan Engagement with Conversational Agents**|Takuya Sera et.al.|[2504.13793](http://arxiv.org/abs/2504.13793)|null|
 161 | |**2025-04-22**|**EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting**|Guanrou Yang et.al.|[2504.12867](http://arxiv.org/abs/2504.12867)|null|
 162 | |**2025-04-15**|**GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture**|Yaodong Song et.al.|[2504.12339](http://arxiv.org/abs/2504.12339)|null|
 163 | |**2025-04-15**|**Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation**|Yan Rong et.al.|[2504.11002](http://arxiv.org/abs/2504.11002)|null|
 164 | |**2025-04-15**|**Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy**|Botao Zhao et.al.|[2504.10819](http://arxiv.org/abs/2504.10819)|null|
 165 | |**2025-04-14**|**Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis**|Yifan Yang et.al.|[2504.10352](http://arxiv.org/abs/2504.10352)|null|
 166 | |**2025-04-14**|**AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis**|Dan Luo et.al.|[2504.10309](http://arxiv.org/abs/2504.10309)|null|
 167 | |**2025-04-11**|**Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation**|Haowei Lou et.al.|[2504.08274](http://arxiv.org/abs/2504.08274)|null|
 168 | |**2025-04-10**|**Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis**|Yizhong Geng et.al.|[2504.07858](http://arxiv.org/abs/2504.07858)|null|
 169 | |**2025-04-10**|**SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow**|Kaidi Wang et.al.|[2504.07776](http://arxiv.org/abs/2504.07776)|null|
 170 | |**2025-04-08**|**AVENet: Disentangling Features by Approximating Average Features for Voice Conversion**|Wenyu Wang et.al.|[2504.05833](http://arxiv.org/abs/2504.05833)|null|
 171 | |**2025-04-07**|**SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation**|Stephen Brade et.al.|[2504.05106](http://arxiv.org/abs/2504.05106)|null|
 172 | |**2025-04-04**|**RWKVTTS: Yet another TTS based on RWKV-7**|Lin yueyu et.al.|[2504.03289](http://arxiv.org/abs/2504.03289)|**[link](https://github.com/yynil/rwkvtts)**|
 173 | |**2025-04-09**|**F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization**|Xiaohui Sun et.al.|[2504.02407](http://arxiv.org/abs/2504.02407)|null|
 174 | |**2025-04-02**|**TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection**|Zhiming Ma et.al.|[2503.24115](http://arxiv.org/abs/2503.24115)|**[link](https://github.com/jimmyma99/teleantifraud)**|
 175 | |**2025-03-30**|**Speculative End-Turn Detector for Efficient Speech Chatbot Assistant**|Hyunjong Ok et.al.|[2503.23439](http://arxiv.org/abs/2503.23439)|null|
 176 | |**2025-03-29**|**SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System**|Hyeongju Kim et.al.|[2503.23108](http://arxiv.org/abs/2503.23108)|null|
 177 | |**2025-03-26**|**Dual Audio-Centric Modality Coupling for Talking Head Generation**|Ao Fu et.al.|[2503.22728](http://arxiv.org/abs/2503.22728)|null|
 178 | |**2025-03-28**|**DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation**|Haomin Zhang et.al.|[2503.22265](http://arxiv.org/abs/2503.22265)|null|
 179 | |**2025-03-26**|**Text-Driven Voice Conversion via Latent State-Space Modeling**|Wen Li et.al.|[2503.20999](http://arxiv.org/abs/2503.20999)|null|
 180 | |**2025-03-28**|**FireRedTTS-1S: An Upgraded Streamable Foundation Text-to-Speech System**|Hao-Han Guo et.al.|[2503.20499](http://arxiv.org/abs/2503.20499)|null|
 181 | |**2025-03-21**|**Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication**|Yiwen Xu et.al.|[2503.17479](http://arxiv.org/abs/2503.17479)|null|
 182 | |**2025-03-10**|**VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection**|Kunal Chavan et.al.|[2503.16488](http://arxiv.org/abs/2503.16488)|null|
 183 | |**2025-03-19**|**MoonCast: High-Quality Zero-Shot Podcast Generation**|Zeqian Ju et.al.|[2503.14345](http://arxiv.org/abs/2503.14345)|**[link](https://github.com/jzq2000/mooncast)**|
 184 | |**2025-03-14**|**MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation**|Sungwoo Cho et.al.|[2503.11026](http://arxiv.org/abs/2503.11026)|null|
 185 | |**2025-03-11**|**An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR**|Sewade Ogun et.al.|[2503.08954](http://arxiv.org/abs/2503.08954)|null|
 186 | |**2025-03-03**|**Direct Speech to Speech Translation: A Review**|Mohammad Sarim et.al.|[2503.04799](http://arxiv.org/abs/2503.04799)|null|
 187 | |**2025-03-06**|**LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM**|Sambal Shikhar et.al.|[2503.04724](http://arxiv.org/abs/2503.04724)|null|
 188 | |**2025-03-06**|**Scaling Rich Style-Prompted Text-to-Speech Datasets**|Anuj Diwan et.al.|[2503.04713](http://arxiv.org/abs/2503.04713)|**[link](https://github.com/ajd12342/paraspeechcaps)**|
 189 | |**2025-03-04**|**InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training**|Dingdong Wang et.al.|[2503.02769](http://arxiv.org/abs/2503.02769)|null|
 190 | |**2025-03-03**|**Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens**|Xinsheng Wang et.al.|[2503.01710](http://arxiv.org/abs/2503.01710)|**[link](https://github.com/sparkaudio/spark-tts)**|
 191 | |**2025-03-02**|**UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation**|Alexander H. Liu et.al.|[2503.00733](http://arxiv.org/abs/2503.00733)|null|
 192 | |**2025-03-12**|**Telephone Surveys Meet Conversational AI: Evaluating a LLM-Based Telephone Survey System at Scale**|Max M. Lang et.al.|[2502.20140](http://arxiv.org/abs/2502.20140)|null|
 193 | |**2025-02-26**|**Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis**|Ziyue Jiang et.al.|[2502.18924](http://arxiv.org/abs/2502.18924)|null|
 194 | |**2025-03-08**|**Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding**|Tianyun Liu et.al.|[2502.18889](http://arxiv.org/abs/2502.18889)|null|
 195 | |**2025-02-24**|**Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM**|Jiatong Shi et.al.|[2502.16897](http://arxiv.org/abs/2502.16897)|null|
 196 | |**2025-02-17**|**NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing**|Yifan Liang et.al.|[2502.12002](http://arxiv.org/abs/2502.12002)|null|
 197 | |**2025-02-16**|**SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer**|Zhengyan Sheng et.al.|[2502.11094](http://arxiv.org/abs/2502.11094)|null|
 198 | |**2025-02-14**|**VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect**|Qingyuan Fei et.al.|[2502.10329](http://arxiv.org/abs/2502.10329)|null|
 199 | |**2025-02-13**|**TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument**|Kyungsu Kim et.al.|[2502.08939](http://arxiv.org/abs/2502.08939)|**[link](https://github.com/kyungsukim42/tokensynth)**|
 200 | |**2025-03-02**|**ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech**|Xin Wang et.al.|[2502.08857](http://arxiv.org/abs/2502.08857)|null|
 201 | |**2025-02-11**|**LoRP-TTS: Low-Rank Personalized Text-To-Speech**|Łukasz Bondaruk et.al.|[2502.07562](http://arxiv.org/abs/2502.07562)|null|
 202 | |**2025-02-11**|**Advanced Zero-Shot Text-to-Speech for Background Removal and Preservation with Controllable Masked Speech Prediction**|Leying Zhang et.al.|[2502.07345](http://arxiv.org/abs/2502.07345)|null|
 203 | |**2025-02-11**|**Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement**|Xueyao Zhang et.al.|[2502.07243](http://arxiv.org/abs/2502.07243)|null|
 204 | |**2025-02-10**|**Synthetic Audio Helps for Cognitive State Tasks**|Adil Soubki et.al.|[2502.06922](http://arxiv.org/abs/2502.06922)|**[link](https://github.com/adil-soubki/sad-training)**|
 205 | |**2025-02-19**|**Speech to Speech Translation with Translatotron: A State of the Art Review**|Jules R. Kala et.al.|[2502.05980](http://arxiv.org/abs/2502.05980)|null|
 206 | |**2025-02-09**|**BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting**|Mohammad Jahid Ibna Basher et.al.|[2502.05729](http://arxiv.org/abs/2502.05729)|null|
 207 | |**2025-02-08**|**Gender Bias in Instruction-Guided Speech Synthesis Models**|Chun-Yi Kuan et.al.|[2502.05649](http://arxiv.org/abs/2502.05649)|null|
 208 | |**2025-02-08**|**IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System**|Wei Deng et.al.|[2502.05512](http://arxiv.org/abs/2502.05512)|null|
 209 | |**2025-02-05**|**Metis: A Foundation Speech Generation Model with Masked Generative Pre-training**|Yuancheng Wang et.al.|[2502.03128](http://arxiv.org/abs/2502.03128)|**[link](https://github.com/open-mmlab/amphion)**|
 210 | |**2025-02-05**|**Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech**|Jixun Yao et.al.|[2502.02950](http://arxiv.org/abs/2502.02950)|null|
 211 | |**2025-02-04**|**Developing multilingual speech synthesis system for Ojibwe, Mi'kmaq, and Maliseet**|Shenran Wang et.al.|[2502.02703](http://arxiv.org/abs/2502.02703)|null|
 212 | |**2025-02-04**|**Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation**|Peidong Wang et.al.|[2502.02683](http://arxiv.org/abs/2502.02683)|null|
 213 | |**2025-02-02**|**EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis**|Junuk Cha et.al.|[2502.00654](http://arxiv.org/abs/2502.00654)|null|
 214 | |**2025-01-31**|**VisualSpeech: Enhance Prosody with Visual Context in TTS**|Shumin Que et.al.|[2501.19258](http://arxiv.org/abs/2501.19258)|null|
 215 | |**2025-01-29**|**BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights**|Chan-Jan Hsu et.al.|[2501.17790](http://arxiv.org/abs/2501.17790)|null|
 216 | |**2025-01-28**|**Compact Neural TTS Voices for Accessibility**|Kunal Jain et.al.|[2501.17332](http://arxiv.org/abs/2501.17332)|null|
 217 | |**2025-01-26**|**Overview of the Amphion Toolkit (v0.2)**|Jiaqi Li et.al.|[2501.15442](http://arxiv.org/abs/2501.15442)|**[link](https://github.com/open-mmlab/amphion)**|
 218 | |**2025-01-24**|**Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models**|Tianrui Wang et.al.|[2501.14273](http://arxiv.org/abs/2501.14273)|null|
 219 | |**2025-01-24**|**Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation**|Wen Huang et.al.|[2501.14240](http://arxiv.org/abs/2501.14240)|null|
 220 | |**2025-01-24**|**LoCoML: A Framework for Real-World ML Inference Pipelines**|Kritin Maddireddy et.al.|[2501.14165](http://arxiv.org/abs/2501.14165)|null|
 221 | |**2025-01-23**|**Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement**|Jae-Sung Bae et.al.|[2501.13372](http://arxiv.org/abs/2501.13372)|null|
 222 | |**2025-01-21**|**A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data**|Minh Tran et.al.|[2501.12501](http://arxiv.org/abs/2501.12501)|null|
 223 | |**2025-01-15**|**Speech Synthesis along Perceptual Voice Quality Dimensions**|Frederik Rautenberg et.al.|[2501.08791](http://arxiv.org/abs/2501.08791)|null|
 224 | |**2025-01-15**|**Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification**|Li Zhang et.al.|[2501.08691](http://arxiv.org/abs/2501.08691)|null|
 225 | |**2025-01-15**|**Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement**|Qianniu Chen et.al.|[2501.08566](http://arxiv.org/abs/2501.08566)|null|
 226 | |**2025-01-19**|**MathReader : Text-to-Speech for Mathematical Documents**|Sieun Hyeon et.al.|[2501.07088](http://arxiv.org/abs/2501.07088)|**[link](https://github.com/hyeonsieun/mathreader)**|
 227 | |**2025-01-10**|**TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer**|Vladimir Bataev et.al.|[2501.06320](http://arxiv.org/abs/2501.06320)|null|
 228 | |**2025-01-10**|**MinMo: A Multimodal Large Language Model for Seamless Voice Interaction**|Qian Chen et.al.|[2501.06282](http://arxiv.org/abs/2501.06282)|null|
 229 | |**2025-01-10**|**PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control**|Shaozuo Zhang et.al.|[2501.06276](http://arxiv.org/abs/2501.06276)|null|
 230 | |**2025-01-10**|**Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron**|Kishor Kayyar Lakshminarayana et.al.|[2501.05976](http://arxiv.org/abs/2501.05976)|null|
 231 | |**2025-01-10**|**MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model**|Matthew Baas et.al.|[2501.05787](http://arxiv.org/abs/2501.05787)|null|
 232 | |**2025-01-09**|**Probing Speaker-specific Features in Speaker Representations**|Aemon Yat Fei Chiu et.al.|[2501.05310](http://arxiv.org/abs/2501.05310)|null|
 233 | |**2025-01-08**|**Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model**|Sanjana Sankar et.al.|[2501.04799](http://arxiv.org/abs/2501.04799)|null|
 234 | |**2025-01-08**|**DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions**|Weidong Chen et.al.|[2501.04256](http://arxiv.org/abs/2501.04256)|null|
 235 | |**2025-01-02**|**FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles**|Tian-Hao Zhang et.al.|[2501.03181](http://arxiv.org/abs/2501.03181)|null|
 236 | |**2025-01-02**|**RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer**|Seongho Hong et.al.|[2501.01182](http://arxiv.org/abs/2501.01182)|**[link](https://github.com/seongho608/ringformer)**|
 237 | |**2025-01-02**|**Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT**|Dongyang Dai et.al.|[2501.01102](http://arxiv.org/abs/2501.01102)|null|
 238 | |**2025-01-06**|**Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study**|Mykola Maslych et.al.|[2501.00168](http://arxiv.org/abs/2501.00168)|null|
 239 | |**2024-12-28**|**Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting**|Wooseok Han et.al.|[2412.20155](http://arxiv.org/abs/2412.20155)|null|
 240 | |**2024-12-26**|**"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities**|Jiawei Yu et.al.|[2412.19102](http://arxiv.org/abs/2412.19102)|null|
 241 | |**2024-12-26**|**Indonesian-English Code-Switching Speech Synthesizer Utilizing Multilingual STEN-TTS and Bert LID**|Ahmad Alfani Handoyo et.al.|[2412.19043](http://arxiv.org/abs/2412.19043)|null|
 242 | |**2024-12-25**|**Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset**|Neil Shah et.al.|[2412.18839](http://arxiv.org/abs/2412.18839)|null|
 243 | |**2024-12-24**|**GenPod: Constructive News Framing in AI-Generated Podcasts More Effectively Reduces Negative Emotions Than Non-Constructive Framing**|Wen Ku et.al.|[2412.18300](http://arxiv.org/abs/2412.18300)|null|
 244 | |**2024-12-22**|**Why Do Speech Language Models Fail to Generate Semantically Coherent Outputs? A Modality Evolving Perspective**|Hankun Wang et.al.|[2412.17048](http://arxiv.org/abs/2412.17048)|null|
 245 | |**2024-12-22**|**Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis**|Ye-Xin Lu et.al.|[2412.16977](http://arxiv.org/abs/2412.16977)|null|
 246 | |**2024-12-22**|**Autoregressive Speech Synthesis with Next-Distribution Prediction**|Xinfa Zhu et.al.|[2412.16846](http://arxiv.org/abs/2412.16846)|null|
 247 | |**2024-12-23**|**Interleaved Speech-Text Language Models are Simple Streaming Text to Speech Synthesizers**|Yifan Yang et.al.|[2412.16102](http://arxiv.org/abs/2412.16102)|null|
 248 | |**2024-12-19**|**Scale This, Not That: Investigating Key Dataset Attributes for Efficient Speech Enhancement Scaling**|Leying Zhang et.al.|[2412.14890](http://arxiv.org/abs/2412.14890)|null|
 249 | |**2024-12-17**|**Synthetic Speech Classification: IEEE Signal Processing Cup 2022 challenge**|Mahieyin Rahmun et.al.|[2412.13279](http://arxiv.org/abs/2412.13279)|**[link](https://github.com/AGenCyLab/SPCUP2022)**|
 250 | |**2024-12-17**|**Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion**|Syed Zohaib Hassan et.al.|[2412.12710](http://arxiv.org/abs/2412.12710)|null|
 251 | |**2024-12-17**|**Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes**|Kuiyuan Zhang et.al.|[2412.12619](http://arxiv.org/abs/2412.12619)|null|
 252 | |**2024-12-17**|**Hierarchical Control of Emotion Rendering in Speech Synthesis**|Sho Inoue et.al.|[2412.12498](http://arxiv.org/abs/2412.12498)|**[link](https://github.com/shinshoji01/hed-project-page)**|
 253 | |**2024-12-19**|**ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis**|Xiangheng He et.al.|[2412.11795](http://arxiv.org/abs/2412.11795)|null|
 254 | |**2024-12-17**|**Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech**|Rui Liu et.al.|[2412.11409](http://arxiv.org/abs/2412.11409)|**[link](https://github.com/ai-s2-lab/m2se-vtts)**|
 255 | |**2024-12-16**|**Efficient Generative Modeling with Residual Vector Quantization-Based Tokens**|Jaehyeon Kim et.al.|[2412.10208](http://arxiv.org/abs/2412.10208)|null|
 256 | |**2024-12-13**|**AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation**|Xiyuan Gao et.al.|[2412.10103](http://arxiv.org/abs/2412.10103)|null|
 257 | |**2024-12-13**|**CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder**|Jianwei Cui et.al.|[2412.08918](http://arxiv.org/abs/2412.08918)|null|
 258 | |**2024-12-11**|**Multimodal Latent Language Modeling with Next-Token Diffusion**|Yutao Sun et.al.|[2412.08635](http://arxiv.org/abs/2412.08635)|**[link](https://github.com/microsoft/unilm/tree/master/LatentLM)**|
 259 | |**2024-12-11**|**A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction**|Sowmya Cheripally et.al.|[2412.08312](http://arxiv.org/abs/2412.08312)|null|
 260 | |**2024-12-11**|**A Preliminary Analysis of Automatic Word and Syllable Prominence Detection in Non-Native Speech With Text-to-Speech Prosody Embeddings**|Anindita Mondal et.al.|[2412.08283](http://arxiv.org/abs/2412.08283)|null|
 261 | |**2024-12-11**|**LatentSpeech: Latent Diffusion for Text-To-Speech Generation**|Haowei Lou et.al.|[2412.08117](http://arxiv.org/abs/2412.08117)|null|
 262 | |**2024-12-11**|**Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration**|Haowei Lou et.al.|[2412.08112](http://arxiv.org/abs/2412.08112)|null|
 263 | |**2024-12-09**|**Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey**|Tianxin Xie et.al.|[2412.06602](http://arxiv.org/abs/2412.06602)|**[link](https://github.com/imxtx/awesome-controllabe-speech-synthesis)**|
 264 | |**2024-12-12**|**EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations**|Weizhen Bian et.al.|[2412.06581](http://arxiv.org/abs/2412.06581)|null|
 265 | |**2024-12-01**|**Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor**|Ashwin Baluja et.al.|[2412.05315](http://arxiv.org/abs/2412.05315)|null|
 266 | |**2024-12-04**|**DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles**|Jiaxuan Liu et.al.|[2412.03388](http://arxiv.org/abs/2412.03388)|null|
 267 | |**2024-12-03**|**GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot**|Aohan Zeng et.al.|[2412.02612](http://arxiv.org/abs/2412.02612)|**[link](https://github.com/thudm/glm-4-voice)**|
 268 | |**2024-11-19**|**A Context-Based Numerical Format Prediction for a Text-To-Speech System**|Yaser Darwesh et.al.|[2412.00028](http://arxiv.org/abs/2412.00028)|null|
 269 | |**2024-11-27**|**Continual Learning in Machine Speech Chain Using Gradient Episodic Memory**|Geoffrey Tyndall et.al.|[2411.18320](http://arxiv.org/abs/2411.18320)|null|
 270 | |**2024-11-27**|**SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation**|Wenyi Yu et.al.|[2411.18138](http://arxiv.org/abs/2411.18138)|null|
 271 | |**2024-11-26**|**WavChat: A Survey of Spoken Dialogue Models**|Shengpeng Ji et.al.|[2411.13577](http://arxiv.org/abs/2411.13577)|**[link](https://github.com/jishengpeng/wavchat)**|
 272 | |**2024-12-02**|**I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception**|Jiawei Zhang et.al.|[2411.13314](http://arxiv.org/abs/2411.13314)|null|
 273 | |**2024-11-20**|**Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM**|Jiawei Yu et.al.|[2411.13159](http://arxiv.org/abs/2411.13159)|null|
 274 | |**2024-11-19**|**Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation**|Praveen Srinivasa Varadhan et.al.|[2411.12719](http://arxiv.org/abs/2411.12719)|null|
 275 | |**2024-11-19**|**Leveraging Virtual Reality and AI Tutoring for Language Learning: A Case Study of a Virtual Campus Environment with OpenAI GPT Integration with Unity 3D**|Adithya TG et.al.|[2411.12619](http://arxiv.org/abs/2411.12619)|null|
 276 | |**2024-11-18**|**ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram**|Xiao-Hang Jiang et.al.|[2411.11258](http://arxiv.org/abs/2411.11258)|null|
 277 | |**2024-11-12**|**Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models**|Dongrui Han et.al.|[2411.07563](http://arxiv.org/abs/2411.07563)|null|
 278 | |**2024-11-11**|**Enhancing Accessibility in Special Libraries: A Study on AI-Powered Assistive Technologies for Patrons with Disabilities**|Snehasish Paul Shivali Chauhan et.al.|[2411.06970](http://arxiv.org/abs/2411.06970)|null|
 279 | |**2024-11-10**|**Debatts: Zero-Shot Debating Text-to-Speech Synthesis**|Yiqiao Huang et.al.|[2411.06540](http://arxiv.org/abs/2411.06540)|null|
 280 | |**2024-11-07**|**CUIfy the XR: An Open-Source Package to Embed LLM-powered Conversational Agents in XR**|Kadir Burak Buldu et.al.|[2411.04671](http://arxiv.org/abs/2411.04671)|null|
 281 | |**2024-11-04**|**EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector**|Deok-Hyeon Cho et.al.|[2411.02625](http://arxiv.org/abs/2411.02625)|**[link](https://github.com/Choddeok/EmoSpherepp)**|
 282 | |**2024-11-09**|**Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis**|Shijia Liao et.al.|[2411.01156](http://arxiv.org/abs/2411.01156)|**[link](https://github.com/fishaudio/fish-speech)**|
 283 | |**2024-10-31**|**Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?**|Ioannis Tsiamas et.al.|[2410.24019](http://arxiv.org/abs/2410.24019)|null|
 284 | |**2024-10-30**|**Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis**|Théodor Lemerle et.al.|[2410.23320](http://arxiv.org/abs/2410.23320)|**[link](https://github.com/theodorblackbird/lina-speech)**|
 285 | |**2024-10-29**|**Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech**|Eric Battenberg et.al.|[2410.22179](http://arxiv.org/abs/2410.22179)|null|
 286 | |**2024-10-29**|**Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding**|Bohan Li et.al.|[2410.21951](http://arxiv.org/abs/2410.21951)|null|
 287 | |**2024-10-29**|**RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis**|Kehan Sui et.al.|[2410.21641](http://arxiv.org/abs/2410.21641)|null|
 288 | |**2024-10-28**|**Asynchronous Tool Usage for Real-Time Agents**|Antonio A. Ginart et.al.|[2410.21620](http://arxiv.org/abs/2410.21620)|null|
 289 | |**2024-10-28**|**Enhancing TTS Stability in Hebrew using Discrete Semantic Units**|Ella Zeldes et.al.|[2410.21502](http://arxiv.org/abs/2410.21502)|null|
 290 | |**2024-10-28**|**Mitigating Unauthorized Speech Synthesis for Voice Protection**|Zhisheng Zhang et.al.|[2410.20742](http://arxiv.org/abs/2410.20742)|**[link](https://github.com/wxzyd123/pivotal_objective_perturbation)**|
 291 | |**2024-10-27**|**Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation**|Maohao Shen et.al.|[2410.20336](http://arxiv.org/abs/2410.20336)|null|
 292 | |**2024-10-24**|**Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis**|Suparna De et.al.|[2410.19199](http://arxiv.org/abs/2410.19199)|null|
 293 | |**2024-10-24**|**STTATTS: Unified Speech-To-Text And Text-To-Speech Model**|Hawau Olamide Toyin et.al.|[2410.18607](http://arxiv.org/abs/2410.18607)|**[link](https://github.com/mbzuai-nlp/sttatts)**|
 294 | |**2024-10-24**|**Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts**|ChaeHun Park et.al.|[2410.18444](http://arxiv.org/abs/2410.18444)|null|
 295 | |**2024-10-23**|**ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams**|Srija Anand et.al.|[2410.17901](http://arxiv.org/abs/2410.17901)|null|
 296 | |**2024-10-22**|**Continuous Speech Tokenizer in Text To Speech**|Yixing Li et.al.|[2410.17081](http://arxiv.org/abs/2410.17081)|null|
 297 | |**2024-10-22**|**Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap**|Guanrou Yang et.al.|[2410.16726](http://arxiv.org/abs/2410.16726)|null|
 298 | |**2024-10-21**|**Continuous Speech Synthesis using per-token Latent Diffusion**|Arnon Turetzky et.al.|[2410.16048](http://arxiv.org/abs/2410.16048)|null|
 299 | |**2024-10-18**|**A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages**|Sujitha Sathiyamoorthy et.al.|[2410.14197](http://arxiv.org/abs/2410.14197)|null|
 300 | |**2024-10-18**|**Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech**|Shuwei He et.al.|[2410.14101](http://arxiv.org/abs/2410.14101)|**[link](https://github.com/ms2ku-vtts/ms2ku-vtts)**|
 301 | |**2024-10-17**|**Enhancing Crowdsourced Audio for Text-to-Speech Models**|José Giraldo et.al.|[2410.13357](http://arxiv.org/abs/2410.13357)|null|
 302 | |**2024-10-17**|**DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech**|Jan Melechovsky et.al.|[2410.13342](http://arxiv.org/abs/2410.13342)|null|
 303 | |**2024-10-17**|**DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2410.13288](http://arxiv.org/abs/2410.13288)|null|
 304 | |**2024-10-17**|**Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation**|Sreyan Ghosh et.al.|[2410.13198](http://arxiv.org/abs/2410.13198)|null|
 305 | |**2024-10-16**|**ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs**|Rui-Chen Zheng et.al.|[2410.12359](http://arxiv.org/abs/2410.12359)|null|
 306 | |**2024-10-14**|**IsoChronoMeter: A simple and effective isochronic translation evaluation metric**|Nikolai Rozanov et.al.|[2410.11127](http://arxiv.org/abs/2410.11127)|null|
 307 | |**2024-10-14**|**DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization**|Yinghao Aaron Li et.al.|[2410.11097](http://arxiv.org/abs/2410.11097)|null|
 308 | |**2024-10-12**|**Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling**|Rui Liu et.al.|[2410.09524](http://arxiv.org/abs/2410.09524)|null|
 309 | |**2024-10-10**|**Unsupervised Data Validation Methods for Efficient Model Training**|Yurii Paniv et.al.|[2410.07880](http://arxiv.org/abs/2410.07880)|null|
 310 | |**2024-10-15**|**F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching**|Yushen Chen et.al.|[2410.06885](http://arxiv.org/abs/2410.06885)|**[link](https://github.com/SWivid/F5-TTS)**|
 311 | |**2024-10-09**|**Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch**|Teodora Răgman et.al.|[2410.06787](http://arxiv.org/abs/2410.06787)|null|
 312 | |**2024-10-09**|**Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS**|Onkar Kishor Susladkar et.al.|[2410.06608](http://arxiv.org/abs/2410.06608)|null|
 313 | |**2024-10-09**|**Can DeepFake Speech be Reliably Detected?**|Hongbin Liu et.al.|[2410.06572](http://arxiv.org/abs/2410.06572)|null|
 314 | |**2024-10-07**|**SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech**|Minchan Kim et.al.|[2410.04690](http://arxiv.org/abs/2410.04690)|null|
 315 | |**2024-10-06**|**HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis**|Yuto Nishimura et.al.|[2410.04380](http://arxiv.org/abs/2410.04380)|null|
 316 | |**2024-10-10**|**SONAR: A Synthetic AI-Audio Detection Framework and Benchmark**|Xiang Li et.al.|[2410.04324](http://arxiv.org/abs/2410.04324)|**[link](https://github.com/jessegator/sonar)**|
 317 | |**2024-10-05**|**Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System**|Ze Li et.al.|[2410.04017](http://arxiv.org/abs/2410.04017)|null|
 318 | |**2024-10-01**|**Recent Advances in Speech Language Models: A Survey**|Wenqian Cui et.al.|[2410.03751](http://arxiv.org/abs/2410.03751)|null|
 319 | |**2024-10-04**|**Generative Semantic Communication for Text-to-Speech Synthesis**|Jiahao Zheng et.al.|[2410.03459](http://arxiv.org/abs/2410.03459)|null|
 320 | |**2024-10-04**|**Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens**|Jinzheng Zhao et.al.|[2410.03298](http://arxiv.org/abs/2410.03298)|null|
 321 | |**2024-10-04**|**Narrative Player: Reviving Data Narratives with Visuals**|Zekai Shao et.al.|[2410.03268](http://arxiv.org/abs/2410.03268)|null|
 322 | |**2024-10-04**|**MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech**|Taejun Bak et.al.|[2410.03192](http://arxiv.org/abs/2410.03192)|null|
 323 | |**2024-10-01**|**Augmentation through Laundering Attacks for Audio Spoof Detection**|Hashim Ali et.al.|[2410.01108](http://arxiv.org/abs/2410.01108)|null|
 324 | |**2024-10-01**|**Zero-Shot Text-to-Speech from Continuous Text Streams**|Trung Dang et.al.|[2410.00767](http://arxiv.org/abs/2410.00767)|null|
 325 | |**2024-10-01**|**EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control**|Haozhe Chen et.al.|[2410.00316](http://arxiv.org/abs/2410.00316)|**[link](https://github.com/tonychenxyz/emoknob)**|
 326 | |**2024-09-30**|**Word-wise intonation model for cross-language TTS systems**|Tomilov A. A. et.al.|[2409.20374](http://arxiv.org/abs/2409.20374)|null|
 327 | |**2024-09-27**|**Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech**|Youngjae Kim et.al.|[2409.18622](http://arxiv.org/abs/2409.18622)|null|
 328 | |**2024-09-26**|**Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control**|Ryuichi Yamamoto et.al.|[2409.17452](http://arxiv.org/abs/2409.17452)|null|
 329 | |**2024-09-25**|**Exploring synthetic data for cross-speaker style transfer in style representation based TTS**|Lucas H. Ueda et.al.|[2409.17364](http://arxiv.org/abs/2409.17364)|null|
 330 | |**2024-09-25**|**Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions**|Kun Zhou et.al.|[2409.16681](http://arxiv.org/abs/2409.16681)|null|
 331 | |**2024-09-25**|**Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation**|Siyin Wang et.al.|[2409.16644](http://arxiv.org/abs/2409.16644)|null|
 332 | |**2024-09-24**|**FastTalker: Jointly Generating Speech and Conversational Gestures from Text**|Zixin Guo et.al.|[2409.16404](http://arxiv.org/abs/2409.16404)|null|
 333 | |**2024-09-24**|**Beyond Text-to-Text: An Overview of Multimodal and Generative Artificial Intelligence for Education Using Topic Modeling**|Ville Heilala et.al.|[2409.16376](http://arxiv.org/abs/2409.16376)|null|
 334 | |**2024-09-24**|**Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech**|Yunji Chu et.al.|[2409.16203](http://arxiv.org/abs/2409.16203)|null|
 335 | |**2024-09-24**|**NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers**|Nohil Park et.al.|[2409.15760](http://arxiv.org/abs/2409.15760)|null|
 336 | |**2024-09-24**|**VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance**|Jiheum Yeom et.al.|[2409.15759](http://arxiv.org/abs/2409.15759)|null|
 337 | |**2024-09-24**|**StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis**|Zhiyong Chen et.al.|[2409.15741](http://arxiv.org/abs/2409.15741)|null|
 338 | |**2024-09-23**|**A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection**|Lam Pham et.al.|[2409.15180](http://arxiv.org/abs/2409.15180)|null|
 339 | |**2024-09-23**|**LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation**|Hieu-Thi Luong et.al.|[2409.14743](http://arxiv.org/abs/2409.14743)|null|
 340 | |**2024-09-20**|**Zero-shot Cross-lingual Voice Transfer for TTS**|Fadi Biadsy et.al.|[2409.13910](http://arxiv.org/abs/2409.13910)|null|
 341 | |**2024-09-20**|**On the Feasibility of Fully AI-automated Vishing Attacks**|João Figueiredo et.al.|[2409.13793](http://arxiv.org/abs/2409.13793)|null|
 342 | |**2024-09-19**|**Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space**|Sebastião Quintas et.al.|[2409.12745](http://arxiv.org/abs/2409.12745)|null|
 343 | |**2024-09-19**|**Preference Alignment Improves Language Model-Based TTS**|Jinchuan Tian et.al.|[2409.12403](http://arxiv.org/abs/2409.12403)|null|
 344 | |**2024-09-18**|**Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference**|Edresson Casanova et.al.|[2409.12117](http://arxiv.org/abs/2409.12117)|null|
 345 | |**2024-09-18**|**Exploring an Inter-Pausal Unit (IPU) based Approach for Indic End-to-End TTS Systems**|Anusha Prakash et.al.|[2409.11915](http://arxiv.org/abs/2409.11915)|null|
 346 | |**2024-09-18**|**DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech**|Xin Qi et.al.|[2409.11835](http://arxiv.org/abs/2409.11835)|null|
 347 | |**2024-09-18**|**Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation**|Haohan Guo et.al.|[2409.11630](http://arxiv.org/abs/2409.11630)|null|
 348 | |**2024-09-17**|**SpMis: An Investigation of Synthetic Spoken Misinformation Detection**|Peizhuo Liu et.al.|[2409.11308](http://arxiv.org/abs/2409.11308)|null|
 349 | |**2024-09-19**|**The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives**|Samee Arif et.al.|[2409.11261](http://arxiv.org/abs/2409.11261)|**[link](https://github.com/ulrs0/The-Art-of-Story-Telling)**|
 350 | |**2024-09-17**|**Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora**|Francesco Nespoli et.al.|[2409.11107](http://arxiv.org/abs/2409.11107)|null|
 351 | |**2024-09-16**|**Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization**|Xiaoxue Gao et.al.|[2409.10157](http://arxiv.org/abs/2409.10157)|null|
 352 | |**2024-09-16**|**StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion**|Yinghao Aaron Li et.al.|[2409.10058](http://arxiv.org/abs/2409.10058)|null|
 353 | |**2024-09-15**|**Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning**|Siqi Sun et.al.|[2409.09891](http://arxiv.org/abs/2409.09891)|null|
 354 | |**2024-09-14**|**E1 TTS: Simple and Fast Non-Autoregressive TTS**|Zhijun Liu et.al.|[2409.09351](http://arxiv.org/abs/2409.09351)|null|
 355 | |**2024-09-14**|**Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation**|Changjin Han et.al.|[2409.09311](http://arxiv.org/abs/2409.09311)|null|
 356 | |**2024-09-14**|**SafeEar: Content Privacy-Preserving Audio Deepfake Detection**|Xinfeng Li et.al.|[2409.09272](http://arxiv.org/abs/2409.09272)|**[link](https://github.com/LetterLiGo/SafeEar)**|
 357 | |**2024-09-13**|**AccentBox: Towards High-Fidelity Zero-Shot Accent Generation**|Jinzuomu Zhong et.al.|[2409.09098](http://arxiv.org/abs/2409.09098)|null|
 358 | |**2024-09-17**|**HLTCOE JHU Submission to the Voice Privacy Challenge 2024**|Henry Li Xinyuan et.al.|[2409.08913](http://arxiv.org/abs/2409.08913)|null|
 359 | |**2024-09-13**|**Text-To-Speech Synthesis In The Wild**|Jee-weon Jung et.al.|[2409.08711](http://arxiv.org/abs/2409.08711)|null|
 360 | |**2024-09-14**|**Exploring Accessibility Trends and Challenges in Mobile App Development: A Study of Stack Overflow Questions**|Amila Indika et.al.|[2409.07945](http://arxiv.org/abs/2409.07945)|null|
 361 | |**2024-09-12**|**Full-text Error Correction for Chinese Speech Recognition with Large Language Model**|Zhiyuan Tang et.al.|[2409.07790](http://arxiv.org/abs/2409.07790)|null|
 362 | |**2024-09-11**|**SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis**|Helin Wang et.al.|[2409.07556](http://arxiv.org/abs/2409.07556)|**[link](https://github.com/WangHelin1997/SSR-Speech)**|
 363 | |**2024-09-11**|**D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack**|Hong-Hanh Nguyen-Le et.al.|[2409.07390](http://arxiv.org/abs/2409.07390)|null|
 364 | |**2024-09-11**|**Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT**|Kazuki Yamauchi et.al.|[2409.07265](http://arxiv.org/abs/2409.07265)|null|
 365 | |**2024-09-11**|**Zero-Shot Text-to-Speech as Golden Speech Generator: A Systematic Framework and its Applicability in Automatic Pronunciation Assessment**|Tien-Hong Lo et.al.|[2409.07151](http://arxiv.org/abs/2409.07151)|null|
 366 | |**2024-09-10**|**Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models**|Xin Jing et.al.|[2409.06451](http://arxiv.org/abs/2409.06451)|null|
 367 | |**2024-09-10**|**What happens to diffusion model likelihood when your model is conditional?**|Mattias Cross et.al.|[2409.06364](http://arxiv.org/abs/2409.06364)|null|
 368 | |**2024-09-10**|**VoiceWukong: Benchmarking Deepfake Voice Detection**|Ziwei Yan et.al.|[2409.06348](http://arxiv.org/abs/2409.06348)|null|
 369 | |**2024-09-09**|**AS-Speech: Adaptive Style For Speech Synthesis**|Zhipeng Li et.al.|[2409.05730](http://arxiv.org/abs/2409.05730)|null|
 370 | |**2024-09-09**|**IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS**|Ashwin Sankar et.al.|[2409.05356](http://arxiv.org/abs/2409.05356)|**[link](https://github.com/ai4bharat/indicvoices-r)**|
 371 | |**2024-09-10**|**Disentangling the Prosody and Semantic Information with Pre-trained Model for In-Context Learning based Zero-Shot Voice Conversion**|Zhengyang Chen et.al.|[2409.05004](http://arxiv.org/abs/2409.05004)|null|
 372 | |**2024-09-01**|**Sample-Efficient Diffusion for Text-To-Speech Synthesis**|Justin Lovelace et.al.|[2409.03717](http://arxiv.org/abs/2409.03717)|**[link](https://github.com/justinlovelace/sesd)**|
 373 | |**2024-09-10**|**LAST: Language Model Aware Speech Tokenization**|Arnon Turetzky et.al.|[2409.03701](http://arxiv.org/abs/2409.03701)|null|
 374 | |**2024-09-05**|**FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications**|Hao-Han Guo et.al.|[2409.03283](http://arxiv.org/abs/2409.03283)|null|
 375 | |**2024-09-04**|**Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems**|Jeongmin Liu et.al.|[2409.02517](http://arxiv.org/abs/2409.02517)|null|
 376 | |**2024-09-03**|**VoxHakka: A Dialectally Diverse Multi-speaker Text-to-Speech System for Taiwanese Hakka**|Li-Wei Chen et.al.|[2409.01548](http://arxiv.org/abs/2409.01548)|null|
 377 | |**2024-09-02**|**A multilingual training strategy for low resource Text to Speech**|Asma Amalas et.al.|[2409.01217](http://arxiv.org/abs/2409.01217)|null|
 378 | |**2024-09-02**|**A Framework for Synthetic Audio Conversations Generation using Large Language Models**|Kaung Myat Kyaw et.al.|[2409.00946](http://arxiv.org/abs/2409.00946)|null|
 379 | |**2024-09-02**|**SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis**|Haohan Guo et.al.|[2409.00933](http://arxiv.org/abs/2409.00933)|**[link](https://github.com/hhguo/socodec)**|
 380 | |**2024-09-01**|**MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer**|Yuancheng Wang et.al.|[2409.00750](http://arxiv.org/abs/2409.00750)|null|
 381 | |**2024-08-30**|**SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection**|Ismail Rasim Ulgen et.al.|[2408.17432](http://arxiv.org/abs/2408.17432)|null|
 382 | |**2024-08-30**|**AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge**|Kirill Borodin et.al.|[2408.17352](http://arxiv.org/abs/2408.17352)|null|
 383 | |**2024-08-30**|**Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model**|Zhen Ye et.al.|[2408.17175](http://arxiv.org/abs/2408.17175)|**[link](https://github.com/zhenye234/xcodec)**|
 384 | |**2024-08-30**|**Utilizing Speaker Profiles for Impersonation Audio Detection**|Hao Gu et.al.|[2408.17009](http://arxiv.org/abs/2408.17009)|null|
 385 | |**2024-08-29**|**Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis**|Zehai Tu et.al.|[2408.16373](http://arxiv.org/abs/2408.16373)|null|
 386 | |**2024-08-28**|**Multi-modal Adversarial Training for Zero-Shot Voice Cloning**|John Janiczek et.al.|[2408.15916](http://arxiv.org/abs/2408.15916)|null|
 387 | |**2024-08-29**|**Easy, Interpretable, Effective: openSMILE for voice deepfake detection**|Octavian Pascu et.al.|[2408.15775](http://arxiv.org/abs/2408.15775)|null|
 388 | |**2024-08-28**|**VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling**|Yixuan Zhou et.al.|[2408.15676](http://arxiv.org/abs/2408.15676)|**[link](https://github.com/thuhcsi/voxinstruct)**|
 389 | |**2024-08-28**|**VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech**|Heeseung Kim et.al.|[2408.14739](http://arxiv.org/abs/2408.14739)|null|
 390 | |**2024-08-27**|**StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech**|Haowei Lou et.al.|[2408.14713](http://arxiv.org/abs/2408.14713)|null|
 391 | |**2024-08-27**|**DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance**|Jinhyeok Yang et.al.|[2408.14423](http://arxiv.org/abs/2408.14423)|null|
 392 | |**2024-08-26**|**Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard**|Wonjune Kang et.al.|[2408.13970](http://arxiv.org/abs/2408.13970)|null|
 393 | |**2024-08-28**|**SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2408.13893](http://arxiv.org/abs/2408.13893)|null|
 394 | |**2024-08-22**|**Positional Description for Numerical Normalization**|Deepanshu Gupta et.al.|[2408.12430](http://arxiv.org/abs/2408.12430)|null|
 395 | |**2024-08-22**|**VoiceX: A Text-To-Speech Framework for Custom Voices**|Silvan Mertes et.al.|[2408.12170](http://arxiv.org/abs/2408.12170)|null|
 396 | |**2024-08-13**|**Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation**|Yinghao Aaron Li et.al.|[2408.11849](http://arxiv.org/abs/2408.11849)|null|
 397 | |**2024-08-20**|**EELE: Exploring Efficient and Extensible LoRA Integration in Emotional Text-to-Speech**|Xin Qi et.al.|[2408.10852](http://arxiv.org/abs/2408.10852)|null|
 398 | |**2024-08-20**|**SSL-TTS: Leveraging Self-Supervised Embeddings and kNN Retrieval for Zero-Shot Multi-speaker TTS**|Karl El Hajal et.al.|[2408.10771](http://arxiv.org/abs/2408.10771)|null|
 399 | |**2024-08-20**|**Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting**|Hyun Jin Park et.al.|[2408.10463](http://arxiv.org/abs/2408.10463)|null|
 400 | |**2024-08-17**|**Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition**|Samuele Cornell et.al.|[2408.09215](http://arxiv.org/abs/2408.09215)|**[link](https://github.com/popcornell/ASRLightningFT)**|
 401 | |**2024-08-14**|**PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation**|Sang-Hoon Lee et.al.|[2408.07547](http://arxiv.org/abs/2408.07547)|**[link](https://github.com/sh-lee-prml/periodwave)**|
 402 | |**2024-08-13**|**SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis**|Osamu Take et.al.|[2408.06858](http://arxiv.org/abs/2408.06858)|**[link](https://github.com/sarulab-speech/saslaw)**|
 403 | |**2024-08-13**|**PRESENT: Zero-Shot Text-to-Prosody Control**|Perry Lam et.al.|[2408.06827](http://arxiv.org/abs/2408.06827)|**[link](https://github.com/iamanigeeit/present)**|
 404 | |**2024-08-12**|**FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks**|Min Ma et.al.|[2408.06227](http://arxiv.org/abs/2408.06227)|null|
 405 | |**2024-08-11**|**VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing**|Chunyu Qiang et.al.|[2408.05758](http://arxiv.org/abs/2408.05758)|null|
 406 | |**2024-08-06**|**Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training**|Hawraz A. Ahmad et.al.|[2408.03887](http://arxiv.org/abs/2408.03887)|null|
 407 | |**2024-08-03**|**ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features**|Peng Cheng et.al.|[2408.01808](http://arxiv.org/abs/2408.01808)|**[link](https://github.com/TASER2023/TASER)**|
 408 | |**2024-08-01**|**Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation**|Xinhan Di et.al.|[2408.00284](http://arxiv.org/abs/2408.00284)|null|
 409 | |**2024-07-18**|**Handling Numeric Expressions in Automatic Speech Recognition**|Christian Huber et.al.|[2408.00004](http://arxiv.org/abs/2408.00004)|null|
 410 | |**2024-07-31**|**On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition**|Nick Rossenbach et.al.|[2407.21476](http://arxiv.org/abs/2407.21476)|null|
 411 | |**2024-07-29**|**Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks**|Mahmoud Salhab et.al.|[2407.18571](http://arxiv.org/abs/2407.18571)|null|
 412 | |**2024-07-25**|**On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures**|Nick Rossenbach et.al.|[2407.17997](http://arxiv.org/abs/2407.17997)|null|
 413 | |**2024-07-24**|**Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model**|Jan Lehečka et.al.|[2407.17167](http://arxiv.org/abs/2407.17167)|null|
 414 | |**2024-07-23**|**Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments**|Pai Zhu et.al.|[2407.16840](http://arxiv.org/abs/2407.16840)|null|
 415 | |**2024-07-19**|**Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2**|Chun Xu et.al.|[2407.14212](http://arxiv.org/abs/2407.14212)|null|
 416 | |**2024-07-18**|**Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models**|Weiqin Li et.al.|[2407.13509](http://arxiv.org/abs/2407.13509)|null|
 417 | |**2024-07-22**|**TTSDS -- Text-to-Speech Distribution Score**|Christoph Minixhofer et.al.|[2407.12707](http://arxiv.org/abs/2407.12707)|**[link](https://github.com/ttsds/ttsds)**|
 418 | |**2024-07-17**|**Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech**|Haibin Wu et.al.|[2407.12229](http://arxiv.org/abs/2407.12229)|**[link](https://github.com/hbwu-ntu/emoctrltts-eval)**|
 419 | |**2024-07-16**|**A Language Modeling Approach to Diacritic-Free Hebrew TTS**|Amit Roth et.al.|[2407.12206](http://arxiv.org/abs/2407.12206)|null|
 420 | |**2024-07-17**|**Learning High-Frequency Functions Made Easy with Sinusoidal Positional Encoding**|Chuanhao Sun et.al.|[2407.09370](http://arxiv.org/abs/2407.09370)|**[link](https://github.com/zhyuan11/SPE)**|
 421 | |**2024-07-11**|**Autoregressive Speech Synthesis without Vector Quantization**|Lingwei Meng et.al.|[2407.08551](http://arxiv.org/abs/2407.08551)|null|
 422 | |**2024-07-10**|**Source Tracing of Audio Deepfake Systems**|Nicholas Klein et.al.|[2407.08016](http://arxiv.org/abs/2407.08016)|null|
 423 | |**2024-07-07**|**ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation**|Ruibo Fu et.al.|[2407.05421](http://arxiv.org/abs/2407.05421)|null|
 424 | |**2024-07-09**|**CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens**|Zhihao Du et.al.|[2407.05407](http://arxiv.org/abs/2407.05407)|null|
 425 | |**2024-07-04**|**Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis**|Cong-Thanh Do et.al.|[2407.04047](http://arxiv.org/abs/2407.04047)|null|
 426 | |**2024-07-04**|**Optimizing a-DCF for Spoofing-Robust Speaker Verification**|Oğuzhan Kurnaz et.al.|[2407.04034](http://arxiv.org/abs/2407.04034)|null|
 427 | |**2024-07-04**|**On the Effectiveness of Acoustic BPE in Decoder-Only TTS**|Bohan Li et.al.|[2407.03892](http://arxiv.org/abs/2407.03892)|null|
 428 | |**2024-07-14**|**CATT: Character-based Arabic Tashkeel Transformer**|Faris Alasmary et.al.|[2407.03236](http://arxiv.org/abs/2407.03236)|**[link](https://github.com/abjadai/catt)**|
 429 | |**2024-07-02**|**Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization**|Yuchen Hu et.al.|[2407.02243](http://arxiv.org/abs/2407.02243)|null|
 430 | |**2024-07-02**|**TTSlow: Slow Down Text-to-Speech with Efficiency Robustness Evaluations**|Xiaoxue Gao et.al.|[2407.01927](http://arxiv.org/abs/2407.01927)|null|
 431 | |**2024-07-01**|**Lightweight Zero-shot Text-to-Speech with Mixture of Adapters**|Kenichi Fujita et.al.|[2407.01291](http://arxiv.org/abs/2407.01291)|null|
 432 | |**2024-06-30**|**NAIST Simultaneous Speech Translation System for IWSLT 2024**|Yuka Ko et.al.|[2407.00826](http://arxiv.org/abs/2407.00826)|null|
 433 | |**2024-06-30**|**An Attribute Interpolation Method in Speech Synthesis by Model Merging**|Masato Murata et.al.|[2407.00766](http://arxiv.org/abs/2407.00766)|null|
 434 | |**2024-06-30**|**FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis**|Yinlin Guo et.al.|[2407.00753](http://arxiv.org/abs/2407.00753)|null|
 435 | |**2024-07-02**|**Open-Source Conversational AI with SpeechBrain 1.0**|Mirco Ravanelli et.al.|[2407.00463](http://arxiv.org/abs/2407.00463)|null|
 436 | |**2024-06-27**|**Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models**|Borodin Kirill Nikolayevich et.al.|[2406.19243](http://arxiv.org/abs/2406.19243)|null|
 437 | |**2024-06-27**|**DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability**|Hyun Joon Park et.al.|[2406.19135](http://arxiv.org/abs/2406.19135)|**[link](https://github.com/winddori2002/dex-tts)**|
 438 | |**2024-06-26**|**Automatic Speech Recognition for Hindi**|Anish Saha et.al.|[2406.18135](http://arxiv.org/abs/2406.18135)|null|
 439 | |**2024-06-26**|**A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons**|Tzu-Yun Hung et.al.|[2406.18089](http://arxiv.org/abs/2406.18089)|null|
 440 | |**2024-06-29**|**LLM-Driven Multimodal Opinion Expression Identification**|Bonian Jia et.al.|[2406.18088](http://arxiv.org/abs/2406.18088)|null|
 441 | |**2024-06-26**|**E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS**|Sefik Emre Eskimez et.al.|[2406.18009](http://arxiv.org/abs/2406.18009)|**[link](https://github.com/microsoft/e2tts-test-suite)**|
 442 | |**2024-06-25**|**Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment**|Paarth Neekhara et.al.|[2406.17957](http://arxiv.org/abs/2406.17957)|null|
 443 | |**2024-06-22**|**A multi-speaker multi-lingual voice cloning system based on vits2 for limmits 2024 challenge**|Xiaopeng Wang et.al.|[2406.17801](http://arxiv.org/abs/2406.17801)|null|
 444 | |**2024-06-25**|**High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model**|Joun Yeop Lee et.al.|[2406.17310](http://arxiv.org/abs/2406.17310)|null|
 445 | |**2024-06-25**|**Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation**|Yingting Li et.al.|[2406.17257](http://arxiv.org/abs/2406.17257)|null|
 446 | |**2024-06-24**|**Exploring the Capability of Mamba in Speech Applications**|Koichi Miyazaki et.al.|[2406.16808](http://arxiv.org/abs/2406.16808)|null|
 447 | |**2024-06-25**|**Towards Zero-Shot Text-To-Speech for Arabic Dialects**|Khai Duy Doan et.al.|[2406.16751](http://arxiv.org/abs/2406.16751)|null|
 448 | |**2024-06-22**|**TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers**|Yakun Song et.al.|[2406.15752](http://arxiv.org/abs/2406.15752)|**[link](https://github.com/Ereboas/TacoLM)**|
 449 | |**2024-06-21**|**InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions**|Yu Nakagome et.al.|[2406.14890](http://arxiv.org/abs/2406.14890)|null|
 450 | |**2024-06-21**|**GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech**|Wenbin Wang et.al.|[2406.14875](http://arxiv.org/abs/2406.14875)|null|
 451 | |**2024-06-21**|**DASB - Discrete Audio and Speech Benchmark**|Pooneh Mousavi et.al.|[2406.14294](http://arxiv.org/abs/2406.14294)|null|
 452 | |**2024-06-18**|**Instruction Data Generation and Unsupervised Adaptation for Speech Language Models**|Vahid Noroozi et.al.|[2406.12946](http://arxiv.org/abs/2406.12946)|null|
 453 | |**2024-06-17**|**DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer**|Keon Lee et.al.|[2406.11427](http://arxiv.org/abs/2406.11427)|null|
 454 | |**2024-06-16**|**NAST: Noise Aware Speech Tokenization for Speech Language Models**|Shoval Messica et.al.|[2406.11037](http://arxiv.org/abs/2406.11037)|**[link](https://github.com/ShovalMessica/NAST)**|
 455 | |**2024-06-16**|**Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis**|Xuehao Zhou et.al.|[2406.10844](http://arxiv.org/abs/2406.10844)|null|
 456 | |**2024-06-14**|**Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice**|Shubham Gupta et.al.|[2406.10422](http://arxiv.org/abs/2406.10422)|null|
 457 | |**2024-06-14**|**UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner**|Dongchao Yang et.al.|[2406.10056](http://arxiv.org/abs/2406.10056)|**[link](https://github.com/yangdongchao/llm-codec)**|
 458 | |**2024-06-14**|**MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model**|Jiatong Shi et.al.|[2406.09869](http://arxiv.org/abs/2406.09869)|null|
 459 | |**2024-06-13**|**DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage**|Kyra Wang et.al.|[2406.08820](http://arxiv.org/abs/2406.08820)|null|
 460 | |**2024-06-13**|**Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems**|Zhengyang Chen et.al.|[2406.08812](http://arxiv.org/abs/2406.08812)|null|
 461 | |**2024-06-13**|**DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing**|Neha Sahipjohn et.al.|[2406.08802](http://arxiv.org/abs/2406.08802)|null|
 462 | |**2024-06-12**|**Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis**|Wing-Zin Leung et.al.|[2406.08568](http://arxiv.org/abs/2406.08568)|**[link](https://github.com/WingZLeung/TTDS)**|
 463 | |**2024-06-12**|**Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data**|Yuma Shirahata et.al.|[2406.08111](http://arxiv.org/abs/2406.08111)|null|
 464 | |**2024-06-12**|**VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech**|Ashishkumar Gudmalwar et.al.|[2406.08076](http://arxiv.org/abs/2406.08076)|null|
 465 | |**2024-06-12**|**LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning**|Masaya Kawamura et.al.|[2406.07969](http://arxiv.org/abs/2406.07969)|**[link](https://github.com/line/libritts-p)**|
 466 | |**2024-06-12**|**VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment**|Bing Han et.al.|[2406.07855](http://arxiv.org/abs/2406.07855)|null|
 467 | |**2024-06-12**|**EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech**|Deok-Hyeon Cho et.al.|[2406.07803](http://arxiv.org/abs/2406.07803)|**[link](https://github.com/Choddeok/EmoSphere-TTS)**|
 468 | |**2024-06-11**|**The Interspeech 2024 Challenge on Speech Processing Using Discrete Units**|Xuankai Chang et.al.|[2406.07725](http://arxiv.org/abs/2406.07725)|null|
 469 | |**2024-06-11**|**Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?**|Qingkai Fang et.al.|[2406.07289](http://arxiv.org/abs/2406.07289)|null|
 470 | |**2024-06-11**|**AudioMarkBench: Benchmarking Robustness of Audio Watermarking**|Hongbin Liu et.al.|[2406.06979](http://arxiv.org/abs/2406.06979)|**[link](https://github.com/moyangkuo/audiomarkbench)**|
 471 | |**2024-06-11**|**Controlling Emotion in Text-to-Speech with Natural Language Prompts**|Thomas Bott et.al.|[2406.06406](http://arxiv.org/abs/2406.06406)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
 472 | |**2024-06-10**|**Meta Learning Text-to-Speech Synthesis in over 7000 Languages**|Florian Lux et.al.|[2406.06403](http://arxiv.org/abs/2406.06403)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
 473 | |**2024-06-10**|**MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance**|Semin Kim et.al.|[2406.05965](http://arxiv.org/abs/2406.05965)|null|
 474 | |**2024-06-11**|**WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark**|Linhan Ma et.al.|[2406.05763](http://arxiv.org/abs/2406.05763)|**[link](https://github.com/dukGuo/valle-audiodec)**|
 475 | |**2024-06-09**|**An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS**|Xiaofei Wang et.al.|[2406.05699](http://arxiv.org/abs/2406.05699)|null|
 476 | |**2024-06-11**|**Text-aware and Context-aware Expressive Audiobook Speech Synthesis**|Dake Guo et.al.|[2406.05672](http://arxiv.org/abs/2406.05672)|null|
 477 | |**2024-06-08**|**Autoregressive Diffusion Transformer for Text-to-Speech Synthesis**|Zhijun Liu et.al.|[2406.05551](http://arxiv.org/abs/2406.05551)|null|
 478 | |**2024-06-08**|**VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers**|Sanyuan Chen et.al.|[2406.05370](http://arxiv.org/abs/2406.05370)|null|
 479 | |**2024-06-07**|**Spectral Codecs: Spectrogram-Based Audio Codecs for High Quality Speech Synthesis**|Ryan Langman et.al.|[2406.05298](http://arxiv.org/abs/2406.05298)|null|
 480 | |**2024-06-07**|**XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model**|Edresson Casanova et.al.|[2406.04904](http://arxiv.org/abs/2406.04904)|**[link](https://github.com/Edresson/ZS-TTS-Evaluation)**|
 481 | |**2024-06-07**|**TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking**|Junzuo Zhou et.al.|[2406.04840](http://arxiv.org/abs/2406.04840)|null|
 482 | |**2024-06-07**|**Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study**|Chong Zhang et.al.|[2406.04633](http://arxiv.org/abs/2406.04633)|null|
 483 | |**2024-06-06**|**Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis**|Théodor Lemerle et.al.|[2406.04467](http://arxiv.org/abs/2406.04467)|**[link](https://github.com/theodorblackbird/lina-speech)**|
 484 | |**2024-06-06**|**Total-Duration-Aware Duration Modeling for Text-to-Speech Systems**|Sefik Emre Eskimez et.al.|[2406.04281](http://arxiv.org/abs/2406.04281)|null|
 485 | |**2024-06-06**|**Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining**|Jinlong Xue et.al.|[2406.03714](http://arxiv.org/abs/2406.03714)|null|
 486 | |**2024-06-06**|**Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model**|Jinlong Xue et.al.|[2406.03706](http://arxiv.org/abs/2406.03706)|null|
 487 | |**2024-06-05**|**Style Mixture of Experts for Expressive Text-To-Speech Synthesis**|Ahad Jawaid et.al.|[2406.03637](http://arxiv.org/abs/2406.03637)|null|
 488 | |**2024-06-07**|**Harder or Different? Understanding Generalization of Audio Deepfake Detection**|Nicolas M. Müller et.al.|[2406.03512](http://arxiv.org/abs/2406.03512)|null|
 489 | |**2024-06-05**|**LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes**|Trung Dang et.al.|[2406.02897](http://arxiv.org/abs/2406.02897)|null|
 490 | |**2024-06-04**|**Seed-TTS: A Family of High-Quality Versatile Speech Generation Models**|Philip Anastassiou et.al.|[2406.02430](http://arxiv.org/abs/2406.02430)|**[link](https://github.com/BytedanceSpeech/seed-tts-eval)**|
 491 | |**2024-06-05**|**SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models**|Dongchao Yang et.al.|[2406.02328](http://arxiv.org/abs/2406.02328)|null|
 492 | |**2024-06-04**|**BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation**|Hui-Peng Du et.al.|[2406.02162](http://arxiv.org/abs/2406.02162)|null|
 493 | |**2024-06-04**|**Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis**|Kun Zhou et.al.|[2406.02009](http://arxiv.org/abs/2406.02009)|null|
 494 | |**2024-06-03**|**ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec**|Shengpeng Ji et.al.|[2406.01205](http://arxiv.org/abs/2406.01205)|**[link](https://github.com/jishengpeng/controlspeech)**|
 495 | |**2024-06-03**|**Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training**|Jan Melechovsky et.al.|[2406.01018](http://arxiv.org/abs/2406.01018)|null|
 496 | |**2024-06-02**|**Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback**|Chen Chen et.al.|[2406.00654](http://arxiv.org/abs/2406.00654)|null|
 497 | |**2024-05-31**|**Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities**|Vicky Zayats et.al.|[2405.18669](http://arxiv.org/abs/2405.18669)|null|
 498 | |**2024-05-28**|**TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation**|Chenyang Le et.al.|[2405.17809](http://arxiv.org/abs/2405.17809)|null|
 499 | |**2024-05-27**|**RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis**|Haoxiang Shi et.al.|[2405.17028](http://arxiv.org/abs/2405.17028)|null|
 500 | |**2024-05-24**|**Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition**|Zijin Gu et.al.|[2405.15216](http://arxiv.org/abs/2405.15216)|null|
 501 | |**2024-05-23**|**Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models**|Jingyi Chen et.al.|[2405.14632](http://arxiv.org/abs/2405.14632)|null|
 502 | |**2024-05-22**|**A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction**|Yue Li et.al.|[2405.13477](http://arxiv.org/abs/2405.13477)|null|
 503 | |**2024-05-20**|**Multi-speaker Text-to-speech Training with Speaker Anonymized Data**|Wen-Chin Huang et.al.|[2405.11767](http://arxiv.org/abs/2405.11767)|null|
 504 | |**2024-05-19**|**VR-GPT: Visual Language Model for Intelligent Virtual Reality Applications**|Mikhail Konenkov et.al.|[2405.11537](http://arxiv.org/abs/2405.11537)|null|
 505 | |**2024-05-18**|**Exploring speech style spaces with language models: Emotional TTS without emotion labels**|Shreeram Suresh Chandra et.al.|[2405.11413](http://arxiv.org/abs/2405.11413)|null|
 506 | |**2024-05-16**|**Faces that Speak: Jointly Synthesising Talking Face and Speech from Text**|Youngjoon Jang et.al.|[2405.10272](http://arxiv.org/abs/2405.10272)|null|
 507 | |**2024-05-16**|**Building a Luganda Text-to-Speech Model From Crowdsourced Data**|Sulaiman Kagumire et.al.|[2405.10211](http://arxiv.org/abs/2405.10211)|null|
 508 | |**2024-05-16**|**Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model**|Siyang Wang et.al.|[2405.09768](http://arxiv.org/abs/2405.09768)|null|
 509 | |**2024-05-15**|**Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer**|Weifei Jin et.al.|[2405.09470](http://arxiv.org/abs/2405.09470)|null|
 510 | |**2024-05-15**|**Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis**|Sho Inoue et.al.|[2405.09171](http://arxiv.org/abs/2405.09171)|null|
 511 | |**2024-05-14**|**PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset**|Yang Hou et.al.|[2405.08838](http://arxiv.org/abs/2405.08838)|**[link](https://github.com/tobuta/PolyGlotFake)**|
 512 | |**2024-04-30**|**Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech**|Hankun Wang et.al.|[2404.19723](http://arxiv.org/abs/2404.19723)|null|
 513 | |**2024-04-29**|**MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis**|Xiang Li et.al.|[2404.18398](http://arxiv.org/abs/2404.18398)|null|
 514 | |**2024-04-28**|**USAT: A Universal Speaker-Adaptive Text-to-Speech Approach**|Wenbin Wang et.al.|[2404.18094](http://arxiv.org/abs/2404.18094)|**[link](https://github.com/mushanshanshan/esltts)**|
 515 | |**2024-04-27**|**TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality**|Tiantian Feng et.al.|[2404.17983](http://arxiv.org/abs/2404.17983)|null|
 516 | |**2024-04-26**|**An RFP dataset for Real, Fake, and Partially fake audio detection**|Abdulazeez AlAli et.al.|[2404.17721](http://arxiv.org/abs/2404.17721)|null|
 517 | |**2024-04-23**|**StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations**|Sen Liu et.al.|[2404.14946](http://arxiv.org/abs/2404.14946)|null|
 518 | |**2024-04-23**|**Retrieval-Augmented Audio Deepfake Detection**|Zuheng Kang et.al.|[2404.13892](http://arxiv.org/abs/2404.13892)|null|
 519 | |**2024-04-14**|**Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling**|Quanxiu Wang et.al.|[2404.09192](http://arxiv.org/abs/2404.09192)|null|
 520 | |**2024-04-11**|**Voice-Assisted Real-Time Traffic Sign Recognition System Using Convolutional Neural Network**|Mayura Manawadu et.al.|[2404.07807](http://arxiv.org/abs/2404.07807)|null|
 521 | |**2024-04-18**|**Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness**|Xincan Feng et.al.|[2404.06714](http://arxiv.org/abs/2404.06714)|**[link](https://github.com/xincanfeng/vitsgpt)**|
 522 | |**2024-04-10**|**CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations**|Leying Zhang et.al.|[2404.06690](http://arxiv.org/abs/2404.06690)|null|
 523 | |**2024-04-10**|**The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge**|Yiwei Guo et.al.|[2404.06079](http://arxiv.org/abs/2404.06079)|null|
 524 | |**2024-04-07**|**Cross-Domain Audio Deepfake Detection: Dataset and Analysis**|Yuang Li et.al.|[2404.04904](http://arxiv.org/abs/2404.04904)|null|
 525 | |**2024-04-06**|**HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks**|Yingting Li et.al.|[2404.04645](http://arxiv.org/abs/2404.04645)|**[link](https://github.com/declare-lab/hypertts)**|
 526 | |**2024-04-18**|**Open vocabulary keyword spotting through transfer learning from speech synthesis**|Kesavaraj V et.al.|[2404.03914](http://arxiv.org/abs/2404.03914)|null|
 527 | |**2024-04-06**|**RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis**|Detai Xin et.al.|[2404.03204](http://arxiv.org/abs/2404.03204)|null|
 528 | |**2024-04-03**|**CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech**|Jaehyeon Kim et.al.|[2404.02781](http://arxiv.org/abs/2404.02781)|null|
 529 | |**2024-04-13**|**PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders**|Yu Pan et.al.|[2404.02702](http://arxiv.org/abs/2404.02702)|null|
 530 | |**2024-03-31**|**Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation**|Rohan Chaudhury et.al.|[2404.01339](http://arxiv.org/abs/2404.01339)|**[link](https://github.com/rohan-chaudhury/humane-speech-synthesis-through-zero-shot-emotion-and-disfluency-generation)**|
 531 | |**2024-03-28**|**A Review of Multi-Modal Large Language and Vision Models**|Kilian Carolan et.al.|[2404.01322](http://arxiv.org/abs/2404.01322)|null|
 532 | |**2024-04-09**|**KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis**|Adal Abilbekov et.al.|[2404.01033](http://arxiv.org/abs/2404.01033)|**[link](https://github.com/is2ai/kazemotts)**|
 533 | |**2024-03-31**|**CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models**|Xiang Li et.al.|[2404.00569](http://arxiv.org/abs/2404.00569)|**[link](https://github.com/xiangli2022/cm-tts)**|
 534 | |**2024-03-25**|**VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild**|Puyuan Peng et.al.|[2403.16973](http://arxiv.org/abs/2403.16973)|**[link](https://github.com/jasonppy/voicecraft)**|
 535 | |**2024-03-20**|**Isometric Neural Machine Translation using Phoneme Count Ratio Reward-based Reinforcement Learning**|Shivam Ratnakant Mhaskar et.al.|[2403.15469](http://arxiv.org/abs/2403.15469)|null|
 536 | |**2024-03-20**|**UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge**|Wataru Nakata et.al.|[2403.13720](http://arxiv.org/abs/2403.13720)|null|
 537 | |**2024-03-20**|**Building speech corpus with diverse voice characteristics for its prompt-based representation**|Aya Watanabe et.al.|[2403.13353](http://arxiv.org/abs/2403.13353)|null|
 538 | |**2024-03-17**|**Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations**|Claudio Pinhanez et.al.|[2403.11209](http://arxiv.org/abs/2403.11209)|null|
 539 | |**2024-03-17**|**EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech**|Ziqi Liang et.al.|[2403.08164](http://arxiv.org/abs/2403.08164)|null|
 540 | |**2024-03-09**|**HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling**|Chunhui Wang et.al.|[2403.05989](http://arxiv.org/abs/2403.05989)|null|
 541 | |**2024-03-05**|**AttentionStitch: How Attention Solves the Speech Editing Problem**|Antonios Alexos et.al.|[2403.04804](http://arxiv.org/abs/2403.04804)|null|
 542 | |**2024-03-07**|**Attempt Towards Stress Transfer in Speech-to-Speech Machine Translation**|Sai Akarsh et.al.|[2403.04178](http://arxiv.org/abs/2403.04178)|null|
 543 | |**2024-03-27**|**NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models**|Zeqian Ju et.al.|[2403.03100](http://arxiv.org/abs/2403.03100)|null|
 544 | |**2024-03-04**|**Brilla AI: AI Contestant for the National Science and Maths Quiz**|George Boateng et.al.|[2403.01699](http://arxiv.org/abs/2403.01699)|**[link](https://github.com/nsmq-ai/nsmqai)**|
 545 | |**2024-03-02**|**Towards Accurate Lip-to-Speech Synthesis in-the-Wild**|Sindhu Hegde et.al.|[2403.01087](http://arxiv.org/abs/2403.01087)|null|
 546 | |**2024-02-29**|**Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data**|Takaaki Saeki et.al.|[2402.18932](http://arxiv.org/abs/2402.18932)|null|
 547 | |**2024-02-26**|**An Automated End-to-End Open-Source Software for High-Quality Text-to-Speech Dataset Generation**|Ahmet Gunduz et.al.|[2402.16380](http://arxiv.org/abs/2402.16380)|**[link](https://github.com/aixplain/tts-qa)**|
 548 | |**2024-02-22**|**Efficient data selection employing Semantic Similarity-based Graph Structures for model training**|Roxana Petcu et.al.|[2402.14888](http://arxiv.org/abs/2402.14888)|null|
 549 | |**2024-02-22**|**Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition**|Rendi Chevi et.al.|[2402.14523](http://arxiv.org/abs/2402.14523)|null|
 550 | |**2024-02-19**|**On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models**|Miri Varshavsky-Hassid et.al.|[2402.12423](http://arxiv.org/abs/2402.12423)|null|
 551 | |**2024-02-19**|**Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting**|Haolin Chen et.al.|[2402.12220](http://arxiv.org/abs/2402.12220)|**[link](https://github.com/idiap/bayesian-peft)**|
 552 | |**2024-02-18**|**Ain't Misbehavin' -- Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru**|Zining Wang et.al.|[2402.11571](http://arxiv.org/abs/2402.11571)|null|
 553 | |**2024-02-14**|**MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech**|Shengpeng Ji et.al.|[2402.09378](http://arxiv.org/abs/2402.09378)|null|
 554 | |**2024-02-15**|**BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data**|Mateusz Łajszczak et.al.|[2402.08093](http://arxiv.org/abs/2402.08093)|null|
 555 | |**2024-03-04**|**Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like**|Naoyuki Kanda et.al.|[2402.07383](http://arxiv.org/abs/2402.07383)|null|
 556 | |**2024-02-09**|**A New Approach to Voice Authenticity**|Nicolas M. Müller et.al.|[2402.06304](http://arxiv.org/abs/2402.06304)|null|
 557 | |**2024-02-08**|**Unified Speech-Text Pretraining for Spoken Dialog Modeling**|Heeseung Kim et.al.|[2402.05706](http://arxiv.org/abs/2402.05706)|null|
 558 | |**2024-02-05**|**Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations**|Álvaro Martín-Cortinas et.al.|[2402.03407](http://arxiv.org/abs/2402.03407)|null|
 559 | |**2024-02-02**|**Natural language guidance of high-fidelity text-to-speech with synthetic annotations**|Dan Lyth et.al.|[2402.01912](http://arxiv.org/abs/2402.01912)|null|
 560 | |**2024-01-23**|**Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization**|Wei-Ping Huang et.al.|[2402.01692](http://arxiv.org/abs/2402.01692)|null|
 561 | |**2024-02-01**|**Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech**|Dong Yang et.al.|[2402.00288](http://arxiv.org/abs/2402.00288)|null|
 562 | |**2024-02-01**|**PAM: Prompting Audio-Language Models for Audio Quality Assessment**|Soham Deshmukh et.al.|[2402.00282](http://arxiv.org/abs/2402.00282)|**[link](https://github.com/soham97/pam)**|
 563 | |**2024-01-31**|**Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2**|Jiatong Shi et.al.|[2401.17619](http://arxiv.org/abs/2401.17619)|**[link](https://github.com/espnet/espnet)**|
 564 | |**2024-01-28**|**MunTTS: A Text-to-Speech System for Mundari**|Varun Gumma et.al.|[2401.15579](http://arxiv.org/abs/2401.15579)|null|
 565 | |**2024-01-30**|**VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech**|Chenpeng Du et.al.|[2401.14321](http://arxiv.org/abs/2401.14321)|null|
 566 | |**2024-01-25**|**Text to speech synthesis**|Harini s et.al.|[2401.13891](http://arxiv.org/abs/2401.13891)|null|
 567 | |**2024-01-25**|**SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation**|Dong Zhang et.al.|[2401.13527](http://arxiv.org/abs/2401.13527)|**[link](https://github.com/0nutation/speechgpt)**|
 568 | |**2024-01-22**|**Benchmarking Large Multimodal Models against Common Corruptions**|Jiawei Zhang et.al.|[2401.11943](http://arxiv.org/abs/2401.11943)|**[link](https://github.com/sail-sg/mmcbench)**|
 569 | |**2024-01-22**|**Adversarial speech for voice privacy protection from Personalized Speech generation**|Shihao Chen et.al.|[2401.11857](http://arxiv.org/abs/2401.11857)|null|
 570 | |**2024-02-16**|**Empowering Communication: Speech Technology for Indian and Western Accents through AI-powered Speech Synthesis**|Vinotha R et.al.|[2401.11771](http://arxiv.org/abs/2401.11771)|null|
 571 | |**2024-01-19**|**Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech**|Abhinav Garg et.al.|[2401.10465](http://arxiv.org/abs/2401.10465)|null|
 572 | |**2024-02-28**|**MLAAD: The Multi-Language Audio Anti-Spoofing Dataset**|Nicolas M. Müller et.al.|[2401.09512](http://arxiv.org/abs/2401.09512)|null|
 573 | |**2024-01-15**|**MCMChaos: Improvising Rap Music with MCMC Methods and Chaos Theory**|Robert G. Kimelman et.al.|[2401.07967](http://arxiv.org/abs/2401.07967)|null|
 574 | |**2024-01-14**|**ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering**|Yakun Song et.al.|[2401.07333](http://arxiv.org/abs/2401.07333)|null|
 575 | |**2024-01-12**|**Multi-Task Learning for Front-End Text Processing in TTS**|Wonjune Kang et.al.|[2401.06321](http://arxiv.org/abs/2401.06321)|**[link](https://github.com/facebookresearch/llama-hd-dataset)**|
 576 | |**2024-01-11**|**End to end Hindi to English speech conversion using Bark, mBART and a finetuned XLSR Wav2Vec2**|Aniket Tathe et.al.|[2401.06183](http://arxiv.org/abs/2401.06183)|null|
 577 | |**2024-01-11**|**Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection**|Lian Huang et.al.|[2401.05614](http://arxiv.org/abs/2401.05614)|null|
 578 | |**2024-01-10**|**Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters**|Kenichi Fujita et.al.|[2401.05111](http://arxiv.org/abs/2401.05111)|null|
 579 | |**2024-01-07**|**Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments**|Zhonghao Shi et.al.|[2401.03581](http://arxiv.org/abs/2401.03581)|null|
 580 | |**2024-01-07**|**Transfer the linguistic representations from TTS to accent conversion with non-parallel data**|Xi Chen et.al.|[2401.03538](http://arxiv.org/abs/2401.03538)|null|
 581 | |**2024-01-03**|**Incremental FastPitch: Chunk-based High Quality Text to Speech**|Muyang Du et.al.|[2401.01755](http://arxiv.org/abs/2401.01755)|null|
 582 | |**2024-01-03**|**Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction**|Minchan Kim et.al.|[2401.01498](http://arxiv.org/abs/2401.01498)|null|
 583 | |**2023-12-18**|**Assisting Blind People Using Object Detection with Vocal Feedback**|Heba Najm et.al.|[2401.01362](http://arxiv.org/abs/2401.01362)|null|
 584 | |**2023-12-30**|**Boosting Large Language Model for Speech Synthesis: An Empirical Study**|Hongkun Hao et.al.|[2401.00246](http://arxiv.org/abs/2401.00246)|null|
 585 | |**2024-01-01**|**Normalization of Lithuanian Text Using Regular Expressions**|Pijus Kasparaitis et.al.|[2312.17660](http://arxiv.org/abs/2312.17660)|null|
 586 | |**2023-12-27**|**AE-Flow: AutoEncoder Normalizing Flow**|Jakub Mosiński et.al.|[2312.16552](http://arxiv.org/abs/2312.16552)|null|
 587 | |**2023-12-22**|**Creating New Voices using Normalizing Flows**|Piotr Bilinski et.al.|[2312.14569](http://arxiv.org/abs/2312.14569)|null|
 588 | |**2023-12-22**|**ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations**|Cheng Gong et.al.|[2312.14398](http://arxiv.org/abs/2312.14398)|null|
 589 | |**2023-12-19**|**External Knowledge Augmented Polyphone Disambiguation Using Large Language Model**|Chen Li et.al.|[2312.11920](http://arxiv.org/abs/2312.11920)|null|
 590 | |**2023-12-17**|**A review-based study on different Text-to-Speech technologies**|Md. Jalal Uddin Chowdhury et.al.|[2312.11563](http://arxiv.org/abs/2312.11563)|null|
 591 | |**2024-01-31**|**MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis**|Wenhao Guan et.al.|[2312.10687](http://arxiv.org/abs/2312.10687)|null|
 592 | |**2024-02-22**|**Amphion: An Open-Source Audio, Music and Speech Generation Toolkit**|Xueyao Zhang et.al.|[2312.09911](http://arxiv.org/abs/2312.09911)|**[link](https://github.com/open-mmlab/amphion)**|
 593 | |**2023-12-11**|**Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism**|Georgios Milis et.al.|[2312.06613](http://arxiv.org/abs/2312.06613)|**[link](https://github.com/g-milis/NEUTART)**|
 594 | |**2023-12-08**|**An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis**|Via Nielson et.al.|[2312.05415](http://arxiv.org/abs/2312.05415)|null|
 595 | |**2023-12-06**|**Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis**|Zehua Chen et.al.|[2312.03491](http://arxiv.org/abs/2312.03491)|null|
 596 | |**2023-12-02**|**Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning**|Raviraj Joshi et.al.|[2312.01107](http://arxiv.org/abs/2312.01107)|null|
 597 | |**2023-12-02**|**Code-Mixed Text to Speech Synthesis under Low-Resource Constraints**|Raviraj Joshi et.al.|[2312.01103](http://arxiv.org/abs/2312.01103)|null|
 598 | |**2023-11-29**|**Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes**|Pavel Korshunov et.al.|[2311.17655](http://arxiv.org/abs/2311.17655)|null|
 599 | |**2024-02-06**|**Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech**|Enting Zhou et.al.|[2311.14816](http://arxiv.org/abs/2311.14816)|**[link](https://github.com/ETZET/SpeechEmotionAVLearning)**|
 600 | |**2023-12-07**|**Guided Flows for Generative Modeling and Decision Making**|Qinqing Zheng et.al.|[2311.13443](http://arxiv.org/abs/2311.13443)|null|
 601 | |**2023-11-27**|**HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis**|Sang-Hoon Lee et.al.|[2311.12454](http://arxiv.org/abs/2311.12454)|**[link](https://github.com/sh-lee-prml/hierspeechpp)**|
 602 | |**2023-11-18**|**Utilizing Speech Emotion Recognition and Recommender Systems for Negative Emotion Handling in Therapy Chatbots**|Farideh Majidi et.al.|[2311.11116](http://arxiv.org/abs/2311.11116)|null|
 603 | |**2023-11-18**|**Data Center Audio/Video Intelligence on Device (DAVID) -- An Edge-AI Platform for Smart-Toys**|Gabriel Cosache et.al.|[2311.11030](http://arxiv.org/abs/2311.11030)|null|
 604 | |**2023-11-17**|**A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness**|Mathias Vogel et.al.|[2311.10804](http://arxiv.org/abs/2311.10804)|null|
 605 | |**2023-11-16**|**Improving fairness for spoken language understanding in atypical speech with Text-to-Speech**|Helin Wang et.al.|[2311.10149](http://arxiv.org/abs/2311.10149)|**[link](https://github.com/wanghelin1997/aty-tts)**|
 606 | |**2024-02-02**|**DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation**|Jianzong Wang et.al.|[2311.07965](http://arxiv.org/abs/2311.07965)|null|
 607 | |**2023-11-12**|**ChatAnything: Facetime Chat with LLM-Enhanced Personas**|Yilin Zhao et.al.|[2311.06772](http://arxiv.org/abs/2311.06772)|null|
 608 | |**2023-11-11**|**NewsGPT: ChatGPT Integration for Robot-Reporter**|Abdelhadi Hireche et.al.|[2311.06640](http://arxiv.org/abs/2311.06640)|**[link](https://github.com/aeh1707/NewsGPT_Pepper)**|
 609 | |**2023-11-08**|**Synthetic Speaking Children -- Why We Need Them and How to Make Them**|Muhammad Ali Farooq et.al.|[2311.06307](http://arxiv.org/abs/2311.06307)|null|
 610 | |**2023-09-25**|**Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image**|Minki Kang et.al.|[2311.05844](http://arxiv.org/abs/2311.05844)|null|
 611 | |**2023-11-07**|**Improved Child Text-to-Speech Synthesis through Fastpitch-based Transfer Learning**|Rishabh Jain et.al.|[2311.04313](http://arxiv.org/abs/2311.04313)|**[link](https://github.com/c3imaging/child_tts_fastpitch)**|
 612 | |**2023-11-07**|**Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment**|Jakir Hasan et.al.|[2311.03792](http://arxiv.org/abs/2311.03792)|null|
 613 | |**2023-11-08**|**Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction**|Minchan Kim et.al.|[2311.02898](http://arxiv.org/abs/2311.02898)|null|
 614 | |**2023-11-02**|**Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations**|Hanglei Zhang et.al.|[2311.01260](http://arxiv.org/abs/2311.01260)|null|
 615 | |**2023-11-02**|**E3 TTS: Easy End-to-End Diffusion-based Text to Speech**|Yuan Gao et.al.|[2311.00945](http://arxiv.org/abs/2311.00945)|null|
 616 | |**2023-10-31**|**An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation**|Yingjie Zhou et.al.|[2310.20251](http://arxiv.org/abs/2310.20251)|**[link](https://github.com/zyj-2000/cumt_2d_photospeaker)**|
 617 | |**2023-10-27**|**Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN**|Neeraj Kumar et.al.|[2310.18169](http://arxiv.org/abs/2310.18169)|null|
 618 | |**2023-10-25**|**ArTST: Arabic Text and Speech Transformer**|Hawau Olamide Toyin et.al.|[2310.16621](http://arxiv.org/abs/2310.16621)|**[link](https://github.com/mbzuai-nlp/artst)**|
 619 | |**2023-10-25**|**Generative Pre-training for Speech with Flow Matching**|Alexander H. Liu et.al.|[2310.16338](http://arxiv.org/abs/2310.16338)|null|
 620 | |**2023-10-23**|**DPP-TTS: Diversifying prosodic features of speech via determinantal point processes**|Seongho Joo et.al.|[2310.14663](http://arxiv.org/abs/2310.14663)|null|
 621 | |**2023-10-22**|**An overview of text-to-speech systems and media applications**|Mohammad Reza Hasanabadi et.al.|[2310.14301](http://arxiv.org/abs/2310.14301)|null|
 622 | |**2023-10-14**|**Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling**|Tiberiu Boros et.al.|[2310.09636](http://arxiv.org/abs/2310.09636)|**[link](https://github.com/tiberiu44/TTS-Cube)**|
 623 | |**2023-10-14**|**Attentive Multi-Layer Perceptron for Non-autoregressive Generation**|Shuyang Jiang et.al.|[2310.09512](http://arxiv.org/abs/2310.09512)|**[link](https://github.com/shark-nlp/attentivemlp)**|
 624 | |**2023-12-22**|**Crowdsourced and Automatic Speech Prominence Estimation**|Max Morrison et.al.|[2310.08464](http://arxiv.org/abs/2310.08464)|**[link](https://github.com/reseval/reseval)**|
 625 | |**2023-10-12**|**On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition**|Nick Rossenbach et.al.|[2310.08132](http://arxiv.org/abs/2310.08132)|null|
 626 | |**2023-10-12**|**Vec-Tok Speech: speech vectorization and tokenization for neural speech generation**|Xinfa Zhu et.al.|[2310.07246](http://arxiv.org/abs/2310.07246)|**[link](https://github.com/bakerbunker/vectok)**|
 627 | |**2023-10-10**|**Prosody Analysis of Audiobooks**|Charuta Pethe et.al.|[2310.06930](http://arxiv.org/abs/2310.06930)|null|
 628 | |**2023-10-09**|**JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions**|Detai Xin et.al.|[2310.06072](http://arxiv.org/abs/2310.06072)|null|
 629 | |**2024-01-09**|**Unified speech and gesture synthesis using flow matching**|Shivam Mehta et.al.|[2310.05181](http://arxiv.org/abs/2310.05181)|null|
 630 | |**2023-10-08**|**Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset**|Ze Liu et.al.|[2310.04982](http://arxiv.org/abs/2310.04982)|null|
 631 | |**2023-10-11**|**LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT**|Jiaming Wang et.al.|[2310.04673](http://arxiv.org/abs/2310.04673)|null|
 632 | |**2024-01-22**|**Latent Filling: Latent Space Data Augmentation for Zero-shot Speech Synthesis**|Jae-Sung Bae et.al.|[2310.03538](http://arxiv.org/abs/2310.03538)|null|
 633 | |**2023-10-07**|**The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains**|Erica Cooper et.al.|[2310.02640](http://arxiv.org/abs/2310.02640)|null|
 634 | |**2023-10-02**|**Towards human-like spoken dialogue generation between AI agents from written dialogue**|Kentaro Mitsui et.al.|[2310.01088](http://arxiv.org/abs/2310.01088)|null|
 635 | |**2023-10-01**|**Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech**|Dareen Alharthi et.al.|[2310.00706](http://arxiv.org/abs/2310.00706)|null|
 636 | |**2024-03-11**|**Fewer-token Neural Speech Codec with Time-invariant Codes**|Yong Ren et.al.|[2310.00014](http://arxiv.org/abs/2310.00014)|**[link](https://github.com/y-ren16/ticodec)**|
 637 | |**2024-01-31**|**ReFlow-TTS: A Rectified Flow Model for High-fidelity Text-to-Speech**|Wenhao Guan et.al.|[2309.17056](http://arxiv.org/abs/2309.17056)|null|
 638 | |**2023-09-29**|**Low-Resource Self-Supervised Learning with SSL-Enhanced TTS**|Po-chun Hsu et.al.|[2309.17020](http://arxiv.org/abs/2309.17020)|null|
 639 | |**2023-09-29**|**Synthetic Speech Detection Based on Temporal Consistency and Distribution of Speaker Features**|Yuxiang Zhang et.al.|[2309.16954](http://arxiv.org/abs/2309.16954)|null|
 640 | |**2023-12-18**|**High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models**|Chunyu Qiang et.al.|[2309.15512](http://arxiv.org/abs/2309.15512)|null|
 641 | |**2024-01-09**|**BiSinger: Bilingual Singing Voice Synthesis**|Huali Zhou et.al.|[2309.14089](http://arxiv.org/abs/2309.14089)|**[link](https://github.com/BiSinger-SVS/BiSinger)**|
 642 | |**2023-10-07**|**HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS**|Dake Guo et.al.|[2309.13907](http://arxiv.org/abs/2309.13907)|null|
 643 | |**2023-09-24**|**VoiceLDM: Text-to-Speech with Environmental Context**|Yeonghyeon Lee et.al.|[2309.13664](http://arxiv.org/abs/2309.13664)|null|
 644 | |**2023-09-24**|**Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control**|Aya Watanabe et.al.|[2309.13509](http://arxiv.org/abs/2309.13509)|null|
 645 | |**2023-09-22**|**DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis**|Yu Gu et.al.|[2309.12792](http://arxiv.org/abs/2309.12792)|null|
 646 | |**2023-09-22**|**Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts**|Shun Lei et.al.|[2309.11977](http://arxiv.org/abs/2309.11977)|null|
 647 | |**2023-09-21**|**The Impact of Silence on Speech Anti-Spoofing**|Yuxiang Zhang et.al.|[2309.11827](http://arxiv.org/abs/2309.11827)|null|
 648 | |**2023-09-21**|**Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech**|Rui Liu et.al.|[2309.11724](http://arxiv.org/abs/2309.11724)|**[link](https://github.com/ai-s2-lab/emopp)**|
 649 | |**2023-09-20**|**Speak While You Think: Streaming Speech Synthesis During Text Generation**|Avihu Dekel et.al.|[2309.11210](http://arxiv.org/abs/2309.11210)|null|
 650 | |**2023-09-20**|**Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model**|Xinyu Zhou et.al.|[2309.11000](http://arxiv.org/abs/2309.11000)|**[link](https://github.com/XinyuZhou2000/Spoken_Dialogue)**|
 651 | |**2023-09-19**|**Exploring Speech Enhancement for Low-resource Speech Synthesis**|Zhaoheng Ni et.al.|[2309.10795](http://arxiv.org/abs/2309.10795)|null|
 652 | |**2023-09-19**|**Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition**|Ziyang Ma et.al.|[2309.10294](http://arxiv.org/abs/2309.10294)|null|
 653 | |**2023-09-17**|**Augmenting text for spoken language understanding with Large Language Models**|Roshan Sharma et.al.|[2309.09390](http://arxiv.org/abs/2309.09390)|null|
 654 | |**2023-09-16**|**FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework**|Jianzong Wang et.al.|[2309.08837](http://arxiv.org/abs/2309.08837)|null|
 655 | |**2023-09-15**|**Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech**|Dariusz Piotrowski et.al.|[2309.08255](http://arxiv.org/abs/2309.08255)|null|
 656 | |**2023-09-15**|**HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods**|Hyun-seo Shin et.al.|[2309.08208](http://arxiv.org/abs/2309.08208)|**[link](https://github.com/talkingnow/HM-Conformer)**|
 657 | |**2023-12-27**|**PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions**|Reo Shimizu et.al.|[2309.08140](http://arxiv.org/abs/2309.08140)|null|
 658 | |**2023-09-15**|**Diversity-based core-set selection for text-to-speech with linguistic and acoustic features**|Kentaro Seki et.al.|[2309.08127](http://arxiv.org/abs/2309.08127)|null|
 659 | |**2023-09-14**|**Direct Text to Speech Translation System using Acoustic Units**|Victoria Mingote et.al.|[2309.07478](http://arxiv.org/abs/2309.07478)|null|
 660 | |**2023-10-07**|**FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec**|Zhihao Du et.al.|[2309.07405](http://arxiv.org/abs/2309.07405)|**[link](https://github.com/alibaba-damo-academy/funcodec)**|
 661 | |**2023-09-13**|**DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation**|Zhichao Wu et.al.|[2309.06787](http://arxiv.org/abs/2309.06787)|null|
 662 | |**2023-09-11**|**Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP**|Jinzuomu Zhong et.al.|[2309.05423](http://arxiv.org/abs/2309.05423)|**[link](https://github.com/jzmzhong/Automatic-Prosody-Annotator-with-SSWP-CLAP)**|
 663 | |**2024-01-16**|**VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching**|Yiwei Guo et.al.|[2309.05027](http://arxiv.org/abs/2309.05027)|**[link](https://github.com/X-LANCE/VoiceFlow-TTS)**|
 664 | |**2023-09-08**|**Cross-Utterance Conditioned VAE for Speech Generation**|Yang Li et.al.|[2309.04156](http://arxiv.org/abs/2309.04156)|null|
 665 | |**2023-09-07**|**Large-Scale Automatic Audiobook Creation**|Brendan Walsh et.al.|[2309.03926](http://arxiv.org/abs/2309.03926)|null|
 666 | |**2023-09-11**|**GRASS: Unified Generation Model for Speech-to-Semantic Tasks**|Aobo Xia et.al.|[2309.02780](http://arxiv.org/abs/2309.02780)|null|
 667 | |**2023-09-12**|**MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023**|Zhihang Xu et.al.|[2309.02743](http://arxiv.org/abs/2309.02743)|null|
 668 | |**2023-10-12**|**PromptTTS 2: Describing and Generating Voices with Text Prompt**|Yichong Leng et.al.|[2309.02285](http://arxiv.org/abs/2309.02285)|null|
 669 | |**2023-09-04**|**A Comparative Analysis of Pretrained Language Models for Text-to-Speech**|Marcel Granero-Moya et.al.|[2309.01576](http://arxiv.org/abs/2309.01576)|null|
 670 | |**2023-09-02**|**DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin**|Tao Li et.al.|[2309.00883](http://arxiv.org/abs/2309.00883)|null|
 671 | |**2023-12-18**|**Learning Speech Representation From Contrastive Token-Acoustic Pretraining**|Chunyu Qiang et.al.|[2309.00424](http://arxiv.org/abs/2309.00424)|null|
 672 | |**2023-09-01**|**The FruitShell French synthesis system at the Blizzard 2023 Challenge**|Xin Qi et.al.|[2309.00223](http://arxiv.org/abs/2309.00223)|null|
 673 | |**2023-08-31**|**QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning**|Haohan Guo et.al.|[2309.00126](http://arxiv.org/abs/2309.00126)|null|
 674 | |**2024-01-23**|**SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models**|Xin Zhang et.al.|[2308.16692](http://arxiv.org/abs/2308.16692)|**[link](https://github.com/0nutation/uslm)**|
 675 | |**2023-08-31**|**Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis**|Weiqin Li et.al.|[2308.16593](http://arxiv.org/abs/2308.16593)|null|
 676 | |**2023-08-31**|**Improving Mandarin Prosodic Structure Prediction with Multi-level Contextual Information**|Jie Chen et.al.|[2308.16577](http://arxiv.org/abs/2308.16577)|null|
 677 | |**2023-08-31**|**LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech**|Jie Chen et.al.|[2308.16569](http://arxiv.org/abs/2308.16569)|null|
 678 | |**2023-08-30**|**CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis**|Yi Meng et.al.|[2308.16021](http://arxiv.org/abs/2308.16021)|null|
 679 | |**2023-09-01**|**The DeepZen Speech Synthesis System for Blizzard Challenge 2023**|Christophe Veaux et.al.|[2308.15945](http://arxiv.org/abs/2308.15945)|null|
 680 | |**2023-08-28**|**Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech**|Hyungchan Yoon et.al.|[2308.14909](http://arxiv.org/abs/2308.14909)|null|
 681 | |**2023-09-04**|**Rep2wav: Noise Robust text-to-speech Using self-supervised representations**|Qiushi Zhu et.al.|[2308.14553](http://arxiv.org/abs/2308.14553)|null|
 682 | |**2023-08-28**|**TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models**|Shengpeng Ji et.al.|[2308.14430](http://arxiv.org/abs/2308.14430)|**[link](https://github.com/jishengpeng/TextrolSpeech)**|
 683 | |**2023-09-02**|**Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder**|Xuyuan Li et.al.|[2308.13365](http://arxiv.org/abs/2308.13365)|null|
 684 | |**2023-08-24**|**Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations**|Wenbin Wang et.al.|[2308.13007](http://arxiv.org/abs/2308.13007)|null|
 685 | |**2023-09-22**|**Sparks of Large Audio Models: A Survey and Outlook**|Siddique Latif et.al.|[2308.12792](http://arxiv.org/abs/2308.12792)|null|
 686 | |**2023-10-25**|**SeamlessM4T: Massively Multilingual & Multimodal Machine Translation**|Seamless Communication et.al.|[2308.11596](http://arxiv.org/abs/2308.11596)|**[link](https://github.com/facebookresearch/seamless_communication)**|
 687 | |**2023-08-31**|**Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models**|Heyang Xue et.al.|[2308.10428](http://arxiv.org/abs/2308.10428)|null|
 688 | |**2023-08-16**|**AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis**|Hrishikesh Viswanath et.al.|[2308.08577](http://arxiv.org/abs/2308.08577)|null|
 689 | |**2023-08-14**|**SpeechX: Neural Codec Language Model as a Versatile Speech Transformer**|Xiaofei Wang et.al.|[2308.06873](http://arxiv.org/abs/2308.06873)|null|
 690 | |**2023-08-12**|**Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation**|Zhichao Wang et.al.|[2308.06457](http://arxiv.org/abs/2308.06457)|**[link](https://github.com/zhichaowang970201/text-to-video)**|
 691 | |**2023-09-09**|**AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining**|Haohe Liu et.al.|[2308.05734](http://arxiv.org/abs/2308.05734)|**[link](https://github.com/haoheliu/AudioLDM2)**|
 692 | |**2023-08-09**|**Data Player: Automatic Generation of Data Videos with Narration-Animation Interplay**|Leixian Shen et.al.|[2308.04703](http://arxiv.org/abs/2308.04703)|null|
 693 | |**2023-08-08**|**Towards an AI to Win Ghana's National Science and Maths Quiz**|George Boateng et.al.|[2308.04333](http://arxiv.org/abs/2308.04333)|**[link](https://github.com/nsmq-ai/nsmqai)**|
 694 | |**2023-08-08**|**WonderFlow: Narration-Centric Design of Animated Data Videos**|Yun Wang et.al.|[2308.04040](http://arxiv.org/abs/2308.04040)|null|
 695 | |**2023-08-04**|**Let's Give a Voice to Conversational Agents in Virtual Reality**|Michele Yin et.al.|[2308.02665](http://arxiv.org/abs/2308.02665)|**[link](https://github.com/sislab-unitn/Let-s-Give-a-Voice-to-Conversational-Agents-in-VR)**|
 696 | |**2023-08-03**|**Many-to-Many Spoken Language Translation via Unified Speech and Text Representation Learning with Unit-to-Unit Translation**|Minsu Kim et.al.|[2308.01831](http://arxiv.org/abs/2308.01831)|**[link](https://github.com/choijeongsoo/utut)**|
 697 | |**2023-08-02**|**SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis**|Ramanan Sivaguru et.al.|[2308.01018](http://arxiv.org/abs/2308.01018)|null|
 698 | |**2023-07-07**|**Artificial Eye for the Blind**|Abhinav Benagi et.al.|[2308.00801](http://arxiv.org/abs/2308.00801)|null|
 699 | |**2023-07-31**|**Multilingual context-based pronunciation learning for Text-to-Speech**|Giulia Comini et.al.|[2307.16709](http://arxiv.org/abs/2307.16709)|null|
 700 | |**2023-07-31**|**Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech**|Guangyan Zhang et.al.|[2307.16679](http://arxiv.org/abs/2307.16679)|null|
 701 | |**2023-07-31**|**Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings**|Manuel Sam Ribeiro et.al.|[2307.16643](http://arxiv.org/abs/2307.16643)|null|
 702 | |**2023-07-31**|**DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training**|Hyung-Seok Oh et.al.|[2307.16549](http://arxiv.org/abs/2307.16549)|**[link](https://github.com/hsoh0306/diffprosody)**|
 703 | |**2023-07-31**|**VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design**|Jungil Kong et.al.|[2307.16430](http://arxiv.org/abs/2307.16430)|null|
 704 | |**2023-07-30**|**Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation**|Yuanhao Chen et.al.|[2307.16199](http://arxiv.org/abs/2307.16199)|**[link](https://github.com/edward-martyr/shanghainese-tts)**|
 705 | |**2023-07-29**|**METTS: Multilingual Emotional Text-to-Speech by Cross-speaker and Cross-lingual Emotion Transfer**|Xinfa Zhu et.al.|[2307.15951](http://arxiv.org/abs/2307.15951)|null|
 706 | |**2023-12-18**|**Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding**|Chunyu Qiang et.al.|[2307.15484](http://arxiv.org/abs/2307.15484)|null|
 707 | |**2023-07-20**|**SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer**|Daegyeom Kim et.al.|[2307.10550](http://arxiv.org/abs/2307.10550)|**[link](https://github.com/0913ktg/sc_vall-e)**|
 708 | |**2023-07-18**|**SLMGAN: Exploiting Speech Language Model Representations for Unsupervised Zero-Shot Voice Conversion in GANs**|Yinghao Aaron Li et.al.|[2307.09435](http://arxiv.org/abs/2307.09435)|null|
 709 | |**2023-09-28**|**Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts**|Ziyue Jiang et.al.|[2307.07218](http://arxiv.org/abs/2307.07218)|null|
 710 | |**2023-07-13**|**Controllable Emphasis with zero data for text-to-speech**|Arnaud Joly et.al.|[2307.07062](http://arxiv.org/abs/2307.07062)|null|
 711 | |**2023-07-11**|**On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis**|Siyang Wang et.al.|[2307.05132](http://arxiv.org/abs/2307.05132)|null|
 712 | |**2023-07-10**|**The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task**|Kun Song et.al.|[2307.04630](http://arxiv.org/abs/2307.04630)|null|
 713 | |**2023-10-07**|**ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading**|Yujia Xiao et.al.|[2307.00782](http://arxiv.org/abs/2307.00782)|null|
 714 | |**2023-06-28**|**EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech**|Daria Diatlova et.al.|[2307.00024](http://arxiv.org/abs/2307.00024)|**[link](https://github.com/deepvk/emospeech)**|
 715 | |**2023-06-29**|**High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units**|Junchen Lu et.al.|[2306.17005](http://arxiv.org/abs/2306.17005)|null|
 716 | |**2023-06-28**|**UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data**|Heeseung Kim et.al.|[2306.16083](http://arxiv.org/abs/2306.16083)|**[link](https://github.com/gmltmd789/UnitSpeech)**|
 717 | |**2023-10-19**|**Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale**|Matthew Le et.al.|[2306.15687](http://arxiv.org/abs/2306.15687)|null|
 718 | |**2023-06-27**|**GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech**|Yahuan Cong et.al.|[2306.15304](http://arxiv.org/abs/2306.15304)|null|
 719 | |**2023-06-25**|**DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech**|Sen Liu et.al.|[2306.14145](http://arxiv.org/abs/2306.14145)|null|
 720 | |**2023-06-21**|**Visual-Aware Text-to-Speech**|Mohan Zhou et.al.|[2306.12020](http://arxiv.org/abs/2306.12020)|null|
 721 | |**2023-06-21**|**Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer**|Jakub Swiatkowski et.al.|[2306.11662](http://arxiv.org/abs/2306.11662)|null|
 722 | |**2023-06-16**|**Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation**|Kishor Kayyar Lakshminarayana et.al.|[2306.10152](http://arxiv.org/abs/2306.10152)|null|
 723 | |**2023-06-16**|**CML-TTS A Multilingual Dataset for Speech Synthesis in Low-Resource Languages**|Frederico S. Oliveira et.al.|[2306.10097](http://arxiv.org/abs/2306.10097)|null|
 724 | |**2023-06-14**|**Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation**|Zheng Liang et.al.|[2306.08588](http://arxiv.org/abs/2306.08588)|null|
 725 | |**2023-06-14**|**Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects**|Xinghua Qu et.al.|[2306.08219](http://arxiv.org/abs/2306.08219)|**[link](https://github.com/hyllll/vcrs)**|
 726 | |**2023-11-20**|**StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models**|Yinghao Aaron Li et.al.|[2306.07691](http://arxiv.org/abs/2306.07691)|null|
 727 | |**2024-01-18**|**UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding**|Chenpeng Du et.al.|[2306.07547](http://arxiv.org/abs/2306.07547)|null|
 728 | |**2023-06-13**|**PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling**|Ji-Sang Hwang et.al.|[2306.07489](http://arxiv.org/abs/2306.07489)|null|
 729 | |**2023-06-09**|**Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech**|Shijun Wang et.al.|[2306.05709](http://arxiv.org/abs/2306.05709)|null|
 730 | |**2023-06-08**|**VIFS: An End-to-End Variational Inference for Foley Sound Synthesis**|Junhyeok Lee et.al.|[2306.05004](http://arxiv.org/abs/2306.05004)|**[link](https://github.com/junjun3518/vifs)**|
 731 | |**2023-07-11**|**Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge**|Wenhao Guan et.al.|[2306.04301](http://arxiv.org/abs/2306.04301)|null|
 732 | |**2023-06-06**|**Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias**|Ziyue Jiang et.al.|[2306.03509](http://arxiv.org/abs/2306.03509)|null|
 733 | |**2023-08-02**|**Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis**|Zhenhui Ye et.al.|[2306.03504](http://arxiv.org/abs/2306.03504)|null|
 734 | |**2023-06-05**|**Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis**|Dengfeng Ke et.al.|[2306.02593](http://arxiv.org/abs/2306.02593)|null|
 735 | |**2023-06-05**|**Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model**|Hoyeon Lee et.al.|[2306.02579](http://arxiv.org/abs/2306.02579)|null|
 736 | |**2023-06-05**|**Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming**|Xinlei Niu et.al.|[2306.02568](http://arxiv.org/abs/2306.02568)|**[link](https://github.com/Berthaniu/LatentOptimalPathsBayesianDP)**|
 737 | |**2023-06-02**|**Towards Robust FastSpeech 2 by Modelling Residual Multimodality**|Fabian Kögel et.al.|[2306.01442](http://arxiv.org/abs/2306.01442)|**[link](https://github.com/sony/ai-research-code)**|
 738 | |**2023-05-30**|**Towards Selection of Text-to-speech Data to Augment ASR Training**|Shuo Liu et.al.|[2306.00998](http://arxiv.org/abs/2306.00998)|null|
 739 | |**2023-06-01**|**EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis**|Haobin Tang et.al.|[2306.00648](http://arxiv.org/abs/2306.00648)|null|
 740 | |**2023-06-01**|**The Effects of Input Type and Pronunciation Dictionary Usage in Transfer Learning for Low-Resource Text-to-Speech**|Phat Do et.al.|[2306.00535](http://arxiv.org/abs/2306.00535)|null|
 741 | |**2023-05-31**|**Text-to-Speech Pipeline for Swiss German -- A comparison**|Tobias Bollinger et.al.|[2305.19750](http://arxiv.org/abs/2305.19750)|null|
 742 | |**2023-05-31**|**XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech**|Linh The Nguyen et.al.|[2305.19709](http://arxiv.org/abs/2305.19709)|**[link](https://github.com/vinairesearch/xphonebert)**|
 743 | |**2023-06-01**|**PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions**|Guanghou Liu et.al.|[2305.19522](http://arxiv.org/abs/2305.19522)|null|
 744 | |**2023-05-30**|**Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages**|Phat Do et.al.|[2305.19396](http://arxiv.org/abs/2305.19396)|null|
 745 | |**2023-05-30**|**Make-A-Voice: Unified Voice Synthesis With Discrete Representation**|Rongjie Huang et.al.|[2305.19269](http://arxiv.org/abs/2305.19269)|null|
 746 | |**2023-05-30**|**STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions**|Michel Plüss et.al.|[2305.18855](http://arxiv.org/abs/2305.18855)|null|
 747 | |**2023-05-30**|**LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus**|Yuma Koizumi et.al.|[2305.18802](http://arxiv.org/abs/2305.18802)|null|
 748 | |**2023-10-09**|**An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization**|Fei Kong et.al.|[2305.18355](http://arxiv.org/abs/2305.18355)|**[link](https://github.com/kong13661/pia)**|
 749 | |**2023-05-29**|**ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation**|Ambuj Mehrish et.al.|[2305.18028](http://arxiv.org/abs/2305.18028)|**[link](https://github.com/declare-lab/adapter-mix)**|
 750 | |**2023-05-29**|**Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis**|Erik Ekstedt et.al.|[2305.17971](http://arxiv.org/abs/2305.17971)|null|
 751 | |**2023-07-25**|**StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation**|Kun Song et.al.|[2305.17732](http://arxiv.org/abs/2305.17732)|null|
 752 | |**2023-05-28**|**Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS**|Sewade Ogun et.al.|[2305.17724](http://arxiv.org/abs/2305.17724)|**[link](https://github.com/ogunlao/glowtts_stdp)**|
 753 | |**2023-07-19**|**Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing**|Julia Kaiwen Lau et.al.|[2305.17445](http://arxiv.org/abs/2305.17445)|**[link](https://github.com/julianyonghao/fainasrtest)**|
 754 | |**2023-05-26**|**DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction**|Vineet Bhat et.al.|[2305.16957](http://arxiv.org/abs/2305.16957)|null|
 755 | |**2023-05-25**|**Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion**|Rui Liu et.al.|[2305.16353](http://arxiv.org/abs/2305.16353)|**[link](https://github.com/ai-s2-lab/m2s-add)**|
 756 | |**2023-05-22**|**Text Generation with Speech Synthesis for ASR Data Augmentation**|Zhuangqun Huang et.al.|[2305.16333](http://arxiv.org/abs/2305.16333)|null|
 757 | |**2023-05-25**|**VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation**|Tianrui Wang et.al.|[2305.16107](http://arxiv.org/abs/2305.16107)|null|
 758 | |**2023-05-25**|**Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration**|Rustem Yeshpanov et.al.|[2305.15749](http://arxiv.org/abs/2305.15749)|**[link](https://github.com/is2ai/turkictts)**|
 759 | |**2024-02-05**|**LAraBench: Benchmarking Arabic AI with Large Language Models**|Ahmed Abdelali et.al.|[2305.14982](http://arxiv.org/abs/2305.14982)|null|
 760 | |**2023-05-23**|**EfficientSpeech: An On-Device Text to Speech Model**|Rowel Atienza et.al.|[2305.13905](http://arxiv.org/abs/2305.13905)|**[link](https://github.com/roatienza/efficientspeech)**|
 761 | |**2023-05-23**|**ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models**|Minki Kang et.al.|[2305.13831](http://arxiv.org/abs/2305.13831)|null|
 762 | |**2023-05-22**|**U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech**|Xin Jing et.al.|[2305.13195](http://arxiv.org/abs/2305.13195)|null|
 763 | |**2023-05-25**|**EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels**|Kari Ali Noriy et.al.|[2305.13137](http://arxiv.org/abs/2305.13137)|**[link](https://github.com/knoriy/emns-dct)**|
 764 | |**2023-05-22**|**ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer**|Huadai Liu et.al.|[2305.12708](http://arxiv.org/abs/2305.12708)|null|
 765 | |**2023-05-21**|**VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages**|Shivam Mhaskar et.al.|[2305.12518](http://arxiv.org/abs/2305.12518)|null|
 766 | |**2023-05-26**|**Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus**|Detai Xin et.al.|[2305.12442](http://arxiv.org/abs/2305.12442)|**[link](https://github.com/aria-k-alethia/laughter-synthesis)**|
 767 | |**2023-05-20**|**ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios**|Yuyue Wang et.al.|[2305.12200](http://arxiv.org/abs/2305.12200)|null|
 768 | |**2023-05-19**|**MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting**|Neil Shah et.al.|[2305.11926](http://arxiv.org/abs/2305.11926)|null|
 769 | |**2024-02-20**|**Data Redaction from Conditional Generative Models**|Zhifeng Kong et.al.|[2305.11351](http://arxiv.org/abs/2305.11351)|null|
 770 | |**2023-05-18**|**Parameter-Efficient Learning for Text-to-Speech Accent Adaptation**|Li-Jen Yang et.al.|[2305.11320](http://arxiv.org/abs/2305.11320)|**[link](https://github.com/TTS-Research/PEL-TTS)**|
 771 | |**2023-05-19**|**Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation**|Martijn Bartelds et.al.|[2305.10951](http://arxiv.org/abs/2305.10951)|**[link](https://github.com/bartelds/asr-augmentation)**|
 772 | |**2023-09-30**|**Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data**|Yusheng Tian et.al.|[2305.10891](http://arxiv.org/abs/2305.10891)|**[link](https://github.com/dmse4tts/dmse4tts)**|
 773 | |**2023-05-18**|**FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs**|Won Jang et.al.|[2305.10823](http://arxiv.org/abs/2305.10823)|null|
 774 | |**2023-05-18**|**CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training**|Zhenhui Ye et.al.|[2305.10763](http://arxiv.org/abs/2305.10763)|null|
 775 | |**2023-08-29**|**a unified front-end framework for english text-to-speech synthesis**|Zelin Ying et.al.|[2305.10666](http://arxiv.org/abs/2305.10666)|null|
 776 | |**2023-09-19**|**Controllable Speaking Styles Using a Large Language Model**|Atli Thor Sigurgeirsson et.al.|[2305.10321](http://arxiv.org/abs/2305.10321)|null|
 777 | |**2023-05-23**|**Better speech synthesis through scaling**|James Betker et.al.|[2305.07243](http://arxiv.org/abs/2305.07243)|**[link](https://github.com/neonbjb/tortoise-tts)**|
 778 | |**2023-10-29**|**CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model**|Zhen Ye et.al.|[2305.06908](http://arxiv.org/abs/2305.06908)|**[link](https://github.com/zhenye234/CoMoSpeech)**|
 779 | |**2023-05-08**|**Accented Text-to-Speech Synthesis with Limited Data**|Xuehao Zhou et.al.|[2305.04816](http://arxiv.org/abs/2305.04816)|null|
 780 | |**2023-05-03**|**M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis**|Jinlong Xue et.al.|[2305.02269](http://arxiv.org/abs/2305.02269)|null|
 781 | |**2023-05-30**|**A Review of Deep Learning Techniques for Speech Processing**|Ambuj Mehrish et.al.|[2305.00359](http://arxiv.org/abs/2305.00359)|null|
 782 | |**2023-04-26**|**Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis**|Ye-Xin Lu et.al.|[2304.13270](http://arxiv.org/abs/2304.13270)|null|
 783 | |**2023-04-25**|**Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge**|Chenpeng Du et.al.|[2304.13121](http://arxiv.org/abs/2304.13121)|null|
 784 | |**2023-04-24**|**Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model**|Kenichi Fujita et.al.|[2304.11976](http://arxiv.org/abs/2304.11976)|null|
 785 | |**2023-04-23**|**DiffVoice: Text-to-Speech with Latent Diffusion**|Zhijun Liu et.al.|[2304.11750](http://arxiv.org/abs/2304.11750)|null|
 786 | |**2023-04-23**|**SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model**|Jianzong Wang et.al.|[2304.11547](http://arxiv.org/abs/2304.11547)|null|
 787 | |**2023-05-30**|**NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers**|Kai Shen et.al.|[2304.09116](http://arxiv.org/abs/2304.09116)|null|
 788 | |**2023-04-16**|**A Virtual Simulation-Pilot Agent for Training of Air Traffic Controllers**|Juan Zuluaga-Gomez et.al.|[2304.07842](http://arxiv.org/abs/2304.07842)|null|
 789 | |**2023-04-13**|**Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis**|Shun Lei et.al.|[2304.06359](http://arxiv.org/abs/2304.06359)|null|
 790 | |**2023-04-10**|**Enhancing Speech-to-Speech Translation with Multiple TTS Targets**|Jiatong Shi et.al.|[2304.04618](http://arxiv.org/abs/2304.04618)|null|
 791 | |**2023-04-07**|**ArmanTTS single-speaker Persian dataset**|Mohammd Hasan Shamgholi et.al.|[2304.03585](http://arxiv.org/abs/2304.03585)|null|
 792 | |**2023-04-03**|**Ensemble prosody prediction for expressive speech synthesis**|Tian Huey Teh et.al.|[2304.00714](http://arxiv.org/abs/2304.00714)|null|
 793 | |**2023-03-29**|**AraSpot: Arabic Spoken Command Spotting**|Mahmoud Salhab et.al.|[2303.16621](http://arxiv.org/abs/2303.16621)|**[link](https://github.com/msalhab96/araspot)**|
 794 | |**2023-03-28**|**Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages**|Seongyeon Park et.al.|[2303.15669](http://arxiv.org/abs/2303.15669)|**[link](https://github.com/cnaigithub/SpeechDewarping)**|
 795 | |**2023-03-27**|**Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis**|Karren Yang et.al.|[2303.14885](http://arxiv.org/abs/2303.14885)|null|
 796 | |**2023-03-24**|**Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis**|Takuhiro Kaneko et.al.|[2303.13909](http://arxiv.org/abs/2303.13909)|null|
 797 | |**2023-04-02**|**A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI**|Chenshuang Zhang et.al.|[2303.13336](http://arxiv.org/abs/2303.13336)|null|
 798 | |**2023-03-20**|**Code-Switching Text Generation and Injection in Mandarin-English ASR**|Haibin Yu et.al.|[2303.10949](http://arxiv.org/abs/2303.10949)|null|
 799 | |**2023-03-14**|**Controlling High-Dimensional Data With Sparse Input**|Dan Andrei Iliescu et.al.|[2303.09446](http://arxiv.org/abs/2303.09446)|null|
 800 | |**2023-03-09**|**Text-to-ECG: 12-Lead Electrocardiogram Synthesis conditioned on Clinical Text Reports**|Hyunseung Chung et.al.|[2303.09395](http://arxiv.org/abs/2303.09395)|**[link](https://github.com/tclife/text_to_ecg)**|
 801 | |**2023-03-15**|**Cross-speaker Emotion Transfer by Manipulating Speech Style Latents**|Suhee Jo et.al.|[2303.08329](http://arxiv.org/abs/2303.08329)|null|
 802 | |**2023-03-14**|**QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis**|Haobin Tang et.al.|[2303.07682](http://arxiv.org/abs/2303.07682)|null|
 803 | |**2023-03-10**|**An End-to-End Neural Network for Image-to-Audio Transformation**|Liu Chen et.al.|[2303.06078](http://arxiv.org/abs/2303.06078)|null|
 804 | |**2023-03-09**|**Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation**|Qi Chen et.al.|[2303.05322](http://arxiv.org/abs/2303.05322)|**[link](https://github.com/moon0316/t2a)**|
 805 | |**2023-03-07**|**Do Prosody Transfer Models Transfer Prosody?**|Atli Thor Sigurgeirsson et.al.|[2303.04289](http://arxiv.org/abs/2303.04289)|null|
 806 | |**2023-03-07**|**Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling**|Ziqiang Zhang et.al.|[2303.03926](http://arxiv.org/abs/2303.03926)|null|
 807 | |**2023-03-02**|**Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding**|Yingting Li et.al.|[2303.03267](http://arxiv.org/abs/2303.03267)|**[link](https://github.com/declare-lab/speech-adapters)**|
 808 | |**2023-03-08**|**FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model**|Ruiqing Xue et.al.|[2303.02939](http://arxiv.org/abs/2303.02939)|null|
 809 | |**2023-08-14**|**Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations**|Yuma Koizumi et.al.|[2303.01664](http://arxiv.org/abs/2303.01664)|null|
 810 | |**2023-03-11**|**Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities**|Shijun Wang et.al.|[2303.01508](http://arxiv.org/abs/2303.01508)|null|
 811 | |**2023-12-17**|**ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations**|Neil Shah et.al.|[2303.01261](http://arxiv.org/abs/2303.01261)|null|
 812 | |**2023-03-02**|**LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion**|Chunfeng Wang et.al.|[2303.01086](http://arxiv.org/abs/2303.01086)|null|
 813 | |**2023-03-02**|**Leveraging Large Text Corpora for End-to-End Speech Summarization**|Kohei Matsuura et.al.|[2303.00978](http://arxiv.org/abs/2303.00978)|null|
 814 | |**2023-03-01**|**DTW-SiameseNet: Dynamic Time Warped Siamese Network for Mispronunciation Detection and Correction**|Raviteja Anantha et.al.|[2303.00171](http://arxiv.org/abs/2303.00171)|null|
 815 | |**2023-02-28**|**ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus**|Ajinkya Kulkarni et.al.|[2303.00069](http://arxiv.org/abs/2303.00069)|null|
 816 | |**2023-02-28**|**Automatic Heteronym Resolution Pipeline Using RAD-TTS Aligners**|Jocelyn Huang et.al.|[2302.14523](http://arxiv.org/abs/2302.14523)|null|
 817 | |**2023-06-12**|**CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis**|Ji-Hoon Kim et.al.|[2302.14370](http://arxiv.org/abs/2302.14370)|null|
 818 | |**2023-05-19**|**UniFLG: Unified Facial Landmark Generator from Text or Speech**|Kentaro Mitsui et.al.|[2302.14337](http://arxiv.org/abs/2302.14337)|null|
 819 | |**2023-02-27**|**Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech**|Jiyoung Lee et.al.|[2302.13700](http://arxiv.org/abs/2302.13700)|**[link](https://github.com/naver-ai/facetts)**|
 820 | |**2023-02-27**|**Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech**|Dong Yang et.al.|[2302.13652](http://arxiv.org/abs/2302.13652)|null|
 821 | |**2023-02-27**|**Varianceflow: High-Quality and Controllable Text-to-Speech using Variance Information via Normalizing Flow**|Yoonhyung Lee et.al.|[2302.13458](http://arxiv.org/abs/2302.13458)|null|
 822 | |**2023-06-06**|**PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS**|Junhyeok Lee et.al.|[2302.12391](http://arxiv.org/abs/2302.12391)|**[link](https://github.com/anonymous-pits/pits)**|
 823 | |**2023-02-21**|**Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition**|Leyuan Qu et.al.|[2302.09723](http://arxiv.org/abs/2302.09723)|null|
 824 | |**2023-02-23**|**QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion**|Houjian Guo et.al.|[2302.08296](http://arxiv.org/abs/2302.08296)|**[link](https://github.com/quickvc/quickvoice-conversion)**|
 825 | |**2023-02-13**|**Fast and small footprint Hybrid HMM-HiFiGAN based system for speech synthesis in Indian languages**|Sudhanshu Srivastava et.al.|[2302.06227](http://arxiv.org/abs/2302.06227)|null|
 826 | |**2023-02-08**|**A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech**|Li-Wei Chen et.al.|[2302.04215](http://arxiv.org/abs/2302.04215)|**[link](https://github.com/b04901014/mqtts)**|
 827 | |**2023-02-07**|**Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision**|Eugene Kharitonov et.al.|[2302.03540](http://arxiv.org/abs/2302.03540)|null|
 828 | |**2023-02-15**|**MAC: A unified framework boosting low resource automatic speech recognition**|Zeping Min et.al.|[2302.03498](http://arxiv.org/abs/2302.03498)|null|
 829 | |**2023-06-25**|**InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt**|Dongchao Yang et.al.|[2301.13662](http://arxiv.org/abs/2301.13662)|**[link](https://github.com/yangdongchao/academicodec)**|
 830 | |**2023-03-01**|**UzbekTagger: The rule-based POS tagger for Uzbek language**|Maksud Sharipov et.al.|[2301.12711](http://arxiv.org/abs/2301.12711)|null|
 831 | |**2023-05-27**|**Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining**|Takaaki Saeki et.al.|[2301.12596](http://arxiv.org/abs/2301.12596)|**[link](https://github.com/takaaki-saeki/zm-text-tts)**|
 832 | |**2023-01-31**|**Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker**|Navjot Kaur et.al.|[2301.12331](http://arxiv.org/abs/2301.12331)|**[link](https://github.com/chocobearz/speech_timing)**|
 833 | |**2023-01-26**|**On granularity of prosodic representations in expressive text-to-speech**|Mikolaj Babianski et.al.|[2301.11446](http://arxiv.org/abs/2301.11446)|null|
 834 | |**2023-01-26**|**Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study**|Massa Baali et.al.|[2301.09099](http://arxiv.org/abs/2301.09099)|**[link](https://github.com/espnet/espnet/tree/master/egs2/qasr_tts/tts1)**|
 835 | |**2023-01-20**|**Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions**|Yinghao Aaron Li et.al.|[2301.08810](http://arxiv.org/abs/2301.08810)|null|
 836 | |**2023-01-11**|**Modelling low-resource accents without accent-specific TTS frontend**|Georgi Tinchev et.al.|[2301.04606](http://arxiv.org/abs/2301.04606)|null|
 837 | |**2022-12-11**|**BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm**|Yu-Wen Chen et.al.|[2301.04120](http://arxiv.org/abs/2301.04120)|**[link](https://github.com/yuwchen/baspro)**|
 838 | |**2023-01-10**|**UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion**|Haogeng Liu et.al.|[2301.03801](http://arxiv.org/abs/2301.03801)|null|
 839 | |**2023-01-10**|**Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation**|Abdullah Shahid et.al.|[2301.03751](http://arxiv.org/abs/2301.03751)|null|
 840 | |**2023-09-19**|**Applying Automated Machine Translation to Educational Video Courses**|Linden Wang et.al.|[2301.03141](http://arxiv.org/abs/2301.03141)|null|
 841 | |**2023-01-06**|**Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition**|David M. Chan et.al.|[2301.02736](http://arxiv.org/abs/2301.02736)|null|
 842 | |**2023-01-05**|**Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers**|Chengyi Wang et.al.|[2301.02111](http://arxiv.org/abs/2301.02111)|**[link](https://github.com/microsoft/unilm)**|
 843 | |**2022-12-11**|**MnTTS2: An Open-Source Multi-Speaker Mongolian Text-to-Speech Synthesis Dataset**|Kailin Liang et.al.|[2301.00657](http://arxiv.org/abs/2301.00657)|**[link](https://github.com/ssmlkl/mntts2)**|
 844 | |**2022-12-30**|**ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech**|Zehua Chen et.al.|[2212.14518](http://arxiv.org/abs/2212.14518)|null|
 845 | |**2022-12-29**|**StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models**|Yinghao Aaron Li et.al.|[2212.14227](http://arxiv.org/abs/2212.14227)|**[link](https://github.com/yl4579/StyleTTS-VC)**|
 846 | |**2022-12-22**|**HMM-based data augmentation for E2E systems for building conversational speech synthesis systems**|Ishika Gupta et.al.|[2212.11982](http://arxiv.org/abs/2212.11982)|null|
 847 | |**2022-12-21**|**ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement**|Wei-Ning Hsu et.al.|[2212.11377](http://arxiv.org/abs/2212.11377)|null|
 848 | |**2022-12-20**|**TTS-Guided Training for Accent Conversion Without Parallel Data**|Yi Zhou et.al.|[2212.10204](http://arxiv.org/abs/2212.10204)|null|
 849 | |**2023-06-28**|**Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling**|Tuomo Raitio et.al.|[2212.10075](http://arxiv.org/abs/2212.10075)|null|
 850 | |**2022-12-16**|**Speech Aware Dialog System Technology Challenge (DSTC11)**|Hagen Soltau et.al.|[2212.08704](http://arxiv.org/abs/2212.08704)|null|
 851 | |**2022-12-16**|**Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder**|Yusuke Yasuda et.al.|[2212.08329](http://arxiv.org/abs/2212.08329)|null|
 852 | |**2022-12-16**|**Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language**|Yusuke Yasuda et.al.|[2212.08321](http://arxiv.org/abs/2212.08321)|null|
 853 | |**2022-12-15**|**RWEN-TTS: Relation-aware Word Encoding Network for Natural Text-to-Speech Synthesis**|Shinhyeok Oh et.al.|[2212.07939](http://arxiv.org/abs/2212.07939)|**[link](https://github.com/shinhyeokoh/rwen)**|
 854 | |**2022-12-14**|**Probing Deep Speaker Embeddings for Speaker-related Tasks**|Zifeng Zhao et.al.|[2212.07068](http://arxiv.org/abs/2212.07068)|null|
 855 | |**2022-12-08**|**SpeechLMScore: Evaluating speech generation using speech language model**|Soumi Maiti et.al.|[2212.04559](http://arxiv.org/abs/2212.04559)|**[link](https://github.com/espnet/espnet)**|
 856 | |**2023-04-04**|**Learning to Dub Movies via Hierarchical Prosody Models**|Gaoxiang Cong et.al.|[2212.04054](http://arxiv.org/abs/2212.04054)|**[link](https://github.com/galaxycong/hpmdubbing)**|
 857 | |**2022-12-07**|**Low-Resource End-to-end Sanskrit TTS using Tacotron2, WaveGlow and Transfer Learning**|Ankur Debnath et.al.|[2212.03558](http://arxiv.org/abs/2212.03558)|null|
 858 | |**2022-12-07**|**Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue**|Daxin Tan et.al.|[2212.03398](http://arxiv.org/abs/2212.03398)|null|
 859 | |**2022-12-06**|**UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis**|Yi Lei et.al.|[2212.01546](http://arxiv.org/abs/2212.01546)|null|
 860 | |**2022-11-30**|**SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech**|Byoung Jin Choi et.al.|[2211.16866](http://arxiv.org/abs/2211.16866)|null|
 861 | |**2022-11-29**|**Controllable speech synthesis by learning discrete phoneme-level prosodic representations**|Nikolaos Ellinas et.al.|[2211.16307](http://arxiv.org/abs/2211.16307)|null|
 862 | |**2023-05-25**|**Evaluating and reducing the distance between synthetic and real speech distributions**|Christoph Minixhofer et.al.|[2211.16049](http://arxiv.org/abs/2211.16049)|null|
 863 | |**2022-11-26**|**Contextual Expressive Text-to-Speech**|Jianhong Tu et.al.|[2211.14548](http://arxiv.org/abs/2211.14548)|null|
 864 | |**2022-12-05**|**Efficient Incremental Text-to-Speech on GPUs**|Muyang Du et.al.|[2211.13939](http://arxiv.org/abs/2211.13939)|null|
 865 | |**2023-03-21**|**Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?**|Xuan Shi et.al.|[2211.13868](http://arxiv.org/abs/2211.13868)|**[link](https://github.com/nii-yamagishilab/midi-to-audio)**|
 866 | |**2022-11-23**|**IMaSC -- ICFOSS Malayalam Speech Corpus**|Deepa P Gopinath et.al.|[2211.12796](http://arxiv.org/abs/2211.12796)|null|
 867 | |**2022-11-22**|**PromptTTS: Controllable Text-to-Speech with Text Descriptions**|Zhifang Guo et.al.|[2211.12171](http://arxiv.org/abs/2211.12171)|null|
 868 | |**2022-11-04**|**Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech**|Xin Zhang et.al.|[2211.09731](http://arxiv.org/abs/2211.09731)|null|
 869 | |**2023-02-17**|**Towards Building Text-To-Speech Systems for the Next Billion Users**|Gokul Karthik Kumar et.al.|[2211.09536](http://arxiv.org/abs/2211.09536)|**[link](https://github.com/gokulkarthik/text2speech)**|
 870 | |**2023-02-16**|**EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance**|Yiwei Guo et.al.|[2211.09496](http://arxiv.org/abs/2211.09496)|null|
 871 | |**2022-11-17**|**Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation**|Chunyu Qiang et.al.|[2211.09495](http://arxiv.org/abs/2211.09495)|null|
 872 | |**2022-11-17**|**NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis**|Hyeong-Seok Choi et.al.|[2211.09407](http://arxiv.org/abs/2211.09407)|null|
 873 | |**2023-03-14**|**Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models**|Minki Kang et.al.|[2211.09383](http://arxiv.org/abs/2211.09383)|null|
 874 | |**2023-01-04**|**Low-Resource Mongolian Speech Synthesis Based on Automatic Prosody Annotation**|Xin Yuan et.al.|[2211.09365](http://arxiv.org/abs/2211.09365)|null|
 875 | |**2022-11-14**|**SNIPER Training: Variable Sparsity Rate Training For Text-To-Speech**|Perry Lam et.al.|[2211.07283](http://arxiv.org/abs/2211.07283)|null|
 876 | |**2023-05-24**|**Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing**|Jacob J Webber et.al.|[2211.06989](http://arxiv.org/abs/2211.06989)|null|
 877 | |**2023-05-29**|**OverFlow: Putting flows on top of neural transducers for better TTS**|Shivam Mehta et.al.|[2211.06892](http://arxiv.org/abs/2211.06892)|**[link](https://github.com/coqui-ai/TTS)**|
 878 | |**2023-05-29**|**Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations**|Yoori Oh et.al.|[2211.06160](http://arxiv.org/abs/2211.06160)|null|
 879 | |**2022-12-04**|**ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech**|Xiaoran Fan et.al.|[2211.03545](http://arxiv.org/abs/2211.03545)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**|
 880 | |**2022-11-07**|**Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder**|Jan Melechovsky et.al.|[2211.03316](http://arxiv.org/abs/2211.03316)|**[link](https://github.com/dapwner/cvae-tacotron)**|
 881 | |**2022-11-06**|**Parallel Attention Forcing for Machine Translation**|Qingyun Dou et.al.|[2211.03237](http://arxiv.org/abs/2211.03237)|null|
 882 | |**2022-11-06**|**An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space**|Jihwan Lee et.al.|[2211.03078](http://arxiv.org/abs/2211.03078)|null|
 883 | |**2022-11-04**|**NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS**|Dongchao Yang et.al.|[2211.02448](http://arxiv.org/abs/2211.02448)|null|
 884 | |**2022-11-04**|**Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts**|Detai Xin et.al.|[2211.02336](http://arxiv.org/abs/2211.02336)|null|
 885 | |**2023-04-16**|**Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS**|Ziqi Liang et.al.|[2211.01948](http://arxiv.org/abs/2211.01948)|null|
 886 | |**2022-11-01**|**Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages**|Anusha Prakash et.al.|[2211.01338](http://arxiv.org/abs/2211.01338)|null|
 887 | |**2023-05-28**|**DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP**|Kun Song et.al.|[2211.01087](http://arxiv.org/abs/2211.01087)|null|
 888 | |**2022-11-22**|**Multi-Speaker Multi-Style Speech Synthesis with Timbre and Style Disentanglement**|Wei Song et.al.|[2211.00967](http://arxiv.org/abs/2211.00967)|null|
 889 | |**2022-11-01**|**Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers**|Cheng-Ping Hsieh et.al.|[2211.00585](http://arxiv.org/abs/2211.00585)|**[link](https://github.com/NVIDIA/NeMo)**|
 890 | |**2023-06-11**|**Generating Multilingual Gender-Ambiguous Text-to-Speech Voices**|Konstantinos Markopoulos et.al.|[2211.00375](http://arxiv.org/abs/2211.00375)|null|
 891 | |**2023-05-07**|**Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features**|Alexandra Vioni et.al.|[2211.00342](http://arxiv.org/abs/2211.00342)|null|
 892 | |**2022-11-02**|**Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS**|Kun Song et.al.|[2210.17349](http://arxiv.org/abs/2210.17349)|null|
 893 | |**2024-02-27**|**Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation**|Nikolaos Ellinas et.al.|[2210.17264](http://arxiv.org/abs/2210.17264)|null|
 894 | |**2022-10-31**|**Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection**|Luigi Attorresi et.al.|[2210.17222](http://arxiv.org/abs/2210.17222)|null|
 895 | |**2022-10-31**|**Structured State Space Decoder for Speech Recognition and Synthesis**|Koichi Miyazaki et.al.|[2210.17098](http://arxiv.org/abs/2210.17098)|null|
 896 | |**2022-10-28**|**Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders**|Jason Fong et.al.|[2210.16045](http://arxiv.org/abs/2210.16045)|null|
 897 | |**2023-02-21**|**Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform**|Masaya Kawamura et.al.|[2210.15975](http://arxiv.org/abs/2210.15975)|**[link](https://github.com/masayakawamura/mb-istft-vits)**|
 898 | |**2023-02-22**|**Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis**|Yuma Shirahata et.al.|[2210.15964](http://arxiv.org/abs/2210.15964)|null|
 899 | |**2022-10-28**|**Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation**|Nobuyuki Morioka et.al.|[2210.15868](http://arxiv.org/abs/2210.15868)|null|
 900 | |**2023-03-15**|**Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech**|Takaaki Saeki et.al.|[2210.15447](http://arxiv.org/abs/2210.15447)|null|
 901 | |**2022-10-27**|**Explicit Intensity Control for Accented Text-to-speech**|Rui Liu et.al.|[2210.15364](http://arxiv.org/abs/2210.15364)|null|
 902 | |**2022-10-27**|**FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis**|Yifan Hu et.al.|[2210.15360](http://arxiv.org/abs/2210.15360)|**[link](https://github.com/walker-hyf/fctalker)**|
 903 | |**2022-10-26**|**Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection**|Kentaro Seki et.al.|[2210.14850](http://arxiv.org/abs/2210.14850)|null|
 904 | |**2022-10-25**|**Semi-Supervised Learning Based on Reference Model for Low-resource TTS**|Xulong Zhang et.al.|[2210.14723](http://arxiv.org/abs/2210.14723)|null|
 905 | |**2022-10-26**|**Cover Reproducible Steganography via Deep Generative Models**|Kejiang Chen et.al.|[2210.14632](http://arxiv.org/abs/2210.14632)|null|
 906 | |**2022-10-26**|**Improving Speech-to-Speech Translation Through Unlabeled Text**|Xuan-Phi Nguyen et.al.|[2210.14514](http://arxiv.org/abs/2210.14514)|null|
 907 | |**2022-10-26**|**The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge**|Yuhao Liang et.al.|[2210.14448](http://arxiv.org/abs/2210.14448)|null|
 908 | |**2022-10-25**|**Adapitch: Adaption Multi-Speaker Text-to-Speech Conditioned on Pitch Disentangling with Untranscribed Data**|Xulong Zhang et.al.|[2210.13803](http://arxiv.org/abs/2210.13803)|null|
 909 | |**2023-09-17**|**HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation**|Chunhui Wang et.al.|[2210.12740](http://arxiv.org/abs/2210.12740)|null|
 910 | |**2022-10-21**|**Low-Resource Multilingual and Zero-Shot Multispeaker TTS**|Florian Lux et.al.|[2210.12223](http://arxiv.org/abs/2210.12223)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
 911 | |**2022-10-21**|**Adaptive re-calibration of channel-wise features for Adversarial Audio Classification**|Vardhan Dongre et.al.|[2210.11722](http://arxiv.org/abs/2210.11722)|null|
 912 | |**2022-10-20**|**Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS**|Chunyu Qiang et.al.|[2210.11429](http://arxiv.org/abs/2210.11429)|null|
 913 | |**2022-10-17**|**Towards Relation Extraction From Speech**|Tongtong Wu et.al.|[2210.08759](http://arxiv.org/abs/2210.08759)|**[link](https://github.com/wutong8023/speechre)**|
 914 | |**2023-02-08**|**Generating Synthetic Speech from SpokenVocab for Speech Translation**|Jinming Zhao et.al.|[2210.08174](http://arxiv.org/abs/2210.08174)|**[link](https://github.com/mingzi151/spokenvocab)**|
 915 | |**2022-10-17**|**LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge**|Yan Jia et.al.|[2210.07749](http://arxiv.org/abs/2210.07749)|null|
 916 | |**2022-10-20**|**Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy**|Sarina Meyer et.al.|[2210.07002](http://arxiv.org/abs/2210.07002)|**[link](https://github.com/digitalphonetics/speaker-anonymization)**|
 917 | |**2022-10-13**|**Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar**|Aolan Sun et.al.|[2210.06877](http://arxiv.org/abs/2210.06877)|null|
 918 | |**2022-10-12**|**Can we use Common Voice to train a Multi-Speaker TTS system?**|Sewade Ogun et.al.|[2210.06370](http://arxiv.org/abs/2210.06370)|null|
 919 | |**2023-06-01**|**SQuId: Measuring Speech Naturalness in Many Languages**|Thibault Sellam et.al.|[2210.06324](http://arxiv.org/abs/2210.06324)|null|
 920 | |**2022-11-22**|**Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech**|Byoung Jin Choi et.al.|[2210.05979](http://arxiv.org/abs/2210.05979)|null|
 921 | |**2022-10-06**|**An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era**|Andreas Triantafyllopoulos et.al.|[2210.03538](http://arxiv.org/abs/2210.03538)|null|
 922 | |**2022-09-29**|**Facial Landmark Predictions with Applications to Metaverse**|Qiao Han et.al.|[2209.14698](http://arxiv.org/abs/2209.14698)|**[link](https://github.com/sweatybridge/text-to-anime)**|
 923 | |**2022-09-26**|**Multi-Task Adversarial Training Algorithm for Multi-Speaker Neural Text-to-Speech**|Yusuke Nakai et.al.|[2209.12549](http://arxiv.org/abs/2209.12549)|null|
 924 | |**2022-09-22**|**EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models**|Perry Lam et.al.|[2209.10890](http://arxiv.org/abs/2209.10890)|null|
 925 | |**2022-09-22**|**MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline**|Yifan Hu et.al.|[2209.10848](http://arxiv.org/abs/2209.10848)|**[link](https://github.com/walker-hyf/mntts)**|
 926 | |**2022-09-22**|**Controllable Accented Text-to-Speech Synthesis**|Rui Liu et.al.|[2209.10804](http://arxiv.org/abs/2209.10804)|null|
 927 | |**2022-09-16**|**TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection**|Davide Salvi et.al.|[2209.08000](http://arxiv.org/abs/2209.08000)|null|
 928 | |**2022-09-14**|**Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset**|Michael Chinen et.al.|[2209.06358](http://arxiv.org/abs/2209.06358)|null|
 929 | |**2022-09-08**|**SANIP: Shopping Assistant and Navigation for the visually impaired**|Shubham Deshmukh et.al.|[2209.03570](http://arxiv.org/abs/2209.03570)|null|
 930 | |**2022-09-07**|**Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech**|Huu-Tien Dang et.al.|[2209.02971](http://arxiv.org/abs/2209.02971)|null|
 931 | |**2022-09-02**|**Improving Contextual Recognition of Rare Words with an Alternate Spelling Prediction Model**|Jennifer Drexler Fox et.al.|[2209.01250](http://arxiv.org/abs/2209.01250)|null|
 932 | |**2022-08-28**|**Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks**|Lev Finkelstein et.al.|[2208.13183](http://arxiv.org/abs/2208.13183)|null|
 933 | |**2022-10-04**|**Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale**|Aditya Agarwal et.al.|[2208.09796](http://arxiv.org/abs/2208.09796)|null|
 934 | |**2022-08-21**|**Visualising Model Training via Vowel Space for Text-To-Speech Systems**|Binu Abeysinghe et.al.|[2208.09775](http://arxiv.org/abs/2208.09775)|**[link](https://github.com/babe269/performant)**|
 935 | |**2022-08-15**|**Towards Parametric Speech Synthesis Using Gaussian-Markov Model of Spectral Envelope and Wavelet-Based Decomposition of F0**|Mohammed Salah Al-Radhi et.al.|[2208.07122](http://arxiv.org/abs/2208.07122)|null|
 936 | |**2022-12-28**|**Speech Synthesis with Mixed Emotions**|Kun Zhou et.al.|[2208.05890](http://arxiv.org/abs/2208.05890)|null|
 937 | |**2022-08-03**|**A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis**|Qibing Bai et.al.|[2208.02189](http://arxiv.org/abs/2208.02189)|null|
 938 | |**2022-07-29**|**Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation**|Giulia Comini et.al.|[2207.14607](http://arxiv.org/abs/2207.14607)|null|
 939 | |**2022-07-25**|**Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis**|Raul Fernandez et.al.|[2207.12262](http://arxiv.org/abs/2207.12262)|null|
 940 | |**2022-07-01**|**A Polyphone BERT for Polyphone Disambiguation in Mandarin Chinese**|Song Zhang et.al.|[2207.12089](http://arxiv.org/abs/2207.12089)|null|
 941 | |**2022-07-20**|**When Is TTS Augmentation Through a Pivot Language Useful?**|Nathaniel Robinson et.al.|[2207.09889](http://arxiv.org/abs/2207.09889)|**[link](https://github.com/n8rob/multilingual_tts_augmentation)**|
 942 | |**2022-07-11**|**LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech**|Harshvardhan Anand et.al.|[2207.07118](http://arxiv.org/abs/2207.07118)|null|
 943 | |**2022-07-13**|**ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech**|Rongjie Huang et.al.|[2207.06389](http://arxiv.org/abs/2207.06389)|**[link](https://github.com/Rongjiehuang/ProDiff)**|
 944 | |**2022-07-13**|**Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech**|Zhengxi Liu et.al.|[2207.06088](http://arxiv.org/abs/2207.06088)|null|
 945 | |**2022-07-13**|**SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate**|Nabarun Goswami et.al.|[2207.06011](http://arxiv.org/abs/2207.06011)|null|
 946 | |**2022-07-13**|**Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS**|Yookyung Shin et.al.|[2207.06000](http://arxiv.org/abs/2207.06000)|null|
 947 | |**2022-07-13**|**A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System**|Yi-Chiao Wu et.al.|[2207.05913](http://arxiv.org/abs/2207.05913)|null|
 948 | |**2022-07-12**|**Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition**|Rodolfo Zevallos et.al.|[2207.05498](http://arxiv.org/abs/2207.05498)|null|
 949 | |**2022-07-12**|**End-to-end speech recognition modeling from de-identified data**|Martin Flechl et.al.|[2207.05469](http://arxiv.org/abs/2207.05469)|null|
 950 | |**2022-07-11**|**Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data**|Naoki Makishima et.al.|[2207.04659](http://arxiv.org/abs/2207.04659)|null|
 951 | |**2022-07-11**|**DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders**|Yanqing Liu et.al.|[2207.04646](http://arxiv.org/abs/2207.04646)|null|
 952 | |**2023-01-02**|**Dreamento: an open-source dream engineering toolbox for sleep EEG wearables**|Mahdad Jafarzadeh Esfahani et.al.|[2207.03977](http://arxiv.org/abs/2207.03977)|**[link](https://github.com/dreamento/dreamento)**|
 953 | |**2022-07-07**|**BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus**|Josh Meyer et.al.|[2207.03546](http://arxiv.org/abs/2207.03546)|**[link](https://github.com/alpoktem/bible2speechdb)**|
 954 | |**2022-07-05**|**Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion**|Yi Lei et.al.|[2207.01832](http://arxiv.org/abs/2207.01832)|null|
 955 | |**2022-07-04**|**BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model**|Brooke Stephenson et.al.|[2207.01718](http://arxiv.org/abs/2207.01718)|null|
 956 | |**2022-07-04**|**Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)**|Ariadna Sanchez et.al.|[2207.01547](http://arxiv.org/abs/2207.01547)|null|
 957 | |**2022-07-04**|**Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)**|Ziyao Zhang et.al.|[2207.01507](http://arxiv.org/abs/2207.01507)|null|
 958 | |**2023-03-13**|**DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech**|Keon Lee et.al.|[2207.01063](http://arxiv.org/abs/2207.01063)|**[link](https://github.com/keonlee9420/DailyTalk)**|
 959 | |**2022-07-02**|**Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need**|Daniel Korzekwa et.al.|[2207.00774](http://arxiv.org/abs/2207.00774)|null|
 960 | |**2022-07-01**|**Building African Voices**|Perez Ogayo et.al.|[2207.00688](http://arxiv.org/abs/2207.00688)|**[link](https://github.com/neulab/africanvoices)**|
 961 | |**2022-07-01**|**Automatic Evaluation of Speaker Similarity**|Deja Kamil et.al.|[2207.00344](http://arxiv.org/abs/2207.00344)|null|
 962 | |**2022-08-03**|**Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding**|Wei-Ping Huang et.al.|[2206.15427](http://arxiv.org/abs/2206.15427)|null|
 963 | |**2022-06-30**|**R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS**|Kyle Kastner et.al.|[2206.15276](http://arxiv.org/abs/2206.15276)|null|
 964 | |**2022-07-01**|**Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems**|Hyun-Wook Yoon et.al.|[2206.15067](http://arxiv.org/abs/2206.15067)|null|
 965 | |**2022-06-30**|**TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder**|Eunwoo Song et.al.|[2206.14984](http://arxiv.org/abs/2206.14984)|null|
 966 | |**2022-06-29**|**Improving Deliberation by Text-Only and Semi-Supervised Training**|Ke Hu et.al.|[2206.14716](http://arxiv.org/abs/2206.14716)|null|
 967 | |**2022-06-29**|**Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody**|Peter Makarov et.al.|[2206.14643](http://arxiv.org/abs/2206.14643)|null|
 968 | |**2022-06-28**|**Expressive, Variable, and Controllable Duration Modelling in TTS**|Ammar Abbas et.al.|[2206.14165](http://arxiv.org/abs/2206.14165)|null|
 969 | |**2022-06-28**|**Comparison of Speech Representations for the MOS Prediction System**|Aki Kunikoshi et.al.|[2206.13817](http://arxiv.org/abs/2206.13817)|null|
 970 | |**2022-06-22**|**A Simple Baseline for Domain Adaptation in End to End ASR Systems Using Synthetic Data**|Raviraj Joshi et.al.|[2206.13240](http://arxiv.org/abs/2206.13240)|null|
 971 | |**2022-06-25**|**Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations**|Chin-Cheng Hsu et.al.|[2206.12662](http://arxiv.org/abs/2206.12662)|null|
 972 | |**2022-10-21**|**Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech**|Florian Lux et.al.|[2206.12229](http://arxiv.org/abs/2206.12229)|**[link](https://github.com/digitalphonetics/ims-toucan)**|
 973 | |**2022-06-24**|**SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech**|Hyunjae Cho et.al.|[2206.12132](http://arxiv.org/abs/2206.12132)|null|
 974 | |**2022-06-24**|**End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue**|Kentaro Mitsui et.al.|[2206.12040](http://arxiv.org/abs/2206.12040)|null|
 975 | |**2022-05-29**|**Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning**|Sameea Naeem et.al.|[2206.11860](http://arxiv.org/abs/2206.11860)|null|
 976 | |**2022-06-21**|**Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS**|Kenta Udagawa et.al.|[2206.10256](http://arxiv.org/abs/2206.10256)|null|
 977 | |**2022-06-24**|**Towards Optimizing OCR for Accessibility**|Peya Mowar et.al.|[2206.10254](http://arxiv.org/abs/2206.10254)|null|
 978 | |**2022-06-16**|**Automatic Prosody Annotation with Pre-Trained Text-Speech Model**|Ziqian Dai et.al.|[2206.07956](http://arxiv.org/abs/2206.07956)|**[link](https://github.com/daisyqk/automatic-prosody-annotation)**|
 979 | |**2022-11-16**|**NatiQ: An End-to-end Text-to-Speech System for Arabic**|Ahmed Abdelali et.al.|[2206.07373](http://arxiv.org/abs/2206.07373)|null|
 980 | |**2022-06-15**|**Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning**|Rui Liu et.al.|[2206.07229](http://arxiv.org/abs/2206.07229)|**[link](https://github.com/ttslr/strengthnet)**|
 981 | |**2022-12-12**|**A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation**|Junhui Zhang et.al.|[2206.04922](http://arxiv.org/abs/2206.04922)|null|
 982 | |**2022-06-09**|**Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos**|Alexander Waibel et.al.|[2206.04523](http://arxiv.org/abs/2206.04523)|null|
 983 | |**2022-06-07**|**FlexLip: A Controllable Text-to-Lip System**|Dan Oneata et.al.|[2206.03206](http://arxiv.org/abs/2206.03206)|null|
 984 | |**2022-10-11**|**UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder**|Jiachen Lian et.al.|[2206.02512](http://arxiv.org/abs/2206.02512)|null|
 985 | |**2023-10-19**|**Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech**|Ziyue Jiang et.al.|[2206.02147](http://arxiv.org/abs/2206.02147)|**[link](https://github.com/zain-jiang/dict-tts)**|
 986 | |**2022-11-02**|**AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation**|Kun Song et.al.|[2206.00208](http://arxiv.org/abs/2206.00208)|null|
 987 | |**2022-05-31**|**Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish**|Alp Öktem et.al.|[2205.15599](http://arxiv.org/abs/2205.15599)|**[link](https://github.com/collectivat-dev/ladino_tts)**|
 988 | |**2023-11-20**|**StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis**|Yinghao Aaron Li et.al.|[2205.15439](http://arxiv.org/abs/2205.15439)|**[link](https://github.com/yl4579/StyleTTS)**|
 989 | |**2022-05-30**|**Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data**|Sungwon Kim et.al.|[2205.15370](http://arxiv.org/abs/2205.15370)|null|
 990 | |**2022-05-26**|**QSpeech: Low-Qubit Quantum Speech Application Toolkit**|Zhenhou Hong et.al.|[2205.13221](http://arxiv.org/abs/2205.13221)|**[link](https://github.com/zhenhouhong/qspeech)**|
 991 | |**2022-11-10**|**T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation**|Paul-Ambroise Duquenne et.al.|[2205.12216](http://arxiv.org/abs/2205.12216)|null|
 992 | |**2022-05-20**|**PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit**|Hui Zhang et.al.|[2205.12007](http://arxiv.org/abs/2205.12007)|**[link](https://github.com/PaddlePaddle/PaddleSpeech)**|
 993 | |**2022-05-24**|**TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS**|Xulong Zhang et.al.|[2205.11824](http://arxiv.org/abs/2205.11824)|null|
 994 | |**2022-10-12**|**GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech**|Rongjie Huang et.al.|[2205.07211](http://arxiv.org/abs/2205.07211)|**[link](https://github.com/Rongjiehuang/GenerSpeech)**|
 995 | |**2022-05-13**|**Talking Face Generation with Multilingual TTS**|Hyoung-Kyu Song et.al.|[2205.06421](http://arxiv.org/abs/2205.06421)|null|
 996 | |**2022-05-10**|**NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality**|Xu Tan et.al.|[2205.04421](http://arxiv.org/abs/2205.04421)|**[link](https://github.com/microsoft/NeuralSpeech)**|
 997 | |**2022-05-09**|**Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech**|Yang Li et.al.|[2205.04120](http://arxiv.org/abs/2205.04120)|**[link](https://github.com/neurowave-ai/cucvae-tts)**|
 998 | |**2022-05-09**|**ReCAB-VAE: Gumbel-Softmax Variational Inference Based on Analytic Divergence**|Sangshin Oh et.al.|[2205.04104](http://arxiv.org/abs/2205.04104)|null|
 999 | |**2022-07-14**|**Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss**|Efthymios Georgiou et.al.|[2204.13437](http://arxiv.org/abs/2204.13437)|null|
1000 | |**2022-04-25**|**SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech**|Zhenhui Ye et.al.|[2204.11792](http://arxiv.org/abs/2204.11792)|null|
1001 | |**2022-04-22**|**LibriS2S: A German-English Speech-to-Speech Translation Corpus**|Pedro Jeuris et.al.|[2204.10593](http://arxiv.org/abs/2204.10593)|**[link](https://github.com/pedrodke/libris2s)**|
1002 | |**2022-07-05**|**Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation**|Ryo Terashima et.al.|[2204.10020](http://arxiv.org/abs/2204.10020)|null|
1003 | |**2022-04-21**|**FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis**|Rongjie Huang et.al.|[2204.09934](http://arxiv.org/abs/2204.09934)|**[link](https://github.com/Rongjiehuang/FastDiff)**|
1004 | |**2022-04-20**|**Audio Deep Fake Detection System with Neural Stitching for ADD 2022**|Rui Yan et.al.|[2204.08720](http://arxiv.org/abs/2204.08720)|null|
1005 | |**2022-04-14**|**Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech**|Cong Zhang et.al.|[2204.07228](http://arxiv.org/abs/2204.07228)|null|
1006 | |**2022-12-09**|**Study of Indian English Pronunciation Variabilities relative to Received Pronunciation**|Priyanshi Pal et.al.|[2204.06502](http://arxiv.org/abs/2204.06502)|null|
1007 | |**2022-04-12**|**Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch**|Hanbin Bae et.al.|[2204.05753](http://arxiv.org/abs/2204.05753)|null|
1008 | |**2023-01-30**|**The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance**|Lin Zhang et.al.|[2204.05177](http://arxiv.org/abs/2204.05177)|null|
1009 | |**2022-10-27**|**Fine-grained Noise Control for Multispeaker Speech Synthesis**|Karolos Nikitaras et.al.|[2204.05070](http://arxiv.org/abs/2204.05070)|null|
1010 | |**2022-08-31**|**Karaoker: Alignment-free singing voice synthesis with speech training data**|Panos Kakoulidis et.al.|[2204.04127](http://arxiv.org/abs/2204.04127)|null|
1011 | |**2022-08-15**|**Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech**|Jae-Sung Bae et.al.|[2204.04004](http://arxiv.org/abs/2204.04004)|null|
1012 | |**2022-04-07**|**Arabic Text-To-Speech (TTS) Data Preparation**|Hala Al Masri et.al.|[2204.03255](http://arxiv.org/abs/2204.03255)|null|
1013 | |**2022-04-07**|**Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis**|Yutian Wang et.al.|[2204.03238](http://arxiv.org/abs/2204.03238)|null|
1014 | |**2022-08-24**|**SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis**|Georgia Maniati et.al.|[2204.03040](http://arxiv.org/abs/2204.03040)|null|
1015 | |**2022-09-13**|**Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation**|Sravya Popuri et.al.|[2204.02967](http://arxiv.org/abs/2204.02967)|null|
1016 | |**2022-07-02**|**Representation Selective Self-distillation and wav2vec 2.0 Feature Exploration for Spoof-aware Speaker Verification**|Jin Woo Lee et.al.|[2204.02639](http://arxiv.org/abs/2204.02639)|null|
1017 | |**2023-08-28**|**Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech**|Hyungchan Yoon et.al.|[2204.02172](http://arxiv.org/abs/2204.02172)|null|
1018 | |**2022-09-07**|**Deliberation Model for On-Device Spoken Language Understanding**|Duc Le et.al.|[2204.01893](http://arxiv.org/abs/2204.01893)|null|
1019 | |**2022-12-14**|**Anti-Spoofing Using Transfer Learning with Variational Information Bottleneck**|Youngsik Eom et.al.|[2204.01387](http://arxiv.org/abs/2204.01387)|null|
1020 | |**2022-11-11**|**Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis**|Yixuan Zhou et.al.|[2204.00990](http://arxiv.org/abs/2204.00990)|null|
1021 | |**2022-06-30**|**VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature**|Chenpeng Du et.al.|[2204.00768](http://arxiv.org/abs/2204.00768)|null|
1022 | |**2022-04-01**|**AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios**|Yihan Wu et.al.|[2204.00436](http://arxiv.org/abs/2204.00436)|null|
1023 | |**2022-04-01**|**Text-To-Speech Data Augmentation for Low Resource Speech Recognition**|Rodolfo Zevallos et.al.|[2204.00291](http://arxiv.org/abs/2204.00291)|null|
1024 | |**2022-07-19**|**Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech**|Guangyan Zhang et.al.|[2203.17190](http://arxiv.org/abs/2203.17190)|null|
1025 | |**2022-03-31**|**An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer**|Wenlin Dai et.al.|[2203.16954](http://arxiv.org/abs/2203.16954)|**[link](https://github.com/thuhcsi/flattn)**|
1026 | |**2022-07-11**|**WavThruVec: Latent speech representation as intermediate features for neural speech synthesis**|Hubert Siuzdak et.al.|[2203.16930](http://arxiv.org/abs/2203.16930)|null|
1027 | |**2022-03-31**|**A Character-level Span-based Model for Mandarin Prosodic Structure Prediction**|Xueyuan Chen et.al.|[2203.16922](http://arxiv.org/abs/2203.16922)|**[link](https://github.com/thuhcsi/spanpsp)**|
1028 | |**2022-07-01**|**JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech**|Dan Lim et.al.|[2203.16852](http://arxiv.org/abs/2203.16852)|**[link](https://github.com/imdanboy/jets)**|
1029 | |**2022-03-31**|**Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset**|Zehui Yang et.al.|[2203.16844](http://arxiv.org/abs/2203.16844)|null|
1030 | |**2022-03-31**|**NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism**|Jingbei Li et.al.|[2203.16838](http://arxiv.org/abs/2203.16838)|**[link](https://github.com/thuhcsi/neufa)**|
1031 | |**2022-03-31**|**Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition**|Anirudh Gupta et.al.|[2203.16823](http://arxiv.org/abs/2203.16823)|null|
1032 | |**2022-04-21**|**Does Audio Deepfake Detection Generalize?**|Nicolas M. Müller et.al.|[2203.16263](http://arxiv.org/abs/2203.16263)|null|
1033 | |**2022-03-30**|**End to End Lip Synchronization with a Temporal AutoEncoder**|Yoav Shalev et.al.|[2203.16224](http://arxiv.org/abs/2203.16224)|**[link](https://github.com/itsyoavshalev/end-to-end-lip-synchronization-with-a-temporal-autoencoder)**|
1034 | |**2022-08-15**|**Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition**|Junrui Ni et.al.|[2203.15796](http://arxiv.org/abs/2203.15796)|**[link](https://github.com/lwang114/unsuptts)**|
1035 | |**2022-06-29**|**DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning**|Takaaki Saeki et.al.|[2203.15683](http://arxiv.org/abs/2203.15683)|null|
1036 | |**2022-11-05**|**Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation**|Rendi Chevi et.al.|[2203.15643](http://arxiv.org/abs/2203.15643)|**[link](https://github.com/rendchevi/nix-tts)**|
1037 | |**2022-10-06**|**Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus**|Minchan Kim et.al.|[2203.15447](http://arxiv.org/abs/2203.15447)|null|
1038 | |**2022-07-11**|**VoiceMe: Personalized voice generation in TTS**|Pol van Rijn et.al.|[2203.15379](http://arxiv.org/abs/2203.15379)|**[link](https://github.com/polvanrijn/voiceme)**|
1039 | 
1040 | [contributors-shield]: https://img.shields.io/github/contributors/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1041 | [contributors-url]: https://github.com/Vincentqyw/cv-arxiv-daily/graphs/contributors
1042 | [forks-shield]: https://img.shields.io/github/forks/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1043 | [forks-url]: https://github.com/Vincentqyw/cv-arxiv-daily/network/members
1044 | [stars-shield]: https://img.shields.io/github/stars/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1045 | [stars-url]: https://github.com/Vincentqyw/cv-arxiv-daily/stargazers
1046 | [issues-shield]: https://img.shields.io/github/issues/Vincentqyw/cv-arxiv-daily.svg?style=for-the-badge
1047 | [issues-url]: https://github.com/Vincentqyw/cv-arxiv-daily/issues
1048 | 
1049 | 


--------------------------------------------------------------------------------