├── .gitignore
├── .idea
│   ├── .gitignore
│   ├── SenseVoice-Real-Time.iml
│   ├── inspectionProfiles
│   │   ├── Project_Default.xml
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── vcs.xml
├── README.md
├── SenseVoiceSmall
│   └── .gitkeep
├── audio_convert.py
├── audio_logs
│   └── .gitkeep
├── demo1.py
├── demo_record_natural_voice_lock.py
├── demo_vad_voice_translate_async.py
├── model.py
├── models
│   └── vad_onnx
│       └── silero_vad.onnx
├── pics
│   └── 运行截图.png
├── requirements.txt
├── speak_log
│   └── log.txt
├── speakers
│   └── .gitkeep
└── speech_campplus_sv_zh_en_16k-common_advanced
    └── .gitkeep
/.gitignore:
--------------------------------------------------------------------------------
1 | speakers/*
2 | !speakers/.gitkeep
3 | audio_logs/*
4 | !audio_logs/.gitkeep
5 | models/SenseVoiceSmall/*
6 | !models/SenseVoiceSmall/.gitkeep
7 | models/speech_campplus_sv_zh_en_16k-common_advanced/*
8 | !models/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep
9 | /improve/
10 |
--------------------------------------------------------------------------------
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /shelf/
3 | /workspace.xml
4 | # Editor-based HTTP Client requests
5 | /httpRequests/
6 | # Datasource local storage ignored files
7 | /dataSources/
8 | /dataSources.local.xml
9 |
--------------------------------------------------------------------------------
/.idea/SenseVoice-Real-Time.iml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/Project_Default.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/profiles_settings.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SenseVoice-Real-Time
2 |
3 | This is a small learning project. It implements the simplest possible voice activity detection (VAD, not model-based, which is admittedly something to improve), speech transcription with SenseVoice, and speaker verification (a "voiceprint lock") with CAM++. None of it is complicated; it's just for fun, nothing impressive, and if you somehow found your way here, consider it fate, haha.
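For the curious, the "simplest possible VAD" here is just an amplitude threshold, essentially what `is_silent()` does in `demo_record_natural_voice_lock.py`. A minimal sketch (the threshold value matches the example parameters further down):

```Python
import numpy as np

def is_silent(data_chunk: np.ndarray, threshold: int = 500) -> bool:
    # A chunk counts as silence if its peak amplitude stays below the threshold
    return np.max(data_chunk) < threshold
```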
4 |
5 |
6 |
7 | PS: The section below is the usage guide for the old version. It still runs, of course, but for the latest instructions scroll to the bottom for the new-version guide.
8 |
9 |
10 |
11 | ## How to Run
12 |
13 | 1. **Install a CUDA build of torch; any version <= 2.3 will do (the download may be slow)**
14 |
15 | ```
16 | pip install torch==2.2.1+cu118 torchaudio==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118
17 | ```
18 |
19 | 2. **Install the dependencies in requirements.txt**
20 |
21 | ```
22 | pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
23 | ```
24 |
25 | 3. **Download `SenseVoiceSmall` from `modelscope`**
26 |
27 | Link: https://www.modelscope.cn/models/iic/SenseVoiceSmall/files
28 |
29 | Or download it directly via git lfs:
30 |
31 | ```
32 | git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
33 | ```
34 |
35 | Once downloaded, put it into the `SenseVoiceSmall` folder in the project root.
36 |
37 | 4. **Download `iic/speech_campplus_sv_zh_en_16k-common_advanced` from `modelscope`**
38 | Link: https://www.modelscope.cn/models/iic/speech_campplus_sv_zh_en_16k-common_advanced/files
39 | Or download it directly via git lfs:
40 |
41 | ```
42 | git clone https://www.modelscope.cn/iic/speech_campplus_sv_zh_en_16k-common_advanced.git
43 | ```
44 |
45 | Once downloaded, put it into the `speech_campplus_sv_zh_en_16k-common_advanced` folder in the project root.
46 |
47 | 5. **Record a reference clip yourself (a quick phone recording is fine) and convert its sample rate with the provided script**
48 |
49 | First, put the audio file into the `speakers` folder.
50 | The project provides the script `audio_convert.py`, which converts the audio to WAV and resamples it to 16 kHz, since the `speech_campplus` model can only handle 16 kHz audio.
51 | If you renamed your audio file, remember to change the file name in `demo_record_natural_voice_lock.py` as well:
52 |
53 | ```Python
54 | def main():
55 |     # Create the save directory (if it does not exist)
56 | save_directory = "audio_logs"
57 | os.makedirs(save_directory, exist_ok=True)
58 |     # Load the voiceprint-lock reference audio; change this if you renamed your file
59 | reference_audio = "speakers/speaker_mine_converted.wav"
60 | ```
61 |
62 | 6. Run `demo_record_natural_voice_lock.py`
63 |
64 | ```
65 | python demo_record_natural_voice_lock.py
66 | ```
67 |
68 |
69 |
70 |
71 | ## Example Run
72 |
73 | 
74 |
75 |
76 |
77 | If the results are not great, try tweaking the parameters in `demo_record_natural_voice_lock.py`:
78 |
79 | ```Python
80 | # Audio parameters
81 | CHUNK = 1024
82 | FORMAT = pyaudio.paInt16
83 | CHANNELS = 1
84 | RATE = 16000
85 | MAX_TIME = 60  # maximum recording time (seconds)
86 |
87 | # VAD parameters
88 | THRESHOLD = 500
89 | SILENCE_LIMIT = 2
90 |
91 | # Speaker-verification parameters
92 | SIMILARITY_THRESHOLD = 0.1  # similarity threshold; adjust as needed
93 | ```
94 |
95 |
96 |
97 | Also, I find SenseVoice's character error rate a bit too high. On the plus side, it recognizes Cantonese and uses little GPU memory. Feel free to build on this project and swap in another model.
98 |
99 |
100 |
101 |
102 |
103 | # 2024.09.23 Update
104 |
105 | The update finally landed, and it is a big one. SenseVoice's character error rate was a bit too high, so this version switches to a different model, and it is still simple to use.
106 |
107 | What it does now: endpoint detection with a VAD model, higher recognition accuracy, lower GPU memory usage, and punctuation added after recognition. (No voiceprint lock this time, though adding one is easy.)
108 |
109 |
110 |
111 | First, if you cloned this project before, pull again to get the latest files.
112 |
113 |
114 |
115 | Then run pip install again, since a few dependencies were updated:
116 |
117 | ```
118 | pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
119 | ```
120 |
121 |
122 |
123 | **Then download each of these models:**
124 |
125 | 1. `speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch`
126 |
127 | ```
128 | git clone https://www.modelscope.cn/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git
129 | ```
130 |
131 | 2. `speech_fsmn_vad_zh-cn-16k-common-pytorch`
132 |
133 | ```
134 | git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.git
135 | ```
136 |
137 | 3. `punc_ct-transformer_zh-cn-common-vocab272727-pytorch`
138 |
139 | ```
140 | git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch.git
141 | ```
142 |
143 | PS: I did not add speaker verification to the new version, but it is easy to do: look at the official docs for the `speech_campplus_sv_zh-cn_16k-common` model, wire the model in, and convert a 16 kHz reference clip as described in the old-version instructions above.
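For reference, a minimal sketch of what that wiring could look like, based on how the old demo (`demo_record_natural_voice_lock.py`) calls the modelscope pipeline; the model path, revision, example file names and threshold below are taken from this repo and may need adjusting to whichever CAM++ model you actually download:

```py
from modelscope.pipelines import pipeline

# Speaker-verification pipeline (path and revision as used in the old demo)
sv_pipeline = pipeline(task='speaker-verification',
                       model='models/speech_campplus_sv_zh-cn_16k-common',
                       model_revision='v1.0.0')

# Both clips must be 16 kHz mono WAV files (see audio_convert.py)
result = sv_pipeline(['speakers/speaker_mine_converted.wav', 'audio_logs/speech_segment_1.wav'])
if result['score'] >= 0.1:  # similarity threshold, tune as needed
    print("Voiceprint check passed, safe to run ASR on this segment")
```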
144 |
145 | Then put all the models into the `models` folder in the project root, each in its own subfolder; as long as you keep the paths straight, you are fine.
146 |
147 | ```py
148 | asr_model = AutoModel(model="./models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
149 | model_revision="v2.0.4",
150 | vad_model="./models/speech_fsmn_vad_zh-cn-16k-common-pytorch",
151 | vad_model_revision="v2.0.4",
152 | punc_model="./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
153 | punc_model_revision="v2.0.4",
154 |                       # To add speaker verification, plug the model in here
155 | # spk_model="./models/speech_campplus_sv_zh-cn_16k-common",
156 | # spk_model_revision="v2.0.2",
157 | device="cuda:0")
158 | ```
159 |
160 |
161 |
162 | Make sure these two paths already exist; `silero_vad` is the model actually used for endpoint detection.
163 |
164 | ```py
165 | vad_model_path = './models/vad_onnx/silero_vad.onnx'  # path to the VAD ONNX model
166 | audio_save_path = './audio_logs'  # path for saving temporary audio files
167 | ```
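For context, this is roughly how `demo_vad_voice_translate_async.py` wires `silero_vad.onnx` into sherpa-onnx (the values mirror the defaults used in this repo):

```py
import sherpa_onnx

vad_config = sherpa_onnx.VadModelConfig()
vad_config.silero_vad.model = './models/vad_onnx/silero_vad.onnx'
vad_config.silero_vad.min_silence_duration = 1.5   # seconds of silence that close a segment
vad_config.silero_vad.min_speech_duration = 0.05   # shortest speech span to keep (seconds)
vad_config.sample_rate = 16000

vad = sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=100)
```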
168 |
169 | So why does `AutoModel` get a VAD model as well? Because without it the PUNC model does not take effect, and I did not bother splitting them apart; it does not seem to cost much performance either.
170 |
171 | If that bothers you, remove the VAD from `AutoModel` and run PUNC as a separate model; the code has detailed comments, and there is a sketch below.
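A minimal sketch of that split, adapted from the example in the comments of `demo_vad_voice_translate_async.py` (model paths and revisions are the same ones used above; the input file name is just an example):

```py
from funasr import AutoModel

# ASR without the bundled VAD/PUNC
asr_only = AutoModel(model="./models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                     model_revision="v2.0.4",
                     device="cuda:0")
# Standalone punctuation model
punc = AutoModel(model="./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
                 model_revision="v2.0.4")

asr_res = asr_only.generate(input="./audio_logs/speech_segment_1.wav", batch_size_s=300)
punc_res = punc.generate(input=asr_res[0]["text"])  # feed the raw text through the punctuation model
print(punc_res[0]["text"])
```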
172 |
173 |
174 |
175 | Tune these parameters for better results:
176 |
177 | ```py
178 | max_files = 10  # maximum number of temporary audio files to keep
179 | window_size = 512  # VAD window size (samples)
180 | min_silence_duration = 1.5  # minimum silence duration (s)
181 | min_speech_duration = 0.05  # minimum speech duration (s)
182 | segment_count = 0  # counter used to name the audio files
183 | ```
184 |
185 |
186 |
187 | Finally, run `demo_vad_voice_translate_async.py`.
188 |
189 | The project uses multithreading so that recording of the next segment continues while the previous one is still being recognized, which makes "real time" feel more real time.
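The hand-off itself is just `ThreadPoolExecutor.submit`, as in `asr_inference_task()`; here is a toy illustration of the idea (no audio or models involved, the timings are made up):

```py
import time
from concurrent.futures import ThreadPoolExecutor

def fake_asr(segment_id):
    # Stand-in for asr_inference_task(); pretend recognition takes 2 seconds
    time.sleep(2)
    print(f"[{segment_id}] recognized")

with ThreadPoolExecutor(max_workers=2) as executor:
    for segment_id in range(1, 4):
        time.sleep(1)  # "recording" the next segment goes on while workers handle the previous ones
        executor.submit(fake_asr, segment_id)
```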
190 |
191 | If you hit problems, open an issue.
--------------------------------------------------------------------------------
/SenseVoiceSmall/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/SenseVoiceSmall/.gitkeep
--------------------------------------------------------------------------------
/audio_convert.py:
--------------------------------------------------------------------------------
1 | from pydub import AudioSegment
2 |
3 | # Load the audio file; adjust the path and file name as needed
4 | audio = AudioSegment.from_file("./speakers/speaker_mine.mp3")
5 |
6 | # Convert to mono and resample
7 | audio = audio.set_channels(1)
8 | audio = audio.set_frame_rate(16000)  # resample to 16 kHz, as required by the speaker-verification model
9 |
10 | # Save the converted audio file
11 | audio.export("./speakers/speaker_mine_converted.wav", format="wav")
12 |
--------------------------------------------------------------------------------
/audio_logs/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/audio_logs/.gitkeep
--------------------------------------------------------------------------------
/demo1.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- encoding: utf-8 -*-
3 | # Copyright FunASR (https://github.com/FunAudioLLM/SenseVoice). All Rights Reserved.
4 | # MIT License (https://opensource.org/licenses/MIT)
5 |
6 | from funasr import AutoModel
7 | from funasr.utils.postprocess_utils import rich_transcription_postprocess
8 |
9 | model_dir = "models/SenseVoiceSmall"
10 |
11 | model = AutoModel(
12 | model=model_dir,
13 | trust_remote_code=True,
14 | remote_code="./model.py",
15 | vad_model="fsmn-vad",
16 | vad_kwargs={"max_single_segment_time": 30000},
17 | device="cuda:0",
18 | )
19 |
20 | # en
21 | res = model.generate(
22 | input=f"{model.model_path}/example/en.mp3",
23 | cache={},
24 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
25 | use_itn=True,
26 | batch_size_s=60,
27 | merge_vad=True, #
28 | merge_length_s=15,
29 | )
30 | text = rich_transcription_postprocess(res[0]["text"])
31 | print(text)
32 |
33 | # zh
34 | res = model.generate(
35 | input=f"{model.model_path}/example/zh.mp3",
36 | cache={},
37 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
38 | use_itn=True,
39 | batch_size_s=60,
40 | merge_vad=True, #
41 | merge_length_s=15,
42 | )
43 | text = rich_transcription_postprocess(res[0]["text"])
44 | print(text)
45 |
46 | # yue
47 | res = model.generate(
48 | input=f"{model.model_path}/example/yue.mp3",
49 | cache={},
50 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
51 | use_itn=True,
52 | batch_size_s=60,
53 | merge_vad=True, #
54 | merge_length_s=15,
55 | )
56 | text = rich_transcription_postprocess(res[0]["text"])
57 | print(text)
58 |
59 | # ja
60 | res = model.generate(
61 | input=f"{model.model_path}/example/ja.mp3",
62 | cache={},
63 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
64 | use_itn=True,
65 | batch_size_s=60,
66 | merge_vad=True, #
67 | merge_length_s=15,
68 | )
69 | text = rich_transcription_postprocess(res[0]["text"])
70 | print(text)
71 |
72 | # ko
73 | res = model.generate(
74 | input=f"{model.model_path}/example/ko.mp3",
75 | cache={},
76 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech"
77 | use_itn=True,
78 | batch_size_s=60,
79 | merge_vad=True, #
80 | merge_length_s=15,
81 | )
82 | text = rich_transcription_postprocess(res[0]["text"])
83 | print(text)
84 |
--------------------------------------------------------------------------------
/demo_record_natural_voice_lock.py:
--------------------------------------------------------------------------------
1 | import glob
2 | import os
3 |
4 | import pyaudio
5 | import wave
6 | import numpy as np
7 | import time
8 | from modelscope.pipelines import pipeline
9 | from model import SenseVoiceSmall
10 | from funasr.utils.postprocess_utils import rich_transcription_postprocess
11 |
12 | # Audio parameters
13 | CHUNK = 1024
14 | FORMAT = pyaudio.paInt16
15 | CHANNELS = 1
16 | RATE = 16000
17 | MAX_TIME = 60  # maximum recording time (seconds)
18 |
19 | # VAD parameters
20 | THRESHOLD = 1000
21 | SILENCE_LIMIT = 2
22 |
23 | # Speaker-verification parameters
24 | SIMILARITY_THRESHOLD = 0.1  # similarity threshold; adjust as needed
25 |
26 | accumulated_audio = np.array([])  # buffer for accumulated audio data
27 | chunk_size = 200 # ms
28 | sample_rate = 16000
29 | chunk_stride = int(chunk_size * sample_rate / 1000)
30 |
31 |
32 | # Initialize the speaker-verification model
33 | sv_pipeline = pipeline(
34 | task='speaker-verification',
35 | model='models/speech_campplus_sv_zh-cn_16k-common',
36 | model_revision='v1.0.0'
37 | )
38 |
39 | # Initialize the SenseVoiceSmall model
40 | model_dir = "models/SenseVoiceSmall"
41 | m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
42 | m.eval()
43 |
44 |
45 | def is_silent(data_chunk):
46 | return max(data_chunk) < THRESHOLD
47 |
48 |
49 | def record_audio():
50 | p = pyaudio.PyAudio()
51 | stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
52 |
53 | print("开始监听...")
54 | audio_buffer = []
55 | silence_start = None
56 | is_recording = False
57 |
58 | while True:
59 | data = stream.read(CHUNK)
60 | audio_buffer.append(data)
61 |
62 | if len(audio_buffer) > RATE / CHUNK * MAX_TIME:
63 | audio_buffer.pop(0)
64 |
65 | if not is_recording:
66 | if not is_silent(np.frombuffer(data, dtype=np.int16)):
67 | print("检测到声音,开始录音...")
68 | is_recording = True
69 | silence_start = None
70 | else:
71 | if is_silent(np.frombuffer(data, dtype=np.int16)):
72 | if silence_start is None:
73 | silence_start = time.time()
74 | elif time.time() - silence_start > SILENCE_LIMIT:
75 | print("检测到静音,停止录音")
76 | break
77 | else:
78 | silence_start = None
79 |
80 | stream.stop_stream()
81 | stream.close()
82 | p.terminate()
83 |
84 | return b''.join(audio_buffer)
85 |
86 | def save_audio(data, filename):
87 | wf = wave.open(filename, 'wb')
88 | wf.setnchannels(CHANNELS)
89 | wf.setsampwidth(pyaudio.PyAudio().get_sample_size(FORMAT))
90 | wf.setframerate(RATE)
91 | wf.writeframes(data)
92 | wf.close()
93 |
94 |
95 | def verify_voice(reference_audio, input_audio):
96 | result = sv_pipeline([reference_audio, input_audio])
97 | print(result)
98 |     return result['score']  # return the similarity score
99 |
100 |
101 | def transcribe_audio(audio_file):
102 | res = m.inference(
103 | data_in=audio_file,
104 | language="auto",
105 | use_itn=False,
106 | ban_emo_unk=False,
107 | **kwargs,
108 | )
109 |
110 | text = rich_transcription_postprocess(res[0][0]["text"])
111 | return text
112 |
113 |
114 | # Manual hotword correction
115 | def replace_diy_hotword(sentence):
116 |     # Map of frequently mis-recognized words to their corrections
117 | error_hotkey_map = {
118 | '你冇': '蕾姆',
119 | '我冇': '蕾姆',
120 | '雷母': '蕾姆',
121 | '雷姆': '蕾姆',
122 | '蕾母': '蕾姆',
123 | '雷冇': '蕾姆',
124 | '蕾冇': '蕾姆',
125 | '人母': '蕾姆',
126 | '你悟': '蕾姆',
127 | '你姆': '蕾姆',
128 | '人冇': '蕾姆',
129 | '人姆': '蕾姆',
130 | '李慕': '蕾姆',
131 |         # add more mis-recognitions and corrections here
132 |     }
133 |     for wrong_word, correct_word in error_hotkey_map.items():
134 |         # update sentence with each replacement
135 | sentence = sentence.replace(wrong_word, correct_word)
136 | # print("corrected_text:{}".format(sentence))
137 | return sentence
138 |
139 |
140 | def main():
141 |     # Create the save directory (if it does not exist)
142 |     save_directory = "audio_logs"
143 |     os.makedirs(save_directory, exist_ok=True)
144 |     # Load the voiceprint-lock reference audio
145 |     reference_audio = "speakers/speaker_mine_converted.wav"
146 |
147 |     max_files = 10  # maximum number of files to keep
148 |
149 | while True:
150 | audio_data = record_audio()
151 |         # Build the save path with os.path.join
152 |         output_filename = os.path.join(save_directory, f"recorded_audio_{int(time.time())}.wav")
153 |         save_audio(audio_data, output_filename)
154 |
155 |         # Remove the oldest file once the count exceeds max_files
156 |         existing_files = glob.glob(os.path.join(save_directory, "*.wav"))
157 |         if len(existing_files) > max_files:
158 |             # sort by modification time
159 |             existing_files.sort(key=os.path.getmtime)
160 |             # delete the oldest file
161 | os.remove(existing_files[0])
162 |
163 | print("正在进行声纹验证...")
164 | similarity = verify_voice(reference_audio, output_filename)
165 |
166 | if similarity >= SIMILARITY_THRESHOLD:
167 | print(f"声纹验证通过 (相似度: {similarity:.2f})")
168 | print("正在进行语音识别...")
169 | transcribed_text = replace_diy_hotword(transcribe_audio(output_filename))
170 | print("识别结果:")
171 | print(transcribed_text)
172 |
173 | log_dir = 'speak_log'
174 | if not os.path.exists(log_dir):
175 | os.makedirs(log_dir)
176 | with open(os.path.join(log_dir, 'log.txt'), 'a', encoding='utf-8') as f:
177 |                 if transcribed_text:  # skip empty results
178 |                     f.write(transcribed_text)
179 |                     f.write("\n")
180 |                     f.flush()  # flush the buffer so the data hits disk
181 | else:
182 | print(f"声纹验证失败 (相似度: {similarity:.2f})")
183 |
184 | print("\n准备进行下一次录音,按 Ctrl+C 退出程序")
185 |
186 |
187 | if __name__ == "__main__":
188 | try:
189 | main()
190 | except KeyboardInterrupt:
191 | print("程序已退出")
192 |
--------------------------------------------------------------------------------
/demo_vad_voice_translate_async.py:
--------------------------------------------------------------------------------
1 | import gc
2 | import glob
3 | import time
4 | import numpy as np
5 | import sounddevice as sd
6 | import sherpa_onnx
7 | import os
8 | import scipy.io.wavfile as wav
9 | import torch
10 | from funasr import AutoModel
11 | from concurrent.futures import ThreadPoolExecutor
12 |
13 | g_sample_rate = 16000  # sample rate; must be 16 kHz
14 | vad_model_path = './models/vad_onnx/silero_vad.onnx'  # path to the VAD ONNX model
15 | audio_save_path = './audio_logs'  # path for saving temporary audio files
16 | max_files = 10  # maximum number of temporary audio files to keep
17 | window_size = 512  # VAD window size (samples)
18 | min_silence_duration = 1.5  # minimum silence duration (s)
19 | min_speech_duration = 0.05  # minimum speech duration (s)
20 | segment_count = 0  # counter used to name the audio files
21 |
22 | # Initialize the ASR, PUNC and (optionally) SPK models
23 | # Why add a VAD model here as well? Without it the PUNC model does not take effect, and I did not bother splitting them apart; it does not seem to cost much performance either.
24 | # If that bothers you, remove the VAD from AutoModel and run PUNC on its own, for example:
25 | # model = AutoModel(model="ct-punc", model_revision="v2.0.4")
26 | # res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
27 | asr_model = AutoModel(model="./models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
28 | model_revision="v2.0.4",
29 | vad_model="./models/speech_fsmn_vad_zh-cn-16k-common-pytorch",
30 | vad_model_revision="v2.0.4",
31 | punc_model="./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
32 | punc_model_revision="v2.0.4",
33 |                       # To add speaker verification, plug the model in here
34 | # spk_model="./models/speech_campplus_sv_zh-cn_16k-common",
35 | # spk_model_revision="v2.0.2",
36 | device="cuda:0")
37 |
38 |
39 | def save_audio_to_file(audio_data, file_name):
40 | output_dir = os.path.dirname(file_name)
41 | if not os.path.exists(output_dir):
42 | os.makedirs(output_dir)
43 |
44 | amplified_audio = audio_data * 2
45 | amplified_audio = np.clip(amplified_audio, -1.0, 1.0)
46 | wav.write(file_name, g_sample_rate, amplified_audio.astype(np.float32))
47 | print(f"Audio saved to {file_name}")
48 |
49 |
50 | def rm_file():
51 |     # Remove the oldest file once the count exceeds max_files
52 | existing_files = glob.glob(os.path.join(audio_save_path, "*.wav"))
53 | if len(existing_files) > max_files:
54 |         # sort by modification time
55 |         existing_files.sort(key=os.path.getmtime)
56 |         # delete the oldest file
57 | os.remove(existing_files[0])
58 |
59 |
60 | def asr_inference_task(file_name, segment_count):
61 | """
62 |     ASR inference task: load audio from a file and run recognition.
63 | """
64 | try:
65 | asr_res = asr_model.generate(input=file_name,
66 | batch_size_s=300,
67 | hotword='蕾姆 拉姆')
68 | for resi in asr_res:
69 | print(f"[{segment_count}] 结果:{resi['text']}")
70 | print("--" * 150)
71 | finally:
72 |         # Manually free GPU memory; note that asr_res may be unbound if generate() raised
73 |         asr_res = None  # drop the reference to the result object (safer than del here)
74 |         torch.cuda.empty_cache()  # release GPU memory
75 |         gc.collect()  # force garbage collection
76 |
77 |
78 | def start_recording_with_vad(vad_model_path, window_size):
79 | global segment_count
80 | vad_config = sherpa_onnx.VadModelConfig()
81 | vad_config.silero_vad.model = vad_model_path
82 | vad_config.silero_vad.min_silence_duration = min_silence_duration
83 | vad_config.silero_vad.min_speech_duration = min_speech_duration
84 | vad_config.sample_rate = g_sample_rate
85 |
86 | vad = sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=100)
87 |
88 | samples_per_read = int(0.1 * g_sample_rate)
89 |
90 | print("--" * 150)
91 | devices = sd.query_devices()
92 | if len(devices) == 0:
93 | print("No microphone devices found")
94 | return
95 |
96 | default_input_device_idx = sd.default.device[0]
97 | print(f'当前使用默认设备: 【{devices[default_input_device_idx]["name"]}】')
98 | print("开始监听...")
99 |
100 |     buffer = np.array([])  # initialize the buffer
101 |     preroll_buffer = np.array([])  # stores the pre-roll (leading) audio
102 |     preroll_duration = 0.3  # how much pre-roll audio to keep (s)
103 |     preroll_samples = int(preroll_duration * g_sample_rate)  # number of pre-roll samples to keep (0.3 s * 16000 = 4800 samples)
104 |
105 |     # Initialize the thread-pool executor
106 | with ThreadPoolExecutor(max_workers=2) as executor:
107 | with sd.InputStream(channels=1, dtype="float32", samplerate=g_sample_rate) as stream:
108 | while True:
109 |                 samples, _ = stream.read(samples_per_read)  # read 0.1 s of audio per iteration
110 |                 samples = samples.reshape(-1)  # flatten the audio data to a 1-D array
111 |
112 |                 # update the pre-roll buffer
113 | preroll_buffer = np.concatenate([preroll_buffer, samples])
114 | if len(preroll_buffer) > preroll_samples:
115 | preroll_buffer = preroll_buffer[-preroll_samples:]
116 |
117 |                 buffer = np.concatenate([buffer, samples])  # append the newly read audio to the buffer
118 |
119 | while len(buffer) > window_size:
120 | start_time = time.time()
121 |                     vad.accept_waveform(buffer[:window_size])  # feed one window-sized chunk of audio to the VAD model
122 | buffer = buffer[window_size:]
123 |
124 | while not vad.empty():
125 | end_time = time.time()
126 | detection_time = end_time - start_time
127 | segment = vad.front
128 |
129 | if len(segment.samples) < 0.5 * g_sample_rate:
130 | vad.pop()
131 | continue
132 |
133 | print(f"检测到语音活动, VAD检测耗时: {detection_time:.6f} s")
134 |
135 |                         # merge the pre-roll audio with the detected speech segment
136 | full_segment = np.concatenate([preroll_buffer, np.array(segment.samples)])
137 | segment_count += 1
138 | file_name = f"{audio_save_path}/speech_segment_{segment_count}.wav"
139 | save_audio_to_file(np.array(full_segment), file_name)
140 |
141 |                         # submit the ASR inference task to the thread pool for asynchronous processing
142 | executor.submit(asr_inference_task, file_name, segment_count)
143 |
144 | vad.pop()
145 |
146 |
147 | if __name__ == "__main__":
148 | try:
149 | start_recording_with_vad(vad_model_path, window_size)
150 | except KeyboardInterrupt:
151 | print("加,马达捏~")
152 |
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
1 |
2 | import time
3 | import torch
4 | from torch import nn
5 | import torch.nn.functional as F
6 | from typing import Iterable, Optional
7 |
8 | from funasr.register import tables
9 | from funasr.models.ctc.ctc import CTC
10 | from funasr.utils.datadir_writer import DatadirWriter
11 | from funasr.models.paraformer.search import Hypothesis
12 | from funasr.train_utils.device_funcs import force_gatherable
13 | from funasr.losses.label_smoothing_loss import LabelSmoothingLoss
14 | from funasr.metrics.compute_acc import compute_accuracy, th_accuracy
15 | from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank
16 |
17 |
18 | class SinusoidalPositionEncoder(torch.nn.Module):
19 | """ """
20 |
21 | def __int__(self, d_model=80, dropout_rate=0.1):
22 | pass
23 |
24 | def encode(
25 | self, positions: torch.Tensor = None, depth: int = None, dtype: torch.dtype = torch.float32
26 | ):
27 | batch_size = positions.size(0)
28 | positions = positions.type(dtype)
29 | device = positions.device
30 | log_timescale_increment = torch.log(torch.tensor([10000], dtype=dtype, device=device)) / (
31 | depth / 2 - 1
32 | )
33 | inv_timescales = torch.exp(
34 | torch.arange(depth / 2, device=device).type(dtype) * (-log_timescale_increment)
35 | )
36 | inv_timescales = torch.reshape(inv_timescales, [batch_size, -1])
37 | scaled_time = torch.reshape(positions, [1, -1, 1]) * torch.reshape(
38 | inv_timescales, [1, 1, -1]
39 | )
40 | encoding = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=2)
41 | return encoding.type(dtype)
42 |
43 | def forward(self, x):
44 | batch_size, timesteps, input_dim = x.size()
45 | positions = torch.arange(1, timesteps + 1, device=x.device)[None, :]
46 | position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
47 |
48 | return x + position_encoding
49 |
50 |
51 | class PositionwiseFeedForward(torch.nn.Module):
52 | """Positionwise feed forward layer.
53 |
54 | Args:
55 |         idim (int): Input dimension.
56 | hidden_units (int): The number of hidden units.
57 | dropout_rate (float): Dropout rate.
58 |
59 | """
60 |
61 | def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):
62 |         """Construct a PositionwiseFeedForward object."""
63 | super(PositionwiseFeedForward, self).__init__()
64 | self.w_1 = torch.nn.Linear(idim, hidden_units)
65 | self.w_2 = torch.nn.Linear(hidden_units, idim)
66 | self.dropout = torch.nn.Dropout(dropout_rate)
67 | self.activation = activation
68 |
69 | def forward(self, x):
70 | """Forward function."""
71 | return self.w_2(self.dropout(self.activation(self.w_1(x))))
72 |
73 |
74 | class MultiHeadedAttentionSANM(nn.Module):
75 | """Multi-Head Attention layer.
76 |
77 | Args:
78 | n_head (int): The number of heads.
79 | n_feat (int): The number of features.
80 | dropout_rate (float): Dropout rate.
81 |
82 | """
83 |
84 | def __init__(
85 | self,
86 | n_head,
87 | in_feat,
88 | n_feat,
89 | dropout_rate,
90 | kernel_size,
91 | sanm_shfit=0,
92 | lora_list=None,
93 | lora_rank=8,
94 | lora_alpha=16,
95 | lora_dropout=0.1,
96 | ):
97 |         """Construct a MultiHeadedAttention object."""
98 | super().__init__()
99 | assert n_feat % n_head == 0
100 | # We assume d_v always equals d_k
101 | self.d_k = n_feat // n_head
102 | self.h = n_head
103 | # self.linear_q = nn.Linear(n_feat, n_feat)
104 | # self.linear_k = nn.Linear(n_feat, n_feat)
105 | # self.linear_v = nn.Linear(n_feat, n_feat)
106 |
107 | self.linear_out = nn.Linear(n_feat, n_feat)
108 | self.linear_q_k_v = nn.Linear(in_feat, n_feat * 3)
109 | self.attn = None
110 | self.dropout = nn.Dropout(p=dropout_rate)
111 |
112 | self.fsmn_block = nn.Conv1d(
113 | n_feat, n_feat, kernel_size, stride=1, padding=0, groups=n_feat, bias=False
114 | )
115 | # padding
116 | left_padding = (kernel_size - 1) // 2
117 | if sanm_shfit > 0:
118 | left_padding = left_padding + sanm_shfit
119 | right_padding = kernel_size - 1 - left_padding
120 | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0)
121 |
122 | def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None):
123 | b, t, d = inputs.size()
124 | if mask is not None:
125 | mask = torch.reshape(mask, (b, -1, 1))
126 | if mask_shfit_chunk is not None:
127 | mask = mask * mask_shfit_chunk
128 | inputs = inputs * mask
129 |
130 | x = inputs.transpose(1, 2)
131 | x = self.pad_fn(x)
132 | x = self.fsmn_block(x)
133 | x = x.transpose(1, 2)
134 | x += inputs
135 | x = self.dropout(x)
136 | if mask is not None:
137 | x = x * mask
138 | return x
139 |
140 | def forward_qkv(self, x):
141 | """Transform query, key and value.
142 |
143 | Args:
144 | query (torch.Tensor): Query tensor (#batch, time1, size).
145 | key (torch.Tensor): Key tensor (#batch, time2, size).
146 | value (torch.Tensor): Value tensor (#batch, time2, size).
147 |
148 | Returns:
149 | torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
150 | torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
151 | torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
152 |
153 | """
154 | b, t, d = x.size()
155 | q_k_v = self.linear_q_k_v(x)
156 | q, k, v = torch.split(q_k_v, int(self.h * self.d_k), dim=-1)
157 | q_h = torch.reshape(q, (b, t, self.h, self.d_k)).transpose(
158 | 1, 2
159 | ) # (batch, head, time1, d_k)
160 | k_h = torch.reshape(k, (b, t, self.h, self.d_k)).transpose(
161 | 1, 2
162 | ) # (batch, head, time2, d_k)
163 | v_h = torch.reshape(v, (b, t, self.h, self.d_k)).transpose(
164 | 1, 2
165 | ) # (batch, head, time2, d_k)
166 |
167 | return q_h, k_h, v_h, v
168 |
169 | def forward_attention(self, value, scores, mask, mask_att_chunk_encoder=None):
170 | """Compute attention context vector.
171 |
172 | Args:
173 | value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
174 | scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
175 | mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
176 |
177 | Returns:
178 | torch.Tensor: Transformed value (#batch, time1, d_model)
179 | weighted by the attention score (#batch, time1, time2).
180 |
181 | """
182 | n_batch = value.size(0)
183 | if mask is not None:
184 | if mask_att_chunk_encoder is not None:
185 | mask = mask * mask_att_chunk_encoder
186 |
187 | mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
188 |
189 | min_value = -float(
190 | "inf"
191 | ) # float(numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min)
192 | scores = scores.masked_fill(mask, min_value)
193 | self.attn = torch.softmax(scores, dim=-1).masked_fill(
194 | mask, 0.0
195 | ) # (batch, head, time1, time2)
196 | else:
197 | self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2)
198 |
199 | p_attn = self.dropout(self.attn)
200 | x = torch.matmul(p_attn, value) # (batch, head, time1, d_k)
201 | x = (
202 | x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
203 | ) # (batch, time1, d_model)
204 |
205 | return self.linear_out(x) # (batch, time1, d_model)
206 |
207 | def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
208 | """Compute scaled dot product attention.
209 |
210 | Args:
211 | query (torch.Tensor): Query tensor (#batch, time1, size).
212 | key (torch.Tensor): Key tensor (#batch, time2, size).
213 | value (torch.Tensor): Value tensor (#batch, time2, size).
214 | mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
215 | (#batch, time1, time2).
216 |
217 | Returns:
218 | torch.Tensor: Output tensor (#batch, time1, d_model).
219 |
220 | """
221 | q_h, k_h, v_h, v = self.forward_qkv(x)
222 | fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk)
223 | q_h = q_h * self.d_k ** (-0.5)
224 | scores = torch.matmul(q_h, k_h.transpose(-2, -1))
225 | att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder)
226 | return att_outs + fsmn_memory
227 |
228 | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
229 | """Compute scaled dot product attention.
230 |
231 | Args:
232 | query (torch.Tensor): Query tensor (#batch, time1, size).
233 | key (torch.Tensor): Key tensor (#batch, time2, size).
234 | value (torch.Tensor): Value tensor (#batch, time2, size).
235 | mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
236 | (#batch, time1, time2).
237 |
238 | Returns:
239 | torch.Tensor: Output tensor (#batch, time1, d_model).
240 |
241 | """
242 | q_h, k_h, v_h, v = self.forward_qkv(x)
243 | if chunk_size is not None and look_back > 0 or look_back == -1:
244 | if cache is not None:
245 | k_h_stride = k_h[:, :, : -(chunk_size[2]), :]
246 | v_h_stride = v_h[:, :, : -(chunk_size[2]), :]
247 | k_h = torch.cat((cache["k"], k_h), dim=2)
248 | v_h = torch.cat((cache["v"], v_h), dim=2)
249 |
250 | cache["k"] = torch.cat((cache["k"], k_h_stride), dim=2)
251 | cache["v"] = torch.cat((cache["v"], v_h_stride), dim=2)
252 | if look_back != -1:
253 | cache["k"] = cache["k"][:, :, -(look_back * chunk_size[1]) :, :]
254 | cache["v"] = cache["v"][:, :, -(look_back * chunk_size[1]) :, :]
255 | else:
256 | cache_tmp = {
257 | "k": k_h[:, :, : -(chunk_size[2]), :],
258 | "v": v_h[:, :, : -(chunk_size[2]), :],
259 | }
260 | cache = cache_tmp
261 | fsmn_memory = self.forward_fsmn(v, None)
262 | q_h = q_h * self.d_k ** (-0.5)
263 | scores = torch.matmul(q_h, k_h.transpose(-2, -1))
264 | att_outs = self.forward_attention(v_h, scores, None)
265 | return att_outs + fsmn_memory, cache
266 |
267 |
268 | class LayerNorm(nn.LayerNorm):
269 | def __init__(self, *args, **kwargs):
270 | super().__init__(*args, **kwargs)
271 |
272 | def forward(self, input):
273 | output = F.layer_norm(
274 | input.float(),
275 | self.normalized_shape,
276 | self.weight.float() if self.weight is not None else None,
277 | self.bias.float() if self.bias is not None else None,
278 | self.eps,
279 | )
280 | return output.type_as(input)
281 |
282 |
283 | def sequence_mask(lengths, maxlen=None, dtype=torch.float32, device=None):
284 | if maxlen is None:
285 | maxlen = lengths.max()
286 | row_vector = torch.arange(0, maxlen, 1).to(lengths.device)
287 | matrix = torch.unsqueeze(lengths, dim=-1)
288 | mask = row_vector < matrix
289 | mask = mask.detach()
290 |
291 | return mask.type(dtype).to(device) if device is not None else mask.type(dtype)
292 |
293 |
294 | class EncoderLayerSANM(nn.Module):
295 | def __init__(
296 | self,
297 | in_size,
298 | size,
299 | self_attn,
300 | feed_forward,
301 | dropout_rate,
302 | normalize_before=True,
303 | concat_after=False,
304 | stochastic_depth_rate=0.0,
305 | ):
306 | """Construct an EncoderLayer object."""
307 | super(EncoderLayerSANM, self).__init__()
308 | self.self_attn = self_attn
309 | self.feed_forward = feed_forward
310 | self.norm1 = LayerNorm(in_size)
311 | self.norm2 = LayerNorm(size)
312 | self.dropout = nn.Dropout(dropout_rate)
313 | self.in_size = in_size
314 | self.size = size
315 | self.normalize_before = normalize_before
316 | self.concat_after = concat_after
317 | if self.concat_after:
318 | self.concat_linear = nn.Linear(size + size, size)
319 | self.stochastic_depth_rate = stochastic_depth_rate
320 | self.dropout_rate = dropout_rate
321 |
322 | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
323 | """Compute encoded features.
324 |
325 | Args:
326 | x_input (torch.Tensor): Input tensor (#batch, time, size).
327 | mask (torch.Tensor): Mask tensor for the input (#batch, time).
328 | cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
329 |
330 | Returns:
331 | torch.Tensor: Output tensor (#batch, time, size).
332 | torch.Tensor: Mask tensor (#batch, time).
333 |
334 | """
335 | skip_layer = False
336 | # with stochastic depth, residual connection `x + f(x)` becomes
337 | # `x <- x + 1 / (1 - p) * f(x)` at training time.
338 | stoch_layer_coeff = 1.0
339 | if self.training and self.stochastic_depth_rate > 0:
340 | skip_layer = torch.rand(1).item() < self.stochastic_depth_rate
341 | stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate)
342 |
343 | if skip_layer:
344 | if cache is not None:
345 | x = torch.cat([cache, x], dim=1)
346 | return x, mask
347 |
348 | residual = x
349 | if self.normalize_before:
350 | x = self.norm1(x)
351 |
352 | if self.concat_after:
353 | x_concat = torch.cat(
354 | (
355 | x,
356 | self.self_attn(
357 | x,
358 | mask,
359 | mask_shfit_chunk=mask_shfit_chunk,
360 | mask_att_chunk_encoder=mask_att_chunk_encoder,
361 | ),
362 | ),
363 | dim=-1,
364 | )
365 | if self.in_size == self.size:
366 | x = residual + stoch_layer_coeff * self.concat_linear(x_concat)
367 | else:
368 | x = stoch_layer_coeff * self.concat_linear(x_concat)
369 | else:
370 | if self.in_size == self.size:
371 | x = residual + stoch_layer_coeff * self.dropout(
372 | self.self_attn(
373 | x,
374 | mask,
375 | mask_shfit_chunk=mask_shfit_chunk,
376 | mask_att_chunk_encoder=mask_att_chunk_encoder,
377 | )
378 | )
379 | else:
380 | x = stoch_layer_coeff * self.dropout(
381 | self.self_attn(
382 | x,
383 | mask,
384 | mask_shfit_chunk=mask_shfit_chunk,
385 | mask_att_chunk_encoder=mask_att_chunk_encoder,
386 | )
387 | )
388 | if not self.normalize_before:
389 | x = self.norm1(x)
390 |
391 | residual = x
392 | if self.normalize_before:
393 | x = self.norm2(x)
394 | x = residual + stoch_layer_coeff * self.dropout(self.feed_forward(x))
395 | if not self.normalize_before:
396 | x = self.norm2(x)
397 |
398 | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder
399 |
400 | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
401 | """Compute encoded features.
402 |
403 | Args:
404 | x_input (torch.Tensor): Input tensor (#batch, time, size).
405 | mask (torch.Tensor): Mask tensor for the input (#batch, time).
406 | cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
407 |
408 | Returns:
409 | torch.Tensor: Output tensor (#batch, time, size).
410 | torch.Tensor: Mask tensor (#batch, time).
411 |
412 | """
413 |
414 | residual = x
415 | if self.normalize_before:
416 | x = self.norm1(x)
417 |
418 | if self.in_size == self.size:
419 | attn, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
420 | x = residual + attn
421 | else:
422 | x, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
423 |
424 | if not self.normalize_before:
425 | x = self.norm1(x)
426 |
427 | residual = x
428 | if self.normalize_before:
429 | x = self.norm2(x)
430 | x = residual + self.feed_forward(x)
431 | if not self.normalize_before:
432 | x = self.norm2(x)
433 |
434 | return x, cache
435 |
436 |
437 | @tables.register("encoder_classes", "SenseVoiceEncoderSmall")
438 | class SenseVoiceEncoderSmall(nn.Module):
439 | """
440 | Author: Speech Lab of DAMO Academy, Alibaba Group
441 | SCAMA: Streaming chunk-aware multihead attention for online end-to-end speech recognition
442 | https://arxiv.org/abs/2006.01713
443 | """
444 |
445 | def __init__(
446 | self,
447 | input_size: int,
448 | output_size: int = 256,
449 | attention_heads: int = 4,
450 | linear_units: int = 2048,
451 | num_blocks: int = 6,
452 | tp_blocks: int = 0,
453 | dropout_rate: float = 0.1,
454 | positional_dropout_rate: float = 0.1,
455 | attention_dropout_rate: float = 0.0,
456 | stochastic_depth_rate: float = 0.0,
457 | input_layer: Optional[str] = "conv2d",
458 | pos_enc_class=SinusoidalPositionEncoder,
459 | normalize_before: bool = True,
460 | concat_after: bool = False,
461 | positionwise_layer_type: str = "linear",
462 | positionwise_conv_kernel_size: int = 1,
463 | padding_idx: int = -1,
464 | kernel_size: int = 11,
465 | sanm_shfit: int = 0,
466 | selfattention_layer_type: str = "sanm",
467 | **kwargs,
468 | ):
469 | super().__init__()
470 | self._output_size = output_size
471 |
472 | self.embed = SinusoidalPositionEncoder()
473 |
474 | self.normalize_before = normalize_before
475 |
476 | positionwise_layer = PositionwiseFeedForward
477 | positionwise_layer_args = (
478 | output_size,
479 | linear_units,
480 | dropout_rate,
481 | )
482 |
483 | encoder_selfattn_layer = MultiHeadedAttentionSANM
484 | encoder_selfattn_layer_args0 = (
485 | attention_heads,
486 | input_size,
487 | output_size,
488 | attention_dropout_rate,
489 | kernel_size,
490 | sanm_shfit,
491 | )
492 | encoder_selfattn_layer_args = (
493 | attention_heads,
494 | output_size,
495 | output_size,
496 | attention_dropout_rate,
497 | kernel_size,
498 | sanm_shfit,
499 | )
500 |
501 | self.encoders0 = nn.ModuleList(
502 | [
503 | EncoderLayerSANM(
504 | input_size,
505 | output_size,
506 | encoder_selfattn_layer(*encoder_selfattn_layer_args0),
507 | positionwise_layer(*positionwise_layer_args),
508 | dropout_rate,
509 | )
510 | for i in range(1)
511 | ]
512 | )
513 | self.encoders = nn.ModuleList(
514 | [
515 | EncoderLayerSANM(
516 | output_size,
517 | output_size,
518 | encoder_selfattn_layer(*encoder_selfattn_layer_args),
519 | positionwise_layer(*positionwise_layer_args),
520 | dropout_rate,
521 | )
522 | for i in range(num_blocks - 1)
523 | ]
524 | )
525 |
526 | self.tp_encoders = nn.ModuleList(
527 | [
528 | EncoderLayerSANM(
529 | output_size,
530 | output_size,
531 | encoder_selfattn_layer(*encoder_selfattn_layer_args),
532 | positionwise_layer(*positionwise_layer_args),
533 | dropout_rate,
534 | )
535 | for i in range(tp_blocks)
536 | ]
537 | )
538 |
539 | self.after_norm = LayerNorm(output_size)
540 |
541 | self.tp_norm = LayerNorm(output_size)
542 |
543 | def output_size(self) -> int:
544 | return self._output_size
545 |
546 | def forward(
547 | self,
548 | xs_pad: torch.Tensor,
549 | ilens: torch.Tensor,
550 | ):
551 | """Embed positions in tensor."""
552 | masks = sequence_mask(ilens, device=ilens.device)[:, None, :]
553 |
554 | xs_pad *= self.output_size() ** 0.5
555 |
556 | xs_pad = self.embed(xs_pad)
557 |
558 | # forward encoder1
559 | for layer_idx, encoder_layer in enumerate(self.encoders0):
560 | encoder_outs = encoder_layer(xs_pad, masks)
561 | xs_pad, masks = encoder_outs[0], encoder_outs[1]
562 |
563 | for layer_idx, encoder_layer in enumerate(self.encoders):
564 | encoder_outs = encoder_layer(xs_pad, masks)
565 | xs_pad, masks = encoder_outs[0], encoder_outs[1]
566 |
567 | xs_pad = self.after_norm(xs_pad)
568 |
569 | # forward encoder2
570 | olens = masks.squeeze(1).sum(1).int()
571 |
572 | for layer_idx, encoder_layer in enumerate(self.tp_encoders):
573 | encoder_outs = encoder_layer(xs_pad, masks)
574 | xs_pad, masks = encoder_outs[0], encoder_outs[1]
575 |
576 | xs_pad = self.tp_norm(xs_pad)
577 | return xs_pad, olens
578 |
579 |
580 | @tables.register("model_classes", "SenseVoiceSmall")
581 | class SenseVoiceSmall(nn.Module):
582 | """CTC-attention hybrid Encoder-Decoder model"""
583 |
584 | def __init__(
585 | self,
586 | specaug: str = None,
587 | specaug_conf: dict = None,
588 | normalize: str = None,
589 | normalize_conf: dict = None,
590 | encoder: str = None,
591 | encoder_conf: dict = None,
592 | ctc_conf: dict = None,
593 | input_size: int = 80,
594 | vocab_size: int = -1,
595 | ignore_id: int = -1,
596 | blank_id: int = 0,
597 | sos: int = 1,
598 | eos: int = 2,
599 | length_normalized_loss: bool = False,
600 | **kwargs,
601 | ):
602 |
603 | super().__init__()
604 |
605 | if specaug is not None:
606 | specaug_class = tables.specaug_classes.get(specaug)
607 | specaug = specaug_class(**specaug_conf)
608 | if normalize is not None:
609 | normalize_class = tables.normalize_classes.get(normalize)
610 | normalize = normalize_class(**normalize_conf)
611 | encoder_class = tables.encoder_classes.get(encoder)
612 | encoder = encoder_class(input_size=input_size, **encoder_conf)
613 | encoder_output_size = encoder.output_size()
614 |
615 | if ctc_conf is None:
616 | ctc_conf = {}
617 | ctc = CTC(odim=vocab_size, encoder_output_size=encoder_output_size, **ctc_conf)
618 |
619 | self.blank_id = blank_id
620 | self.sos = sos if sos is not None else vocab_size - 1
621 | self.eos = eos if eos is not None else vocab_size - 1
622 | self.vocab_size = vocab_size
623 | self.ignore_id = ignore_id
624 | self.specaug = specaug
625 | self.normalize = normalize
626 | self.encoder = encoder
627 | self.error_calculator = None
628 |
629 | self.ctc = ctc
630 |
631 | self.length_normalized_loss = length_normalized_loss
632 | self.encoder_output_size = encoder_output_size
633 |
634 | self.lid_dict = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13}
635 | self.lid_int_dict = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
636 | self.textnorm_dict = {"withitn": 14, "woitn": 15}
637 | self.textnorm_int_dict = {25016: 14, 25017: 15}
638 | self.embed = torch.nn.Embedding(7 + len(self.lid_dict) + len(self.textnorm_dict), input_size)
639 | self.emo_dict = {"unk": 25009, "happy": 25001, "sad": 25002, "angry": 25003, "neutral": 25004}
640 |
641 | self.criterion_att = LabelSmoothingLoss(
642 | size=self.vocab_size,
643 | padding_idx=self.ignore_id,
644 | smoothing=kwargs.get("lsm_weight", 0.0),
645 | normalize_length=self.length_normalized_loss,
646 | )
647 |
648 | @staticmethod
649 | def from_pretrained(model:str=None, **kwargs):
650 | from funasr import AutoModel
651 | model, kwargs = AutoModel.build_model(model=model, trust_remote_code=True, **kwargs)
652 |
653 | return model, kwargs
654 |
655 | def forward(
656 | self,
657 | speech: torch.Tensor,
658 | speech_lengths: torch.Tensor,
659 | text: torch.Tensor,
660 | text_lengths: torch.Tensor,
661 | **kwargs,
662 | ):
663 | """Encoder + Decoder + Calc loss
664 | Args:
665 | speech: (Batch, Length, ...)
666 | speech_lengths: (Batch, )
667 | text: (Batch, Length)
668 | text_lengths: (Batch,)
669 | """
670 | # import pdb;
671 | # pdb.set_trace()
672 | if len(text_lengths.size()) > 1:
673 | text_lengths = text_lengths[:, 0]
674 | if len(speech_lengths.size()) > 1:
675 | speech_lengths = speech_lengths[:, 0]
676 |
677 | batch_size = speech.shape[0]
678 |
679 | # 1. Encoder
680 | encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, text)
681 |
682 | loss_ctc, cer_ctc = None, None
683 | loss_rich, acc_rich = None, None
684 | stats = dict()
685 |
686 | loss_ctc, cer_ctc = self._calc_ctc_loss(
687 | encoder_out[:, 4:, :], encoder_out_lens - 4, text[:, 4:], text_lengths - 4
688 | )
689 |
690 | loss_rich, acc_rich = self._calc_rich_ce_loss(
691 | encoder_out[:, :4, :], text[:, :4]
692 | )
693 |
694 | loss = loss_ctc + loss_rich
695 | # Collect total loss stats
696 | stats["loss_ctc"] = torch.clone(loss_ctc.detach()) if loss_ctc is not None else None
697 | stats["loss_rich"] = torch.clone(loss_rich.detach()) if loss_rich is not None else None
698 | stats["loss"] = torch.clone(loss.detach()) if loss is not None else None
699 | stats["acc_rich"] = acc_rich
700 |
701 | # force_gatherable: to-device and to-tensor if scalar for DataParallel
702 | if self.length_normalized_loss:
703 | batch_size = int((text_lengths + 1).sum())
704 | loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
705 | return loss, stats, weight
706 |
707 | def encode(
708 | self,
709 | speech: torch.Tensor,
710 | speech_lengths: torch.Tensor,
711 | text: torch.Tensor,
712 | **kwargs,
713 | ):
714 | """Frontend + Encoder. Note that this method is used by asr_inference.py
715 | Args:
716 | speech: (Batch, Length, ...)
717 | speech_lengths: (Batch, )
718 | ind: int
719 | """
720 |
721 | # Data augmentation
722 | if self.specaug is not None and self.training:
723 | speech, speech_lengths = self.specaug(speech, speech_lengths)
724 |
725 | # Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
726 | if self.normalize is not None:
727 | speech, speech_lengths = self.normalize(speech, speech_lengths)
728 |
729 |
730 | lids = torch.LongTensor([[self.lid_int_dict[int(lid)] if torch.rand(1) > 0.2 and int(lid) in self.lid_int_dict else 0 ] for lid in text[:, 0]]).to(speech.device)
731 | language_query = self.embed(lids)
732 |
733 | styles = torch.LongTensor([[self.textnorm_int_dict[int(style)]] for style in text[:, 3]]).to(speech.device)
734 | style_query = self.embed(styles)
735 | speech = torch.cat((style_query, speech), dim=1)
736 | speech_lengths += 1
737 |
738 | event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(speech.size(0), 1, 1)
739 | input_query = torch.cat((language_query, event_emo_query), dim=1)
740 | speech = torch.cat((input_query, speech), dim=1)
741 | speech_lengths += 3
742 |
743 | encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths)
744 |
745 | return encoder_out, encoder_out_lens
746 |
747 | def _calc_ctc_loss(
748 | self,
749 | encoder_out: torch.Tensor,
750 | encoder_out_lens: torch.Tensor,
751 | ys_pad: torch.Tensor,
752 | ys_pad_lens: torch.Tensor,
753 | ):
754 | # Calc CTC loss
755 | loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
756 |
757 | # Calc CER using CTC
758 | cer_ctc = None
759 | if not self.training and self.error_calculator is not None:
760 | ys_hat = self.ctc.argmax(encoder_out).data
761 | cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
762 | return loss_ctc, cer_ctc
763 |
764 | def _calc_rich_ce_loss(
765 | self,
766 | encoder_out: torch.Tensor,
767 | ys_pad: torch.Tensor,
768 | ):
769 | decoder_out = self.ctc.ctc_lo(encoder_out)
770 | # 2. Compute attention loss
771 | loss_rich = self.criterion_att(decoder_out, ys_pad.contiguous())
772 | acc_rich = th_accuracy(
773 | decoder_out.view(-1, self.vocab_size),
774 | ys_pad.contiguous(),
775 | ignore_label=self.ignore_id,
776 | )
777 |
778 | return loss_rich, acc_rich
779 |
780 |
781 | def inference(
782 | self,
783 | data_in,
784 | data_lengths=None,
785 | key: list = ["wav_file_tmp_name"],
786 | tokenizer=None,
787 | frontend=None,
788 | **kwargs,
789 | ):
790 |
791 |
792 | meta_data = {}
793 | if (
794 | isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank"
795 | ): # fbank
796 | speech, speech_lengths = data_in, data_lengths
797 | if len(speech.shape) < 3:
798 | speech = speech[None, :, :]
799 | if speech_lengths is None:
800 | speech_lengths = speech.shape[1]
801 | else:
802 | # extract fbank feats
803 | time1 = time.perf_counter()
804 | audio_sample_list = load_audio_text_image_video(
805 | data_in,
806 | fs=frontend.fs,
807 | audio_fs=kwargs.get("fs", 16000),
808 | data_type=kwargs.get("data_type", "sound"),
809 | tokenizer=tokenizer,
810 | )
811 | time2 = time.perf_counter()
812 | meta_data["load_data"] = f"{time2 - time1:0.3f}"
813 | speech, speech_lengths = extract_fbank(
814 | audio_sample_list, data_type=kwargs.get("data_type", "sound"), frontend=frontend
815 | )
816 | time3 = time.perf_counter()
817 | meta_data["extract_feat"] = f"{time3 - time2:0.3f}"
818 | meta_data["batch_data_time"] = (
819 | speech_lengths.sum().item() * frontend.frame_shift * frontend.lfr_n / 1000
820 | )
821 |
822 | speech = speech.to(device=kwargs["device"])
823 | speech_lengths = speech_lengths.to(device=kwargs["device"])
824 |
825 | language = kwargs.get("language", "auto")
826 | language_query = self.embed(
827 | torch.LongTensor(
828 | [[self.lid_dict[language] if language in self.lid_dict else 0]]
829 | ).to(speech.device)
830 | ).repeat(speech.size(0), 1, 1)
831 |
832 | use_itn = kwargs.get("use_itn", False)
833 | textnorm = kwargs.get("text_norm", None)
834 | if textnorm is None:
835 | textnorm = "withitn" if use_itn else "woitn"
836 | textnorm_query = self.embed(
837 | torch.LongTensor([[self.textnorm_dict[textnorm]]]).to(speech.device)
838 | ).repeat(speech.size(0), 1, 1)
839 | speech = torch.cat((textnorm_query, speech), dim=1)
840 | speech_lengths += 1
841 |
842 | event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(
843 | speech.size(0), 1, 1
844 | )
845 | input_query = torch.cat((language_query, event_emo_query), dim=1)
846 | speech = torch.cat((input_query, speech), dim=1)
847 | speech_lengths += 3
848 |
849 | # Encoder
850 | encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths)
851 | if isinstance(encoder_out, tuple):
852 | encoder_out = encoder_out[0]
853 |
854 | # c. Passed the encoder result and the beam search
855 | ctc_logits = self.ctc.log_softmax(encoder_out)
856 | if kwargs.get("ban_emo_unk", False):
857 | ctc_logits[:, :, self.emo_dict["unk"]] = -float("inf")
858 |
859 | results = []
860 | b, n, d = encoder_out.size()
861 | if isinstance(key[0], (list, tuple)):
862 | key = key[0]
863 | if len(key) < b:
864 | key = key * b
865 | for i in range(b):
866 | x = ctc_logits[i, : encoder_out_lens[i].item(), :]
867 | yseq = x.argmax(dim=-1)
868 | yseq = torch.unique_consecutive(yseq, dim=-1)
869 |
870 | ibest_writer = None
871 | if kwargs.get("output_dir") is not None:
872 | if not hasattr(self, "writer"):
873 | self.writer = DatadirWriter(kwargs.get("output_dir"))
874 | ibest_writer = self.writer[f"1best_recog"]
875 |
876 | mask = yseq != self.blank_id
877 | token_int = yseq[mask].tolist()
878 |
879 | # Change integer-ids to tokens
880 | text = tokenizer.decode(token_int)
881 |
882 | result_i = {"key": key[i], "text": text}
883 | results.append(result_i)
884 |
885 | if ibest_writer is not None:
886 | ibest_writer["text"][key[i]] = text
887 |
888 | return results, meta_data
889 |
890 | def export(self, **kwargs):
891 | from export_meta import export_rebuild_model
892 |
893 | if "max_seq_len" not in kwargs:
894 | kwargs["max_seq_len"] = 512
895 | models = export_rebuild_model(model=self, **kwargs)
896 | return models
897 |
--------------------------------------------------------------------------------
/models/vad_onnx/silero_vad.onnx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/models/vad_onnx/silero_vad.onnx
--------------------------------------------------------------------------------
/pics/运行截图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/pics/运行截图.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | modelscope
2 | huggingface
3 | huggingface_hub
4 | funasr>=1.1.3
5 | numpy<=1.26.4
6 | gradio
7 | pyaudio
8 | addict
9 | datasets
10 | simplejson
11 | sortedcontainers
12 | sounddevice
13 | soundfile
14 | sherpa-onnx
15 | scipy
--------------------------------------------------------------------------------
/speak_log/log.txt:
--------------------------------------------------------------------------------
1 | 窗外的麻雀在电线杆上多嘴
2 | 你说这一句很有夏天的感觉
3 |
--------------------------------------------------------------------------------
/speakers/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/speakers/.gitkeep
--------------------------------------------------------------------------------
/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep
--------------------------------------------------------------------------------