├── .gitignore
├── .idea
│   ├── .gitignore
│   ├── SenseVoice-Real-Time.iml
│   ├── inspectionProfiles
│   │   ├── Project_Default.xml
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── vcs.xml
├── README.md
├── SenseVoiceSmall
│   └── .gitkeep
├── audio_convert.py
├── audio_logs
│   └── .gitkeep
├── demo1.py
├── demo_record_natural_voice_lock.py
├── demo_vad_voice_translate_async.py
├── model.py
├── models
│   └── vad_onnx
│       └── silero_vad.onnx
├── pics
│   └── 运行截图.png
├── requirements.txt
├── speak_log
│   └── log.txt
├── speakers
│   └── .gitkeep
└── speech_campplus_sv_zh_en_16k-common_advanced
    └── .gitkeep

/.gitignore:
--------------------------------------------------------------------------------
speakers/*
!speakers/.gitkeep
audio_logs/*
!audio_logs/.gitkeep
models/SenseVoiceSmall/*
!models/SenseVoiceSmall/.gitkeep
models/speech_campplus_sv_zh_en_16k-common_advanced/*
!models/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep
/improve/

--------------------------------------------------------------------------------
/.idea/.gitignore:
--------------------------------------------------------------------------------
# Default ignored files
/shelf/
/workspace.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml

--------------------------------------------------------------------------------
/.idea/SenseVoice-Real-Time.iml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/inspectionProfiles/Project_Default.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/inspectionProfiles/profiles_settings.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# SenseVoice-Real-Time

This is a small learning project. It implements a very basic voice activity detection (VAD) without using a model (that part really should be improved), speech transcription with SenseVoice, and speaker verification (a voiceprint lock) with CAM++. Nothing complicated, just something to play with; if you found your way here, that is pure serendipity.

PS: The section below is the usage guide for the old version. It still runs, of course, but for the latest instructions scroll down to the new-version guide at the bottom.

## How to run

1. **Install a CUDA build of torch; any version <= 2.3 is fine (the download may be slow)**

   ```
   pip install torch==2.2.1+cu118 torchaudio==2.2.1+cu118 --index-url https://download.pytorch.org/whl/cu118
   ```

2. **Install the dependencies from requirements.txt**

   ```
   pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
   ```

3. **Download `SenseVoiceSmall` from `modelscope`**

   Link: https://www.modelscope.cn/models/iic/SenseVoiceSmall/files

   Or fetch it directly via git lfs:

   ```
   git clone https://www.modelscope.cn/iic/SenseVoiceSmall.git
   ```

   After downloading, place it in the `SenseVoiceSmall` folder in the project root.

4. **Download `iic/speech_campplus_sv_zh_en_16k-common_advanced` from `modelscope`**

   Link: https://www.modelscope.cn/models/iic/speech_campplus_sv_zh_en_16k-common_advanced/files

   Or fetch it directly via git lfs:

   ```
   git clone https://www.modelscope.cn/iic/speech_campplus_sv_zh_en_16k-common_advanced.git
   ```

   After downloading, place it in the `speech_campplus_sv_zh_en_16k-common_advanced` folder in the project root.

5. **Record a short clip yourself (a quick phone recording is fine) and convert its sample rate with the provided script**

   First put the audio into the `speakers` folder.
   The project ships with `audio_convert.py`, which converts the audio to WAV and resamples it to 16 kHz, because the `speech_campplus` model can only handle 16 kHz audio.
   If you renamed your audio file, remember to update the file name in `demo_record_natural_voice_lock.py` as well:

   ```Python
   def main():
       # Create the save directory if it does not exist
       save_directory = "audio_logs"
       os.makedirs(save_directory, exist_ok=True)
       # Load the reference audio for the voiceprint lock; update this if you renamed your file
       reference_audio = "speakers/speaker_mine_converted.wav"
   ```

6. Run `demo_record_natural_voice_lock.py`

   ```
   python demo_record_natural_voice_lock.py
   ```

## Example run

![运行截图](./pics/运行截图.png)

If the results are not great, you can tweak the parameters in `demo_record_natural_voice_lock.py`:

```Python
# Audio parameters
CHUNK = 1024
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
MAX_TIME = 60  # maximum recording time (seconds)

# VAD parameters
THRESHOLD = 500
SILENCE_LIMIT = 2

# Speaker-verification parameters
SIMILARITY_THRESHOLD = 0.1  # similarity threshold, adjust as needed
```
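To pick a sensible `SIMILARITY_THRESHOLD`, it helps to look at the raw similarity scores your own recordings actually get. Below is a minimal sketch of the same speaker-verification call that `demo_record_natural_voice_lock.py` makes; the test-recording path is just a placeholder, and `model=` should point at whichever CAM++ directory you downloaded.

```Python
from modelscope.pipelines import pipeline

# Speaker-verification pipeline, built the same way as in
# demo_record_natural_voice_lock.py (CAM++ voiceprint model).
sv_pipeline = pipeline(
    task='speaker-verification',
    model='models/speech_campplus_sv_zh-cn_16k-common',
    model_revision='v1.0.0'
)

# Compare the enrolled reference audio with a test recording.
# Both files must be 16 kHz mono WAV (see audio_convert.py).
# "audio_logs/my_test_recording.wav" is a hypothetical example file.
result = sv_pipeline(["speakers/speaker_mine_converted.wav",
                      "audio_logs/my_test_recording.wav"])
print(result['score'])  # scores >= SIMILARITY_THRESHOLD open the voice lock
```

The demo itself also prints the full result dict during verification, so you can simply watch the scores it logs while you test.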
Also, SenseVoice's transcription error rate felt a bit high to me. Being able to recognize Cantonese is nice, though, and it uses little VRAM. You can also swap in other models on top of this project.

# 2024.09.23 update

This is a big update. The original SenseVoice setup had a rather high character error rate, so this version switches to another model, and it is still simple to use.

What it can do now: endpoint detection with a VAD model, higher recognition accuracy, lower VRAM usage, and punctuation added after recognition. (No voiceprint lock this time, but adding one is easy.)

First, if you cloned this project before, please pull the latest files.

Then run pip install again, since a few dependencies were added:

```
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

**Next, download these models:**

1. `speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch`

   ```
   git clone https://www.modelscope.cn/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.git
   ```

2. `speech_fsmn_vad_zh-cn-16k-common-pytorch`

   ```
   git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch.git
   ```

3. `punc_ct-transformer_zh-cn-common-vocab272727-pytorch`

   ```
   git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch.git
   ```

PS: I did not add speaker verification in the new version, but it is straightforward: check the official docs of the `speech_campplus_sv_zh-cn_16k-common` model, add the model here, and convert your audio to 16 kHz for it as described in the old-version guide above.

Put all the models into the `models` folder in the project root, each in its own subfolder; as long as the paths line up, you are fine.

```py
asr_model = AutoModel(model="./models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
                      model_revision="v2.0.4",
                      vad_model="./models/speech_fsmn_vad_zh-cn-16k-common-pytorch",
                      vad_model_revision="v2.0.4",
                      punc_model="./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
                      punc_model_revision="v2.0.4",
                      # Add these two lines if you want speaker verification
                      # spk_model="./models/speech_campplus_sv_zh-cn_16k-common",
                      # spk_model_revision="v2.0.2",
                      device="cuda:0")
```

Make sure these two paths exist; `silero_vad` is the model actually used for endpoint detection:

```py
vad_model_path = './models/vad_onnx/silero_vad.onnx'  # path to the VAD onnx model
audio_save_path = './audio_logs'  # path for temporary audio files
```

So why does the `AutoModel` call above still include a VAD model? Because without it the PUNC model does not take effect; I did not bother splitting them apart, and it does not seem to cost much performance.

If that bothers you, you can remove the VAD from `AutoModel` and run the PUNC model separately; the code has notes on this, and there is a short sketch below.
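A minimal sketch of that standalone punctuation usage, based on the comments in `demo_vad_voice_translate_async.py` (here `"ct-punc"` is resolved by funasr/modelscope; you can also point `model=` at the local `./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch` directory):

```py
from funasr import AutoModel

# Punctuation restoration on its own, as sketched in the comments of
# demo_vad_voice_translate_async.py.
punc_model = AutoModel(model="ct-punc", model_revision="v2.0.4")
res = punc_model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
```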
Tune these parameters for better results:

```py
max_files = 10  # maximum number of temporary audio files to keep
window_size = 512  # speech window size
min_silence_duration = 1.5  # minimum silence duration (s)
min_speech_duration = 0.05  # minimum speech duration (s)
segment_count = 0  # counter used to name the audio files
```

Finally, run `demo_vad_voice_translate_async.py`.

The project uses multithreading so that the next recording can already be captured while the previous segment is still being recognized, which makes "real time" a bit more real-time.

If you run into problems, open an Issue.

--------------------------------------------------------------------------------
/SenseVoiceSmall/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/SenseVoiceSmall/.gitkeep

--------------------------------------------------------------------------------
/audio_convert.py:
--------------------------------------------------------------------------------
from pydub import AudioSegment

# Read the audio file; change the path and file name to match your own recording
audio = AudioSegment.from_file("./speakers/speaker_mine.mp3")

# Convert to mono and adjust the sample rate
audio = audio.set_channels(1)
audio = audio.set_frame_rate(16000)  # resample to 16 kHz if needed

# Save the converted audio file
audio.export("./speakers/speaker_mine_converted.wav", format="wav")

--------------------------------------------------------------------------------
/audio_logs/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/audio_logs/.gitkeep

--------------------------------------------------------------------------------
/demo1.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- encoding: utf-8 -*-
3 | # Copyright FunASR (https://github.com/FunAudioLLM/SenseVoice). All Rights Reserved.
4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | from funasr import AutoModel 7 | from funasr.utils.postprocess_utils import rich_transcription_postprocess 8 | 9 | model_dir = "models/SenseVoiceSmall" 10 | 11 | model = AutoModel( 12 | model=model_dir, 13 | trust_remote_code=True, 14 | remote_code="./model.py", 15 | vad_model="fsmn-vad", 16 | vad_kwargs={"max_single_segment_time": 30000}, 17 | device="cuda:0", 18 | ) 19 | 20 | # en 21 | res = model.generate( 22 | input=f"{model.model_path}/example/en.mp3", 23 | cache={}, 24 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" 25 | use_itn=True, 26 | batch_size_s=60, 27 | merge_vad=True, # 28 | merge_length_s=15, 29 | ) 30 | text = rich_transcription_postprocess(res[0]["text"]) 31 | print(text) 32 | 33 | # zh 34 | res = model.generate( 35 | input=f"{model.model_path}/example/zh.mp3", 36 | cache={}, 37 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" 38 | use_itn=True, 39 | batch_size_s=60, 40 | merge_vad=True, # 41 | merge_length_s=15, 42 | ) 43 | text = rich_transcription_postprocess(res[0]["text"]) 44 | print(text) 45 | 46 | # yue 47 | res = model.generate( 48 | input=f"{model.model_path}/example/yue.mp3", 49 | cache={}, 50 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" 51 | use_itn=True, 52 | batch_size_s=60, 53 | merge_vad=True, # 54 | merge_length_s=15, 55 | ) 56 | text = rich_transcription_postprocess(res[0]["text"]) 57 | print(text) 58 | 59 | # ja 60 | res = model.generate( 61 | input=f"{model.model_path}/example/ja.mp3", 62 | cache={}, 63 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" 64 | use_itn=True, 65 | batch_size_s=60, 66 | merge_vad=True, # 67 | merge_length_s=15, 68 | ) 69 | text = rich_transcription_postprocess(res[0]["text"]) 70 | print(text) 71 | 72 | # ko 73 | res = model.generate( 74 | input=f"{model.model_path}/example/ko.mp3", 75 | cache={}, 76 | language="auto", # "zh", "en", "yue", "ja", "ko", "nospeech" 77 | use_itn=True, 78 | batch_size_s=60, 79 | merge_vad=True, # 80 | merge_length_s=15, 81 | ) 82 | text = rich_transcription_postprocess(res[0]["text"]) 83 | print(text) 84 | -------------------------------------------------------------------------------- /demo_record_natural_voice_lock.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os 3 | 4 | import pyaudio 5 | import wave 6 | import numpy as np 7 | import time 8 | from modelscope.pipelines import pipeline 9 | from model import SenseVoiceSmall 10 | from funasr.utils.postprocess_utils import rich_transcription_postprocess 11 | 12 | # 音频参数 13 | CHUNK = 1024 14 | FORMAT = pyaudio.paInt16 15 | CHANNELS = 1 16 | RATE = 16000 17 | MAX_TIME = 60 # 最大录音时间(秒) 18 | 19 | # VAD 参数 20 | THRESHOLD = 1000 21 | SILENCE_LIMIT = 2 22 | 23 | # 声纹识别参数 24 | SIMILARITY_THRESHOLD = 0.1 # 相似度阈值,可以根据需要调整 25 | 26 | accumulated_audio = np.array([]) # 用于存储累积的音频数据 27 | chunk_size = 200 # ms 28 | sample_rate = 16000 29 | chunk_stride = int(chunk_size * sample_rate / 1000) 30 | 31 | 32 | # 初始化声纹识别模型 33 | sv_pipeline = pipeline( 34 | task='speaker-verification', 35 | model='models/speech_campplus_sv_zh-cn_16k-common', 36 | model_revision='v1.0.0' 37 | ) 38 | 39 | # 初始化 SenseVoiceSmall 模型 40 | model_dir = "models/SenseVoiceSmall" 41 | m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0") 42 | m.eval() 43 | 44 | 45 | def is_silent(data_chunk): 46 | return max(data_chunk) < THRESHOLD 47 | 48 | 49 | def record_audio(): 50 | p = 
pyaudio.PyAudio() 51 | stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK) 52 | 53 | print("开始监听...") 54 | audio_buffer = [] 55 | silence_start = None 56 | is_recording = False 57 | 58 | while True: 59 | data = stream.read(CHUNK) 60 | audio_buffer.append(data) 61 | 62 | if len(audio_buffer) > RATE / CHUNK * MAX_TIME: 63 | audio_buffer.pop(0) 64 | 65 | if not is_recording: 66 | if not is_silent(np.frombuffer(data, dtype=np.int16)): 67 | print("检测到声音,开始录音...") 68 | is_recording = True 69 | silence_start = None 70 | else: 71 | if is_silent(np.frombuffer(data, dtype=np.int16)): 72 | if silence_start is None: 73 | silence_start = time.time() 74 | elif time.time() - silence_start > SILENCE_LIMIT: 75 | print("检测到静音,停止录音") 76 | break 77 | else: 78 | silence_start = None 79 | 80 | stream.stop_stream() 81 | stream.close() 82 | p.terminate() 83 | 84 | return b''.join(audio_buffer) 85 | 86 | def save_audio(data, filename): 87 | wf = wave.open(filename, 'wb') 88 | wf.setnchannels(CHANNELS) 89 | wf.setsampwidth(pyaudio.PyAudio().get_sample_size(FORMAT)) 90 | wf.setframerate(RATE) 91 | wf.writeframes(data) 92 | wf.close() 93 | 94 | 95 | def verify_voice(reference_audio, input_audio): 96 | result = sv_pipeline([reference_audio, input_audio]) 97 | print(result) 98 | return result['score'] # 返回相似度得分 99 | 100 | 101 | def transcribe_audio(audio_file): 102 | res = m.inference( 103 | data_in=audio_file, 104 | language="auto", 105 | use_itn=False, 106 | ban_emo_unk=False, 107 | **kwargs, 108 | ) 109 | 110 | text = rich_transcription_postprocess(res[0][0]["text"]) 111 | return text 112 | 113 | 114 | # 手动热词 115 | def replace_diy_hotword(sentence): 116 | # 定义一个Map,用于存储高频错词和对应的替换词 117 | error_hotkey_map = { 118 | '你冇': '蕾姆', 119 | '我冇': '蕾姆', 120 | '雷母': '蕾姆', 121 | '雷姆': '蕾姆', 122 | '蕾母': '蕾姆', 123 | '雷冇': '蕾姆', 124 | '蕾冇': '蕾姆', 125 | '人母': '蕾姆', 126 | '你悟': '蕾姆', 127 | '你姆': '蕾姆', 128 | '人冇': '蕾姆', 129 | '人姆': '蕾姆', 130 | '李慕': '蕾姆', 131 | # 添加更多的错词和替换词 132 | } 133 | for wrong_word, correct_word in error_hotkey_map.items(): 134 | # 每次替换都更新 sentence 135 | sentence = sentence.replace(wrong_word, correct_word) 136 | # print("corrected_text:{}".format(sentence)) 137 | return sentence 138 | 139 | 140 | def main(): 141 | # 创建保存目录(如果目录不存在) 142 | save_directory = "audio_logs" 143 | os.makedirs(save_directory, exist_ok=True) 144 | # 加载声纹锁示例音频 145 | reference_audio = "speakers/speaker_mine_converted.wav" 146 | 147 | max_files = 10 # 最多保留的文件数量 148 | 149 | while True: 150 | audio_data = record_audio() 151 | # 使用 os.path.join 来拼接保存路径 152 | output_filename = os.path.join(save_directory, f"recorded_audio_{int(time.time())}.wav") 153 | save_audio(audio_data, output_filename) 154 | 155 | # 检查并删除最旧的文件(如果文件数量超过 max_files) 156 | existing_files = glob.glob(os.path.join(save_directory, "*.wav")) 157 | if len(existing_files) > max_files: 158 | # 按文件的修改时间排序 159 | existing_files.sort(key=os.path.getmtime) 160 | # 删除最旧的文件 161 | os.remove(existing_files[0]) 162 | 163 | print("正在进行声纹验证...") 164 | similarity = verify_voice(reference_audio, output_filename) 165 | 166 | if similarity >= SIMILARITY_THRESHOLD: 167 | print(f"声纹验证通过 (相似度: {similarity:.2f})") 168 | print("正在进行语音识别...") 169 | transcribed_text = replace_diy_hotword(transcribe_audio(output_filename)) 170 | print("识别结果:") 171 | print(transcribed_text) 172 | 173 | log_dir = 'speak_log' 174 | if not os.path.exists(log_dir): 175 | os.makedirs(log_dir) 176 | with open(os.path.join(log_dir, 'log.txt'), 'a', encoding='utf-8') as f: 177 | if 
transcribed_text is not None or transcribed_text != "": 178 | f.write(transcribed_text) 179 | f.write("\n") 180 | f.flush() # 刷新缓冲区,确保数据写入磁盘 181 | else: 182 | print(f"声纹验证失败 (相似度: {similarity:.2f})") 183 | 184 | print("\n准备进行下一次录音,按 Ctrl+C 退出程序") 185 | 186 | 187 | if __name__ == "__main__": 188 | try: 189 | main() 190 | except KeyboardInterrupt: 191 | print("程序已退出") 192 | -------------------------------------------------------------------------------- /demo_vad_voice_translate_async.py: -------------------------------------------------------------------------------- 1 | import gc 2 | import glob 3 | import time 4 | import numpy as np 5 | import sounddevice as sd 6 | import sherpa_onnx 7 | import os 8 | import scipy.io.wavfile as wav 9 | import torch 10 | from funasr import AutoModel 11 | from concurrent.futures import ThreadPoolExecutor 12 | 13 | g_sample_rate = 16000 # 采样率,只能16k哦 14 | vad_model_path = './models/vad_onnx/silero_vad.onnx' # VAD onnx模型路径 15 | audio_save_path = './audio_logs' # 保存临时音频文件的路径 16 | max_files = 10 # 最多保留的临时音频数量 17 | window_size = 512 # 语音窗口大小 18 | min_silence_duration = 1.5 # 最小静音持续时间(s) 19 | min_speech_duration = 0.05 # 最小语音持续时间(s) 20 | segment_count = 0 # 用于生成音频文件的命名 21 | 22 | # 初始化ASR、PUNC、SPK模型 23 | # 为什么要加一个VAD模型呢在这?因为这里不加VAD的话PUNC就不生效,我也懒得再分开了,感觉也不吃性能。 24 | # 觉得膈应的可以把AutoModel里的VAD去掉,然后把PUNC单独拆开,示例: 25 | # model = AutoModel(model="ct-punc", model_revision="v2.0.4") 26 | # res = model.generate(input="那今天的会就到这里吧 happy new year 明年见") 27 | asr_model = AutoModel(model="./models/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch", 28 | model_revision="v2.0.4", 29 | vad_model="./models/speech_fsmn_vad_zh-cn-16k-common-pytorch", 30 | vad_model_revision="v2.0.4", 31 | punc_model="./models/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 32 | punc_model_revision="v2.0.4", 33 | # 想搞说话人确认的在这加进来 34 | # spk_model="./models/speech_campplus_sv_zh-cn_16k-common", 35 | # spk_model_revision="v2.0.2", 36 | device="cuda:0") 37 | 38 | 39 | def save_audio_to_file(audio_data, file_name): 40 | output_dir = os.path.dirname(file_name) 41 | if not os.path.exists(output_dir): 42 | os.makedirs(output_dir) 43 | 44 | amplified_audio = audio_data * 2 45 | amplified_audio = np.clip(amplified_audio, -1.0, 1.0) 46 | wav.write(file_name, g_sample_rate, amplified_audio.astype(np.float32)) 47 | print(f"Audio saved to {file_name}") 48 | 49 | 50 | def rm_file(): 51 | # 检查并删除最旧的文件(如果文件数量超过 max_files) 52 | existing_files = glob.glob(os.path.join(audio_save_path, "*.wav")) 53 | if len(existing_files) > max_files: 54 | # 按文件的修改时间排序 55 | existing_files.sort(key=os.path.getmtime) 56 | # 删除最旧的文件 57 | os.remove(existing_files[0]) 58 | 59 | 60 | def asr_inference_task(file_name, segment_count): 61 | """ 62 | ASR 推理任务:从文件加载音频并进行识别。 63 | """ 64 | try: 65 | asr_res = asr_model.generate(input=file_name, 66 | batch_size_s=300, 67 | hotword='蕾姆 拉姆') 68 | for resi in asr_res: 69 | print(f"[{segment_count}] 结果:{resi['text']}") 70 | print("--" * 150) 71 | finally: 72 | # 手动清理显存和对象 73 | del asr_res # 删除结果对象 74 | torch.cuda.empty_cache() # 释放显存 75 | gc.collect() # 强制进行垃圾回收 76 | 77 | 78 | def start_recording_with_vad(vad_model_path, window_size): 79 | global segment_count 80 | vad_config = sherpa_onnx.VadModelConfig() 81 | vad_config.silero_vad.model = vad_model_path 82 | vad_config.silero_vad.min_silence_duration = min_silence_duration 83 | vad_config.silero_vad.min_speech_duration = min_speech_duration 84 | vad_config.sample_rate = g_sample_rate 85 | 86 | vad = 
sherpa_onnx.VoiceActivityDetector(vad_config, buffer_size_in_seconds=100) 87 | 88 | samples_per_read = int(0.1 * g_sample_rate) 89 | 90 | print("--" * 150) 91 | devices = sd.query_devices() 92 | if len(devices) == 0: 93 | print("No microphone devices found") 94 | return 95 | 96 | default_input_device_idx = sd.default.device[0] 97 | print(f'当前使用默认设备: 【{devices[default_input_device_idx]["name"]}】') 98 | print("开始监听...") 99 | 100 | buffer = np.array([]) # 初始化缓冲区 101 | preroll_buffer = np.array([]) # 用于存储前导音频 102 | preroll_duration = 0.3 # 希望保留的前导音频的持续时间(s) 103 | preroll_samples = int(preroll_duration * g_sample_rate) # 希望保留的前导音频的采样点数(0.2s * 16000 = 3200个采样点) 104 | 105 | # 初始化线程池执行器 106 | with ThreadPoolExecutor(max_workers=2) as executor: 107 | with sd.InputStream(channels=1, dtype="float32", samplerate=g_sample_rate) as stream: 108 | while True: 109 | samples, _ = stream.read(samples_per_read) # 每次读取 0.1 秒的音频数据 110 | samples = samples.reshape(-1) # 将音频数据展平为一维数组 111 | 112 | # 更新前导缓冲区 113 | preroll_buffer = np.concatenate([preroll_buffer, samples]) 114 | if len(preroll_buffer) > preroll_samples: 115 | preroll_buffer = preroll_buffer[-preroll_samples:] 116 | 117 | buffer = np.concatenate([buffer, samples]) # 将新读取的音频数据添加到缓冲区 118 | 119 | while len(buffer) > window_size: 120 | start_time = time.time() 121 | vad.accept_waveform(buffer[:window_size]) # 向 VAD 模型输入窗口大小的音频数据 122 | buffer = buffer[window_size:] 123 | 124 | while not vad.empty(): 125 | end_time = time.time() 126 | detection_time = end_time - start_time 127 | segment = vad.front 128 | 129 | if len(segment.samples) < 0.5 * g_sample_rate: 130 | vad.pop() 131 | continue 132 | 133 | print(f"检测到语音活动, VAD检测耗时: {detection_time:.6f} s") 134 | 135 | # 合并前导音频和检测到的语音段 136 | full_segment = np.concatenate([preroll_buffer, np.array(segment.samples)]) 137 | segment_count += 1 138 | file_name = f"{audio_save_path}/speech_segment_{segment_count}.wav" 139 | save_audio_to_file(np.array(full_segment), file_name) 140 | 141 | # 将ASR推理任务提交到线程池中进行异步处理 142 | executor.submit(asr_inference_task, file_name, segment_count) 143 | 144 | vad.pop() 145 | 146 | 147 | if __name__ == "__main__": 148 | try: 149 | start_recording_with_vad(vad_model_path, window_size) 150 | except KeyboardInterrupt: 151 | print("加,马达捏~") 152 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | 2 | import time 3 | import torch 4 | from torch import nn 5 | import torch.nn.functional as F 6 | from typing import Iterable, Optional 7 | 8 | from funasr.register import tables 9 | from funasr.models.ctc.ctc import CTC 10 | from funasr.utils.datadir_writer import DatadirWriter 11 | from funasr.models.paraformer.search import Hypothesis 12 | from funasr.train_utils.device_funcs import force_gatherable 13 | from funasr.losses.label_smoothing_loss import LabelSmoothingLoss 14 | from funasr.metrics.compute_acc import compute_accuracy, th_accuracy 15 | from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank 16 | 17 | 18 | class SinusoidalPositionEncoder(torch.nn.Module): 19 | """ """ 20 | 21 | def __int__(self, d_model=80, dropout_rate=0.1): 22 | pass 23 | 24 | def encode( 25 | self, positions: torch.Tensor = None, depth: int = None, dtype: torch.dtype = torch.float32 26 | ): 27 | batch_size = positions.size(0) 28 | positions = positions.type(dtype) 29 | device = positions.device 30 | log_timescale_increment = torch.log(torch.tensor([10000], dtype=dtype, 
device=device)) / ( 31 | depth / 2 - 1 32 | ) 33 | inv_timescales = torch.exp( 34 | torch.arange(depth / 2, device=device).type(dtype) * (-log_timescale_increment) 35 | ) 36 | inv_timescales = torch.reshape(inv_timescales, [batch_size, -1]) 37 | scaled_time = torch.reshape(positions, [1, -1, 1]) * torch.reshape( 38 | inv_timescales, [1, 1, -1] 39 | ) 40 | encoding = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=2) 41 | return encoding.type(dtype) 42 | 43 | def forward(self, x): 44 | batch_size, timesteps, input_dim = x.size() 45 | positions = torch.arange(1, timesteps + 1, device=x.device)[None, :] 46 | position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device) 47 | 48 | return x + position_encoding 49 | 50 | 51 | class PositionwiseFeedForward(torch.nn.Module): 52 | """Positionwise feed forward layer. 53 | 54 | Args: 55 | idim (int): Input dimenstion. 56 | hidden_units (int): The number of hidden units. 57 | dropout_rate (float): Dropout rate. 58 | 59 | """ 60 | 61 | def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()): 62 | """Construct an PositionwiseFeedForward object.""" 63 | super(PositionwiseFeedForward, self).__init__() 64 | self.w_1 = torch.nn.Linear(idim, hidden_units) 65 | self.w_2 = torch.nn.Linear(hidden_units, idim) 66 | self.dropout = torch.nn.Dropout(dropout_rate) 67 | self.activation = activation 68 | 69 | def forward(self, x): 70 | """Forward function.""" 71 | return self.w_2(self.dropout(self.activation(self.w_1(x)))) 72 | 73 | 74 | class MultiHeadedAttentionSANM(nn.Module): 75 | """Multi-Head Attention layer. 76 | 77 | Args: 78 | n_head (int): The number of heads. 79 | n_feat (int): The number of features. 80 | dropout_rate (float): Dropout rate. 81 | 82 | """ 83 | 84 | def __init__( 85 | self, 86 | n_head, 87 | in_feat, 88 | n_feat, 89 | dropout_rate, 90 | kernel_size, 91 | sanm_shfit=0, 92 | lora_list=None, 93 | lora_rank=8, 94 | lora_alpha=16, 95 | lora_dropout=0.1, 96 | ): 97 | """Construct an MultiHeadedAttention object.""" 98 | super().__init__() 99 | assert n_feat % n_head == 0 100 | # We assume d_v always equals d_k 101 | self.d_k = n_feat // n_head 102 | self.h = n_head 103 | # self.linear_q = nn.Linear(n_feat, n_feat) 104 | # self.linear_k = nn.Linear(n_feat, n_feat) 105 | # self.linear_v = nn.Linear(n_feat, n_feat) 106 | 107 | self.linear_out = nn.Linear(n_feat, n_feat) 108 | self.linear_q_k_v = nn.Linear(in_feat, n_feat * 3) 109 | self.attn = None 110 | self.dropout = nn.Dropout(p=dropout_rate) 111 | 112 | self.fsmn_block = nn.Conv1d( 113 | n_feat, n_feat, kernel_size, stride=1, padding=0, groups=n_feat, bias=False 114 | ) 115 | # padding 116 | left_padding = (kernel_size - 1) // 2 117 | if sanm_shfit > 0: 118 | left_padding = left_padding + sanm_shfit 119 | right_padding = kernel_size - 1 - left_padding 120 | self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0) 121 | 122 | def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None): 123 | b, t, d = inputs.size() 124 | if mask is not None: 125 | mask = torch.reshape(mask, (b, -1, 1)) 126 | if mask_shfit_chunk is not None: 127 | mask = mask * mask_shfit_chunk 128 | inputs = inputs * mask 129 | 130 | x = inputs.transpose(1, 2) 131 | x = self.pad_fn(x) 132 | x = self.fsmn_block(x) 133 | x = x.transpose(1, 2) 134 | x += inputs 135 | x = self.dropout(x) 136 | if mask is not None: 137 | x = x * mask 138 | return x 139 | 140 | def forward_qkv(self, x): 141 | """Transform query, key and value. 
142 | 143 | Args: 144 | query (torch.Tensor): Query tensor (#batch, time1, size). 145 | key (torch.Tensor): Key tensor (#batch, time2, size). 146 | value (torch.Tensor): Value tensor (#batch, time2, size). 147 | 148 | Returns: 149 | torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k). 150 | torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k). 151 | torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k). 152 | 153 | """ 154 | b, t, d = x.size() 155 | q_k_v = self.linear_q_k_v(x) 156 | q, k, v = torch.split(q_k_v, int(self.h * self.d_k), dim=-1) 157 | q_h = torch.reshape(q, (b, t, self.h, self.d_k)).transpose( 158 | 1, 2 159 | ) # (batch, head, time1, d_k) 160 | k_h = torch.reshape(k, (b, t, self.h, self.d_k)).transpose( 161 | 1, 2 162 | ) # (batch, head, time2, d_k) 163 | v_h = torch.reshape(v, (b, t, self.h, self.d_k)).transpose( 164 | 1, 2 165 | ) # (batch, head, time2, d_k) 166 | 167 | return q_h, k_h, v_h, v 168 | 169 | def forward_attention(self, value, scores, mask, mask_att_chunk_encoder=None): 170 | """Compute attention context vector. 171 | 172 | Args: 173 | value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k). 174 | scores (torch.Tensor): Attention score (#batch, n_head, time1, time2). 175 | mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2). 176 | 177 | Returns: 178 | torch.Tensor: Transformed value (#batch, time1, d_model) 179 | weighted by the attention score (#batch, time1, time2). 180 | 181 | """ 182 | n_batch = value.size(0) 183 | if mask is not None: 184 | if mask_att_chunk_encoder is not None: 185 | mask = mask * mask_att_chunk_encoder 186 | 187 | mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2) 188 | 189 | min_value = -float( 190 | "inf" 191 | ) # float(numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min) 192 | scores = scores.masked_fill(mask, min_value) 193 | self.attn = torch.softmax(scores, dim=-1).masked_fill( 194 | mask, 0.0 195 | ) # (batch, head, time1, time2) 196 | else: 197 | self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2) 198 | 199 | p_attn = self.dropout(self.attn) 200 | x = torch.matmul(p_attn, value) # (batch, head, time1, d_k) 201 | x = ( 202 | x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k) 203 | ) # (batch, time1, d_model) 204 | 205 | return self.linear_out(x) # (batch, time1, d_model) 206 | 207 | def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None): 208 | """Compute scaled dot product attention. 209 | 210 | Args: 211 | query (torch.Tensor): Query tensor (#batch, time1, size). 212 | key (torch.Tensor): Key tensor (#batch, time2, size). 213 | value (torch.Tensor): Value tensor (#batch, time2, size). 214 | mask (torch.Tensor): Mask tensor (#batch, 1, time2) or 215 | (#batch, time1, time2). 216 | 217 | Returns: 218 | torch.Tensor: Output tensor (#batch, time1, d_model). 219 | 220 | """ 221 | q_h, k_h, v_h, v = self.forward_qkv(x) 222 | fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk) 223 | q_h = q_h * self.d_k ** (-0.5) 224 | scores = torch.matmul(q_h, k_h.transpose(-2, -1)) 225 | att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder) 226 | return att_outs + fsmn_memory 227 | 228 | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): 229 | """Compute scaled dot product attention. 230 | 231 | Args: 232 | query (torch.Tensor): Query tensor (#batch, time1, size). 233 | key (torch.Tensor): Key tensor (#batch, time2, size). 
234 | value (torch.Tensor): Value tensor (#batch, time2, size). 235 | mask (torch.Tensor): Mask tensor (#batch, 1, time2) or 236 | (#batch, time1, time2). 237 | 238 | Returns: 239 | torch.Tensor: Output tensor (#batch, time1, d_model). 240 | 241 | """ 242 | q_h, k_h, v_h, v = self.forward_qkv(x) 243 | if chunk_size is not None and look_back > 0 or look_back == -1: 244 | if cache is not None: 245 | k_h_stride = k_h[:, :, : -(chunk_size[2]), :] 246 | v_h_stride = v_h[:, :, : -(chunk_size[2]), :] 247 | k_h = torch.cat((cache["k"], k_h), dim=2) 248 | v_h = torch.cat((cache["v"], v_h), dim=2) 249 | 250 | cache["k"] = torch.cat((cache["k"], k_h_stride), dim=2) 251 | cache["v"] = torch.cat((cache["v"], v_h_stride), dim=2) 252 | if look_back != -1: 253 | cache["k"] = cache["k"][:, :, -(look_back * chunk_size[1]) :, :] 254 | cache["v"] = cache["v"][:, :, -(look_back * chunk_size[1]) :, :] 255 | else: 256 | cache_tmp = { 257 | "k": k_h[:, :, : -(chunk_size[2]), :], 258 | "v": v_h[:, :, : -(chunk_size[2]), :], 259 | } 260 | cache = cache_tmp 261 | fsmn_memory = self.forward_fsmn(v, None) 262 | q_h = q_h * self.d_k ** (-0.5) 263 | scores = torch.matmul(q_h, k_h.transpose(-2, -1)) 264 | att_outs = self.forward_attention(v_h, scores, None) 265 | return att_outs + fsmn_memory, cache 266 | 267 | 268 | class LayerNorm(nn.LayerNorm): 269 | def __init__(self, *args, **kwargs): 270 | super().__init__(*args, **kwargs) 271 | 272 | def forward(self, input): 273 | output = F.layer_norm( 274 | input.float(), 275 | self.normalized_shape, 276 | self.weight.float() if self.weight is not None else None, 277 | self.bias.float() if self.bias is not None else None, 278 | self.eps, 279 | ) 280 | return output.type_as(input) 281 | 282 | 283 | def sequence_mask(lengths, maxlen=None, dtype=torch.float32, device=None): 284 | if maxlen is None: 285 | maxlen = lengths.max() 286 | row_vector = torch.arange(0, maxlen, 1).to(lengths.device) 287 | matrix = torch.unsqueeze(lengths, dim=-1) 288 | mask = row_vector < matrix 289 | mask = mask.detach() 290 | 291 | return mask.type(dtype).to(device) if device is not None else mask.type(dtype) 292 | 293 | 294 | class EncoderLayerSANM(nn.Module): 295 | def __init__( 296 | self, 297 | in_size, 298 | size, 299 | self_attn, 300 | feed_forward, 301 | dropout_rate, 302 | normalize_before=True, 303 | concat_after=False, 304 | stochastic_depth_rate=0.0, 305 | ): 306 | """Construct an EncoderLayer object.""" 307 | super(EncoderLayerSANM, self).__init__() 308 | self.self_attn = self_attn 309 | self.feed_forward = feed_forward 310 | self.norm1 = LayerNorm(in_size) 311 | self.norm2 = LayerNorm(size) 312 | self.dropout = nn.Dropout(dropout_rate) 313 | self.in_size = in_size 314 | self.size = size 315 | self.normalize_before = normalize_before 316 | self.concat_after = concat_after 317 | if self.concat_after: 318 | self.concat_linear = nn.Linear(size + size, size) 319 | self.stochastic_depth_rate = stochastic_depth_rate 320 | self.dropout_rate = dropout_rate 321 | 322 | def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None): 323 | """Compute encoded features. 324 | 325 | Args: 326 | x_input (torch.Tensor): Input tensor (#batch, time, size). 327 | mask (torch.Tensor): Mask tensor for the input (#batch, time). 328 | cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size). 329 | 330 | Returns: 331 | torch.Tensor: Output tensor (#batch, time, size). 332 | torch.Tensor: Mask tensor (#batch, time). 
333 | 334 | """ 335 | skip_layer = False 336 | # with stochastic depth, residual connection `x + f(x)` becomes 337 | # `x <- x + 1 / (1 - p) * f(x)` at training time. 338 | stoch_layer_coeff = 1.0 339 | if self.training and self.stochastic_depth_rate > 0: 340 | skip_layer = torch.rand(1).item() < self.stochastic_depth_rate 341 | stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate) 342 | 343 | if skip_layer: 344 | if cache is not None: 345 | x = torch.cat([cache, x], dim=1) 346 | return x, mask 347 | 348 | residual = x 349 | if self.normalize_before: 350 | x = self.norm1(x) 351 | 352 | if self.concat_after: 353 | x_concat = torch.cat( 354 | ( 355 | x, 356 | self.self_attn( 357 | x, 358 | mask, 359 | mask_shfit_chunk=mask_shfit_chunk, 360 | mask_att_chunk_encoder=mask_att_chunk_encoder, 361 | ), 362 | ), 363 | dim=-1, 364 | ) 365 | if self.in_size == self.size: 366 | x = residual + stoch_layer_coeff * self.concat_linear(x_concat) 367 | else: 368 | x = stoch_layer_coeff * self.concat_linear(x_concat) 369 | else: 370 | if self.in_size == self.size: 371 | x = residual + stoch_layer_coeff * self.dropout( 372 | self.self_attn( 373 | x, 374 | mask, 375 | mask_shfit_chunk=mask_shfit_chunk, 376 | mask_att_chunk_encoder=mask_att_chunk_encoder, 377 | ) 378 | ) 379 | else: 380 | x = stoch_layer_coeff * self.dropout( 381 | self.self_attn( 382 | x, 383 | mask, 384 | mask_shfit_chunk=mask_shfit_chunk, 385 | mask_att_chunk_encoder=mask_att_chunk_encoder, 386 | ) 387 | ) 388 | if not self.normalize_before: 389 | x = self.norm1(x) 390 | 391 | residual = x 392 | if self.normalize_before: 393 | x = self.norm2(x) 394 | x = residual + stoch_layer_coeff * self.dropout(self.feed_forward(x)) 395 | if not self.normalize_before: 396 | x = self.norm2(x) 397 | 398 | return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder 399 | 400 | def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0): 401 | """Compute encoded features. 402 | 403 | Args: 404 | x_input (torch.Tensor): Input tensor (#batch, time, size). 405 | mask (torch.Tensor): Mask tensor for the input (#batch, time). 406 | cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size). 407 | 408 | Returns: 409 | torch.Tensor: Output tensor (#batch, time, size). 410 | torch.Tensor: Mask tensor (#batch, time). 
411 | 412 | """ 413 | 414 | residual = x 415 | if self.normalize_before: 416 | x = self.norm1(x) 417 | 418 | if self.in_size == self.size: 419 | attn, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back) 420 | x = residual + attn 421 | else: 422 | x, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back) 423 | 424 | if not self.normalize_before: 425 | x = self.norm1(x) 426 | 427 | residual = x 428 | if self.normalize_before: 429 | x = self.norm2(x) 430 | x = residual + self.feed_forward(x) 431 | if not self.normalize_before: 432 | x = self.norm2(x) 433 | 434 | return x, cache 435 | 436 | 437 | @tables.register("encoder_classes", "SenseVoiceEncoderSmall") 438 | class SenseVoiceEncoderSmall(nn.Module): 439 | """ 440 | Author: Speech Lab of DAMO Academy, Alibaba Group 441 | SCAMA: Streaming chunk-aware multihead attention for online end-to-end speech recognition 442 | https://arxiv.org/abs/2006.01713 443 | """ 444 | 445 | def __init__( 446 | self, 447 | input_size: int, 448 | output_size: int = 256, 449 | attention_heads: int = 4, 450 | linear_units: int = 2048, 451 | num_blocks: int = 6, 452 | tp_blocks: int = 0, 453 | dropout_rate: float = 0.1, 454 | positional_dropout_rate: float = 0.1, 455 | attention_dropout_rate: float = 0.0, 456 | stochastic_depth_rate: float = 0.0, 457 | input_layer: Optional[str] = "conv2d", 458 | pos_enc_class=SinusoidalPositionEncoder, 459 | normalize_before: bool = True, 460 | concat_after: bool = False, 461 | positionwise_layer_type: str = "linear", 462 | positionwise_conv_kernel_size: int = 1, 463 | padding_idx: int = -1, 464 | kernel_size: int = 11, 465 | sanm_shfit: int = 0, 466 | selfattention_layer_type: str = "sanm", 467 | **kwargs, 468 | ): 469 | super().__init__() 470 | self._output_size = output_size 471 | 472 | self.embed = SinusoidalPositionEncoder() 473 | 474 | self.normalize_before = normalize_before 475 | 476 | positionwise_layer = PositionwiseFeedForward 477 | positionwise_layer_args = ( 478 | output_size, 479 | linear_units, 480 | dropout_rate, 481 | ) 482 | 483 | encoder_selfattn_layer = MultiHeadedAttentionSANM 484 | encoder_selfattn_layer_args0 = ( 485 | attention_heads, 486 | input_size, 487 | output_size, 488 | attention_dropout_rate, 489 | kernel_size, 490 | sanm_shfit, 491 | ) 492 | encoder_selfattn_layer_args = ( 493 | attention_heads, 494 | output_size, 495 | output_size, 496 | attention_dropout_rate, 497 | kernel_size, 498 | sanm_shfit, 499 | ) 500 | 501 | self.encoders0 = nn.ModuleList( 502 | [ 503 | EncoderLayerSANM( 504 | input_size, 505 | output_size, 506 | encoder_selfattn_layer(*encoder_selfattn_layer_args0), 507 | positionwise_layer(*positionwise_layer_args), 508 | dropout_rate, 509 | ) 510 | for i in range(1) 511 | ] 512 | ) 513 | self.encoders = nn.ModuleList( 514 | [ 515 | EncoderLayerSANM( 516 | output_size, 517 | output_size, 518 | encoder_selfattn_layer(*encoder_selfattn_layer_args), 519 | positionwise_layer(*positionwise_layer_args), 520 | dropout_rate, 521 | ) 522 | for i in range(num_blocks - 1) 523 | ] 524 | ) 525 | 526 | self.tp_encoders = nn.ModuleList( 527 | [ 528 | EncoderLayerSANM( 529 | output_size, 530 | output_size, 531 | encoder_selfattn_layer(*encoder_selfattn_layer_args), 532 | positionwise_layer(*positionwise_layer_args), 533 | dropout_rate, 534 | ) 535 | for i in range(tp_blocks) 536 | ] 537 | ) 538 | 539 | self.after_norm = LayerNorm(output_size) 540 | 541 | self.tp_norm = LayerNorm(output_size) 542 | 543 | def output_size(self) -> int: 544 | return self._output_size 545 | 
546 | def forward( 547 | self, 548 | xs_pad: torch.Tensor, 549 | ilens: torch.Tensor, 550 | ): 551 | """Embed positions in tensor.""" 552 | masks = sequence_mask(ilens, device=ilens.device)[:, None, :] 553 | 554 | xs_pad *= self.output_size() ** 0.5 555 | 556 | xs_pad = self.embed(xs_pad) 557 | 558 | # forward encoder1 559 | for layer_idx, encoder_layer in enumerate(self.encoders0): 560 | encoder_outs = encoder_layer(xs_pad, masks) 561 | xs_pad, masks = encoder_outs[0], encoder_outs[1] 562 | 563 | for layer_idx, encoder_layer in enumerate(self.encoders): 564 | encoder_outs = encoder_layer(xs_pad, masks) 565 | xs_pad, masks = encoder_outs[0], encoder_outs[1] 566 | 567 | xs_pad = self.after_norm(xs_pad) 568 | 569 | # forward encoder2 570 | olens = masks.squeeze(1).sum(1).int() 571 | 572 | for layer_idx, encoder_layer in enumerate(self.tp_encoders): 573 | encoder_outs = encoder_layer(xs_pad, masks) 574 | xs_pad, masks = encoder_outs[0], encoder_outs[1] 575 | 576 | xs_pad = self.tp_norm(xs_pad) 577 | return xs_pad, olens 578 | 579 | 580 | @tables.register("model_classes", "SenseVoiceSmall") 581 | class SenseVoiceSmall(nn.Module): 582 | """CTC-attention hybrid Encoder-Decoder model""" 583 | 584 | def __init__( 585 | self, 586 | specaug: str = None, 587 | specaug_conf: dict = None, 588 | normalize: str = None, 589 | normalize_conf: dict = None, 590 | encoder: str = None, 591 | encoder_conf: dict = None, 592 | ctc_conf: dict = None, 593 | input_size: int = 80, 594 | vocab_size: int = -1, 595 | ignore_id: int = -1, 596 | blank_id: int = 0, 597 | sos: int = 1, 598 | eos: int = 2, 599 | length_normalized_loss: bool = False, 600 | **kwargs, 601 | ): 602 | 603 | super().__init__() 604 | 605 | if specaug is not None: 606 | specaug_class = tables.specaug_classes.get(specaug) 607 | specaug = specaug_class(**specaug_conf) 608 | if normalize is not None: 609 | normalize_class = tables.normalize_classes.get(normalize) 610 | normalize = normalize_class(**normalize_conf) 611 | encoder_class = tables.encoder_classes.get(encoder) 612 | encoder = encoder_class(input_size=input_size, **encoder_conf) 613 | encoder_output_size = encoder.output_size() 614 | 615 | if ctc_conf is None: 616 | ctc_conf = {} 617 | ctc = CTC(odim=vocab_size, encoder_output_size=encoder_output_size, **ctc_conf) 618 | 619 | self.blank_id = blank_id 620 | self.sos = sos if sos is not None else vocab_size - 1 621 | self.eos = eos if eos is not None else vocab_size - 1 622 | self.vocab_size = vocab_size 623 | self.ignore_id = ignore_id 624 | self.specaug = specaug 625 | self.normalize = normalize 626 | self.encoder = encoder 627 | self.error_calculator = None 628 | 629 | self.ctc = ctc 630 | 631 | self.length_normalized_loss = length_normalized_loss 632 | self.encoder_output_size = encoder_output_size 633 | 634 | self.lid_dict = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13} 635 | self.lid_int_dict = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13} 636 | self.textnorm_dict = {"withitn": 14, "woitn": 15} 637 | self.textnorm_int_dict = {25016: 14, 25017: 15} 638 | self.embed = torch.nn.Embedding(7 + len(self.lid_dict) + len(self.textnorm_dict), input_size) 639 | self.emo_dict = {"unk": 25009, "happy": 25001, "sad": 25002, "angry": 25003, "neutral": 25004} 640 | 641 | self.criterion_att = LabelSmoothingLoss( 642 | size=self.vocab_size, 643 | padding_idx=self.ignore_id, 644 | smoothing=kwargs.get("lsm_weight", 0.0), 645 | normalize_length=self.length_normalized_loss, 646 | ) 647 | 648 | @staticmethod 649 
| def from_pretrained(model:str=None, **kwargs): 650 | from funasr import AutoModel 651 | model, kwargs = AutoModel.build_model(model=model, trust_remote_code=True, **kwargs) 652 | 653 | return model, kwargs 654 | 655 | def forward( 656 | self, 657 | speech: torch.Tensor, 658 | speech_lengths: torch.Tensor, 659 | text: torch.Tensor, 660 | text_lengths: torch.Tensor, 661 | **kwargs, 662 | ): 663 | """Encoder + Decoder + Calc loss 664 | Args: 665 | speech: (Batch, Length, ...) 666 | speech_lengths: (Batch, ) 667 | text: (Batch, Length) 668 | text_lengths: (Batch,) 669 | """ 670 | # import pdb; 671 | # pdb.set_trace() 672 | if len(text_lengths.size()) > 1: 673 | text_lengths = text_lengths[:, 0] 674 | if len(speech_lengths.size()) > 1: 675 | speech_lengths = speech_lengths[:, 0] 676 | 677 | batch_size = speech.shape[0] 678 | 679 | # 1. Encoder 680 | encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, text) 681 | 682 | loss_ctc, cer_ctc = None, None 683 | loss_rich, acc_rich = None, None 684 | stats = dict() 685 | 686 | loss_ctc, cer_ctc = self._calc_ctc_loss( 687 | encoder_out[:, 4:, :], encoder_out_lens - 4, text[:, 4:], text_lengths - 4 688 | ) 689 | 690 | loss_rich, acc_rich = self._calc_rich_ce_loss( 691 | encoder_out[:, :4, :], text[:, :4] 692 | ) 693 | 694 | loss = loss_ctc + loss_rich 695 | # Collect total loss stats 696 | stats["loss_ctc"] = torch.clone(loss_ctc.detach()) if loss_ctc is not None else None 697 | stats["loss_rich"] = torch.clone(loss_rich.detach()) if loss_rich is not None else None 698 | stats["loss"] = torch.clone(loss.detach()) if loss is not None else None 699 | stats["acc_rich"] = acc_rich 700 | 701 | # force_gatherable: to-device and to-tensor if scalar for DataParallel 702 | if self.length_normalized_loss: 703 | batch_size = int((text_lengths + 1).sum()) 704 | loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device) 705 | return loss, stats, weight 706 | 707 | def encode( 708 | self, 709 | speech: torch.Tensor, 710 | speech_lengths: torch.Tensor, 711 | text: torch.Tensor, 712 | **kwargs, 713 | ): 714 | """Frontend + Encoder. Note that this method is used by asr_inference.py 715 | Args: 716 | speech: (Batch, Length, ...) 717 | speech_lengths: (Batch, ) 718 | ind: int 719 | """ 720 | 721 | # Data augmentation 722 | if self.specaug is not None and self.training: 723 | speech, speech_lengths = self.specaug(speech, speech_lengths) 724 | 725 | # Normalization for feature: e.g. 
Global-CMVN, Utterance-CMVN 726 | if self.normalize is not None: 727 | speech, speech_lengths = self.normalize(speech, speech_lengths) 728 | 729 | 730 | lids = torch.LongTensor([[self.lid_int_dict[int(lid)] if torch.rand(1) > 0.2 and int(lid) in self.lid_int_dict else 0 ] for lid in text[:, 0]]).to(speech.device) 731 | language_query = self.embed(lids) 732 | 733 | styles = torch.LongTensor([[self.textnorm_int_dict[int(style)]] for style in text[:, 3]]).to(speech.device) 734 | style_query = self.embed(styles) 735 | speech = torch.cat((style_query, speech), dim=1) 736 | speech_lengths += 1 737 | 738 | event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(speech.size(0), 1, 1) 739 | input_query = torch.cat((language_query, event_emo_query), dim=1) 740 | speech = torch.cat((input_query, speech), dim=1) 741 | speech_lengths += 3 742 | 743 | encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths) 744 | 745 | return encoder_out, encoder_out_lens 746 | 747 | def _calc_ctc_loss( 748 | self, 749 | encoder_out: torch.Tensor, 750 | encoder_out_lens: torch.Tensor, 751 | ys_pad: torch.Tensor, 752 | ys_pad_lens: torch.Tensor, 753 | ): 754 | # Calc CTC loss 755 | loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens) 756 | 757 | # Calc CER using CTC 758 | cer_ctc = None 759 | if not self.training and self.error_calculator is not None: 760 | ys_hat = self.ctc.argmax(encoder_out).data 761 | cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True) 762 | return loss_ctc, cer_ctc 763 | 764 | def _calc_rich_ce_loss( 765 | self, 766 | encoder_out: torch.Tensor, 767 | ys_pad: torch.Tensor, 768 | ): 769 | decoder_out = self.ctc.ctc_lo(encoder_out) 770 | # 2. Compute attention loss 771 | loss_rich = self.criterion_att(decoder_out, ys_pad.contiguous()) 772 | acc_rich = th_accuracy( 773 | decoder_out.view(-1, self.vocab_size), 774 | ys_pad.contiguous(), 775 | ignore_label=self.ignore_id, 776 | ) 777 | 778 | return loss_rich, acc_rich 779 | 780 | 781 | def inference( 782 | self, 783 | data_in, 784 | data_lengths=None, 785 | key: list = ["wav_file_tmp_name"], 786 | tokenizer=None, 787 | frontend=None, 788 | **kwargs, 789 | ): 790 | 791 | 792 | meta_data = {} 793 | if ( 794 | isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank" 795 | ): # fbank 796 | speech, speech_lengths = data_in, data_lengths 797 | if len(speech.shape) < 3: 798 | speech = speech[None, :, :] 799 | if speech_lengths is None: 800 | speech_lengths = speech.shape[1] 801 | else: 802 | # extract fbank feats 803 | time1 = time.perf_counter() 804 | audio_sample_list = load_audio_text_image_video( 805 | data_in, 806 | fs=frontend.fs, 807 | audio_fs=kwargs.get("fs", 16000), 808 | data_type=kwargs.get("data_type", "sound"), 809 | tokenizer=tokenizer, 810 | ) 811 | time2 = time.perf_counter() 812 | meta_data["load_data"] = f"{time2 - time1:0.3f}" 813 | speech, speech_lengths = extract_fbank( 814 | audio_sample_list, data_type=kwargs.get("data_type", "sound"), frontend=frontend 815 | ) 816 | time3 = time.perf_counter() 817 | meta_data["extract_feat"] = f"{time3 - time2:0.3f}" 818 | meta_data["batch_data_time"] = ( 819 | speech_lengths.sum().item() * frontend.frame_shift * frontend.lfr_n / 1000 820 | ) 821 | 822 | speech = speech.to(device=kwargs["device"]) 823 | speech_lengths = speech_lengths.to(device=kwargs["device"]) 824 | 825 | language = kwargs.get("language", "auto") 826 | language_query = self.embed( 827 | torch.LongTensor( 828 | 
[[self.lid_dict[language] if language in self.lid_dict else 0]] 829 | ).to(speech.device) 830 | ).repeat(speech.size(0), 1, 1) 831 | 832 | use_itn = kwargs.get("use_itn", False) 833 | textnorm = kwargs.get("text_norm", None) 834 | if textnorm is None: 835 | textnorm = "withitn" if use_itn else "woitn" 836 | textnorm_query = self.embed( 837 | torch.LongTensor([[self.textnorm_dict[textnorm]]]).to(speech.device) 838 | ).repeat(speech.size(0), 1, 1) 839 | speech = torch.cat((textnorm_query, speech), dim=1) 840 | speech_lengths += 1 841 | 842 | event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat( 843 | speech.size(0), 1, 1 844 | ) 845 | input_query = torch.cat((language_query, event_emo_query), dim=1) 846 | speech = torch.cat((input_query, speech), dim=1) 847 | speech_lengths += 3 848 | 849 | # Encoder 850 | encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths) 851 | if isinstance(encoder_out, tuple): 852 | encoder_out = encoder_out[0] 853 | 854 | # c. Passed the encoder result and the beam search 855 | ctc_logits = self.ctc.log_softmax(encoder_out) 856 | if kwargs.get("ban_emo_unk", False): 857 | ctc_logits[:, :, self.emo_dict["unk"]] = -float("inf") 858 | 859 | results = [] 860 | b, n, d = encoder_out.size() 861 | if isinstance(key[0], (list, tuple)): 862 | key = key[0] 863 | if len(key) < b: 864 | key = key * b 865 | for i in range(b): 866 | x = ctc_logits[i, : encoder_out_lens[i].item(), :] 867 | yseq = x.argmax(dim=-1) 868 | yseq = torch.unique_consecutive(yseq, dim=-1) 869 | 870 | ibest_writer = None 871 | if kwargs.get("output_dir") is not None: 872 | if not hasattr(self, "writer"): 873 | self.writer = DatadirWriter(kwargs.get("output_dir")) 874 | ibest_writer = self.writer[f"1best_recog"] 875 | 876 | mask = yseq != self.blank_id 877 | token_int = yseq[mask].tolist() 878 | 879 | # Change integer-ids to tokens 880 | text = tokenizer.decode(token_int) 881 | 882 | result_i = {"key": key[i], "text": text} 883 | results.append(result_i) 884 | 885 | if ibest_writer is not None: 886 | ibest_writer["text"][key[i]] = text 887 | 888 | return results, meta_data 889 | 890 | def export(self, **kwargs): 891 | from export_meta import export_rebuild_model 892 | 893 | if "max_seq_len" not in kwargs: 894 | kwargs["max_seq_len"] = 512 895 | models = export_rebuild_model(model=self, **kwargs) 896 | return models 897 | -------------------------------------------------------------------------------- /models/vad_onnx/silero_vad.onnx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/models/vad_onnx/silero_vad.onnx -------------------------------------------------------------------------------- /pics/运行截图.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/pics/运行截图.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | modelscope 2 | huggingface 3 | huggingface_hub 4 | funasr>=1.1.3 5 | numpy<=1.26.4 6 | gradio 7 | pyaudio 8 | addict 9 | datasets 10 | simplejson 11 | sortedcontainers 12 | sounddevice 13 | soundfile 14 | sherpa-onnx 15 | scipy -------------------------------------------------------------------------------- /speak_log/log.txt: 
--------------------------------------------------------------------------------
1 | 窗外的麻雀在电线杆上多嘴
2 | 你说这一句很有夏天的感觉
3 | 
--------------------------------------------------------------------------------
/speakers/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/speakers/.gitkeep
--------------------------------------------------------------------------------
/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RemSynch/SenseVoice-Real-Time/71e14bac9ada613a3e64061c7bfbab73efc9dcc7/speech_campplus_sv_zh_en_16k-common_advanced/.gitkeep
--------------------------------------------------------------------------------