├── LICENSE
├── README.md
├── README_en.md
├── README_zh.md
├── gpt_sovits_api.py
├── main_ollama.py
└── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 HaxxorCialtion
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # 智能语音助手系统
  2 | 
  3 | ## This project contains two README files:
  4 | - [English version](README_en.md)
  5 | - [中文版](README_zh.md)
  6 | 
  7 | ## 项目概述
  8 | 
  9 | 这是一个基于关键词检测、语音识别、语音合成和对话生成的智能语音助手系统。该系统能够通过特定的唤醒词（如"hey bro"）启动与用户的语音对话，并利用先进的自然语言处理技术提供智能回复。
 10 | 
 11 | ## 主要功能
 12 | 
 13 | 1. **关键词检测**：使用Porcupine实时监听唤醒词，启动对话。
 14 | 2. **语音录制和检测**：采用WebRTC VAD进行语音活动检测，录制有效语音片段。
 15 | 3. **语音识别（ASR）**：使用SenseVoice Small模型将录制的语音转换为文本。
 16 | 4. **对话生成**：调用Ollama API（兼容OpenAI API），根据上下文生成助手的文本回复。
 17 | 5. **语音合成（TTS）**：将助手的回复通过语音合成输出，模拟人声对话。
 18 | 6. **对话历史保存**：定期将对话内容保存为JSON文件，便于后续分析。
 19 | 
 20 | ## 技术栈
 21 | 
 22 | - Python
 23 | - Porcupine（关键词检测）
 24 | - WebRTC VAD（语音活动检测）
 25 | - SenseVoice Small（语音识别）
 26 | - Ollama API（对话生成）
 27 | - GPT-SoVITS（语音合成）
 28 | - PyAudio, NumPy, SciPy（音频处理）
 29 | 
 30 | ### TTS部分
 31 | 
 32 | - 本项目基于GPT-Sovits-v2，感谢开源社区工作者的贡献！
 33 | 
 34 | #### GPT-SoVITS TTS 项目
 35 | 
 36 | 这是一个基于GPT-SoVITS API的文本到语音（TTS）项目。该项目允许用户根据不同的情感生成和播放语音，使用预定义的参考音频来影响输出的语音风格。
 37 | 
 38 | ##### 功能特点
 39 | 
 40 | - 支持多种情感的语音生成（高兴、抑郁、激动、平静、纠结）
 41 | - 使用参考音频来控制语音风格
 42 | - 实时生成并播放WAV格式的音频文件
 43 | - 可自定义文本输入和输出文件名
 44 | - 提供TTS处理时间统计
 45 | - 可以添加一个情感识别模型来决定参考音频，从而控制音频合成情感(to do)
 46 | 
 47 | ## 使用指南
 48 | 
 49 | 1. 克隆仓库：
 50 | - git clone https://github.com/HaxxorCialtion/ASR_LLM_TTS_py.git
 51 | 
 52 | - cd intelligent-voice-assistant
 53 | 
 54 | 
 55 | 2. 安装依赖：
 56 | - pip install -r requirements.txt
 57 | 
 58 | 
 59 | 3. 准备必要的API密钥和模型：
 60 | - 获取Porcupine API密钥
 61 | - 下载SenseVoice Small模型文件
 62 | - 确保Ollama API服务已经运行
 63 | - 开启GPT-Sovits API服务
 64 | 
 65 | 4. 配置系统：
 66 | - 在脚本中填入Porcupine API密钥
 67 | - 设置ASR模型路径
 68 | - 配置Ollama API端点（默认为本地）
 69 | - 配置GPT-sovits 模型和参考音频
 70 | 
 71 | ## 使用方法
 72 | 
 73 | 1. 运行主脚本：
 74 | python main.py
 75 | 
 76 | 2. 等待系统提示"Listening for wake word..."
 77 | 
 78 | 3. 说出唤醒词（默认为"hey bro"）开始对话
 79 | 
 80 | 4. 与语音助手进行自然语言交互
 81 | 
 82 | 5. 超时则需再次触发
 83 | 
 84 | 6. 结束本轮对话即自动保存对话记录 
 85 | 
 86 | ## 主要特性
 87 | 
 88 | - 实时语音交互
 89 | - 智能对话生成
 90 | - 自然语音合成
 91 | - 长时间无语音自动休眠
 92 | - 对话历史记录
 93 | 
 94 | ## 自定义设置
 95 | 
 96 | - 修改`settings`变量来自定义助手的角色和背景
 97 | - 调整`max_silence_duration`和`min_speech_duration`等参数来优化语音检测
 98 | - 更换唤醒词和对应的模型文件
 99 | 
100 | ## 许可证
101 | 
102 | 本项目采用 MIT 许可证 - 查看 [LICENSE](LICENSE) 文件了解详情
103 | 
104 | ## 联系方式
105 | 
106 | 项目维护者：HaxxorCialtion - cialtion@outlook.com
107 | Bilibili视频地址：https://www.bilibili.com/video/BV1pftreQEbu
108 | 
109 | ## 致谢
110 | 
111 | - [Porcupine](https://github.com/Picovoice/porcupine) - 用于唤醒词检测
112 | - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) - 提供ASR模型
113 | - [Ollama](https://github.com/ollama/ollama) - 本地大语言模型服务
114 | - [LLM](https://github.com/QwenLM/Qwen2.5) - LLLM服务
115 | - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) - 用于语音合成


--------------------------------------------------------------------------------
/README_en.md:
--------------------------------------------------------------------------------
  1 | # Intelligent Voice Assistant System
  2 | 
  3 | ## Project Overview
  4 | 
  5 | This is an intelligent voice assistant system based on keyword detection, speech recognition, speech synthesis, and dialogue generation. The system can initiate a voice conversation with users through specific wake words (such as "hey bro") and provide intelligent replies using advanced natural language processing technology.
  6 | 
  7 | ## Main Features
  8 | 
  9 | - **Keyword Detection**: Real-time monitoring of wake words using Porcupine to initiate conversation.
 10 | - **Voice Recording and Detection**: Voice activity detection using WebRTC VAD to record valid voice segments.
 11 | - **Speech Recognition (ASR)**: Converting recorded voice to text using the SenseVoice Small model.
 12 | - **Dialogue Generation**: Generating the assistant's text replies based on context by calling the Ollama API (compatible with OpenAI API).
 13 | - **Speech Synthesis (TTS)**: Outputting the assistant's replies through speech synthesis to simulate human voice conversation.
 14 | - **Dialogue History Saving**: Regularly saving dialogue content as JSON files for subsequent analysis.
 15 | 
 16 | ## Technology Stack
 17 | 
 18 | - Python
 19 | - Porcupine (Keyword Detection)
 20 | - WebRTC VAD (Voice Activity Detection)
 21 | - SenseVoice Small (Speech Recognition)
 22 | - Ollama API (Dialogue Generation)
 23 | - GPT-SoVITS (Speech Synthesis)
 24 | - PyAudio, NumPy, SciPy (Audio Processing)
 25 | 
 26 | ## TTS Part
 27 | 
 28 | This project is based on the GPT-SoVITS API, and we appreciate the contributions of the open-source community workers!
 29 | 
 30 | ### GPT-SoVITS TTS Project
 31 | 
 32 | This is a Text-to-Speech (TTS) project based on the GPT-SoVITS API. The project allows users to generate and play voice with different emotions, using predefined reference audio to influence the output voice style.
 33 | 
 34 | ### Features
 35 | 
 36 | - Supports voice generation with various emotions (happy, depressed, excited, calm, confused).
 37 | - Uses reference audio to control voice style.
 38 | - Real-time generation and playback of WAV format audio files.
 39 | - Customizable text input and output filenames.
 40 | - Provides TTS processing time statistics.
 41 | - Can add an emotion recognition model to decide the reference audio, thus controlling the audio synthesis emotion (to do).
 42 | 
 43 | ## Usage Guide
 44 | 
 45 | 1.Clone the repository:
 46 | - git clone https://github.com/HaxxorCialtion/ASR_LLM_TTS_py.git
 47 | - cd intelligent-voice-assistant
 48 | 
 49 | 2. Install dependencies:
 50 | 
 51 | - pip install -r requirements.txt
 52 | 
 53 | 3. Prepare necessary API keys and models:
 54 | - Obtain Porcupine API key.
 55 | - Download SenseVoice Small model files.
 56 | - Ensure Ollama API service is running.
 57 | - Start GPT-Sovits API service.
 58 |         
 59 | 4. Configure the system:
 60 | - Fill in the Porcupine API key in the script.
 61 | - Set the ASR model path.
 62 | - Configure Ollama API endpoint (default is local).
 63 | - Configure GPT-sovits model and reference audio.
 64 | 
 65 | 5. Run the main script:
 66 | 
 67 | - python main_ollama.py
 68 | 
 69 | - Wait for the system to prompt "Listening for wake word..."
 70 | - Speak the wake word (default is "hey bro") to start the conversation.
 71 | - Engage in natural language interaction with the voice assistant.
 72 | - If timed out, trigger again.
 73 | - The conversation record is automatically saved at the end of the round.
 74 | 
 75 | ## Main Features
 76 | 
 77 | - Real-time voice interaction.
 78 | - Intelligent dialogue generation.
 79 | - Natural voice synthesis.
 80 | - Automatic sleep mode after a long period of no voice.
 81 | - Dialogue history recording.
 82 | - Custom settings.
 83 | 
 84 | ## Customization
 85 | 
 86 | - Modify the `settings` variable to customize the assistant's role and background.
 87 | - Adjust parameters such as `max_silence_duration` and `min_speech_duration` to optimize voice detection.
 88 | - Change the wake word and corresponding model files.
 89 | 
 90 | ## License
 91 | 
 92 | This project is licensed under the MIT License
 93 | 
 94 | ## Contact
 95 | 
 96 | Project Maintainer: HaxxorCialtion - cialtion@outlook.com
 97 | 
 98 | Bilibili Video Address: https://www.bilibili.com/video/BV1pftreQEbu
 99 | 
100 | ## Acknowledgements
101 | 
102 | - [Porcupine](https://github.com/Picovoice/porcupine) 
103 | - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) 
104 | - [Ollama](https://github.com/ollama/ollama) 
105 | - [LLM](https://github.com/QwenLM/Qwen2.5) 
106 | - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) 


--------------------------------------------------------------------------------
/README_zh.md:
--------------------------------------------------------------------------------
  1 | # 智能语音助手系统
  2 | 
  3 | ## 项目概述
  4 | 
  5 | 这是一个基于关键词检测、语音识别、语音合成和对话生成的智能语音助手系统。该系统能够通过特定的唤醒词（如"hey bro"）启动与用户的语音对话，并利用先进的自然语言处理技术提供智能回复。
  6 | 
  7 | ## 主要功能
  8 | 
  9 | 1. **关键词检测**：使用Porcupine实时监听唤醒词，启动对话。
 10 | 2. **语音录制和检测**：采用WebRTC VAD进行语音活动检测，录制有效语音片段。
 11 | 3. **语音识别（ASR）**：使用SenseVoice Small模型将录制的语音转换为文本。
 12 | 4. **对话生成**：调用Ollama API（兼容OpenAI API），根据上下文生成助手的文本回复。
 13 | 5. **语音合成（TTS）**：将助手的回复通过语音合成输出，模拟人声对话。
 14 | 6. **对话历史保存**：定期将对话内容保存为JSON文件，便于后续分析。
 15 | 
 16 | ## 技术栈
 17 | 
 18 | - Python
 19 | - Porcupine（关键词检测）
 20 | - WebRTC VAD（语音活动检测）
 21 | - SenseVoice Small（语音识别）
 22 | - Ollama API（对话生成）
 23 | - GPT-SoVITS（语音合成）
 24 | - PyAudio, NumPy, SciPy（音频处理）
 25 | 
 26 | ### TTS部分
 27 | 
 28 | - 本项目基于GPT-Sovits-v2，感谢开源社区工作者的贡献！
 29 | 
 30 | #### GPT-SoVITS TTS 项目
 31 | 
 32 | 这是一个基于GPT-SoVITS API的文本到语音（TTS）项目。该项目允许用户根据不同的情感生成和播放语音，使用预定义的参考音频来影响输出的语音风格。
 33 | 
 34 | ##### 功能特点
 35 | 
 36 | - 支持多种情感的语音生成（高兴、抑郁、激动、平静、纠结）
 37 | - 使用参考音频来控制语音风格
 38 | - 实时生成并播放WAV格式的音频文件
 39 | - 可自定义文本输入和输出文件名
 40 | - 提供TTS处理时间统计
 41 | - 可以添加一个情感识别模型来决定参考音频，从而控制音频合成情感(to do)
 42 | 
 43 | ## 使用指南
 44 | 
 45 | 1. 克隆仓库：
 46 | - git clone https://github.com/HaxxorCialtion/ASR_LLM_TTS_py.git
 47 | 
 48 | - cd intelligent-voice-assistant
 49 | 
 50 | 
 51 | 2. 安装依赖：
 52 | - pip install -r requirements.txt
 53 | 
 54 | 
 55 | 3. 准备必要的API密钥和模型：
 56 | - 获取Porcupine API密钥
 57 | - 下载SenseVoice Small模型文件
 58 | - 确保Ollama API服务已经运行
 59 | - 开启GPT-Sovits API服务
 60 | 
 61 | 4. 配置系统：
 62 | - 在脚本中填入Porcupine API密钥
 63 | - 设置ASR模型路径
 64 | - 配置Ollama API端点（默认为本地）
 65 | - 配置GPT-sovits 模型和参考音频
 66 | 
 67 | ## 使用方法
 68 | 
 69 | 1. 运行主脚本：
 70 | python main.py
 71 | 
 72 | 2. 等待系统提示"Listening for wake word..."
 73 | 
 74 | 3. 说出唤醒词（默认为"hey bro"）开始对话
 75 | 
 76 | 4. 与语音助手进行自然语言交互
 77 | 
 78 | 5. 超时则需再次触发
 79 | 
 80 | 6. 结束本轮对话即自动保存对话记录 
 81 | 
 82 | ## 主要特性
 83 | 
 84 | - 实时语音交互
 85 | - 智能对话生成
 86 | - 自然语音合成
 87 | - 长时间无语音自动休眠
 88 | - 对话历史记录
 89 | 
 90 | ## 自定义设置
 91 | 
 92 | - 修改`settings`变量来自定义助手的角色和背景
 93 | - 调整`max_silence_duration`和`min_speech_duration`等参数来优化语音检测
 94 | - 更换唤醒词和对应的模型文件
 95 | 
 96 | ## 许可证
 97 | 
 98 | 本项目采用 MIT 许可证 - 查看 [LICENSE](LICENSE) 文件了解详情
 99 | 
100 | ## 联系方式
101 | 
102 | 项目维护者：HaxxorCialtion - cialtion@outlook.com
103 | Bilibili视频地址：https://www.bilibili.com/video/BV1pftreQEbu
104 | 
105 | ## 致谢
106 | 
107 | - [Porcupine](https://github.com/Picovoice/porcupine) - 用于唤醒词检测
108 | - [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) - 提供ASR模型
109 | - [Ollama](https://github.com/ollama/ollama) - 本地大语言模型服务
110 | - [LLM](https://github.com/QwenLM/Qwen2.5) - LLLM服务
111 | - [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS) - 用于语音合成


--------------------------------------------------------------------------------
/gpt_sovits_api.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | import os
 3 | import time
 4 | from playsound import playsound
 5 | import pyaudio
 6 | import wave
 7 | 
 8 | # 基础URL
 9 | base_url = "http://127.0.0.1:9880"
10 | 
11 | def get_audio_file_path(emotion):
12 |     """获取参考音频的路径"""
13 |     paths = {
14 |         "高兴": "./参考音频/要吸收和消化掉这些对吧？我会努力的！.wav",
15 |         "抑郁": "./参考音频/虽然可惜依旧不能算正式干员的样子.wav",
16 |         "激动": "./参考音频/这下坏人的数量又减少了呢！都是多亏了博士！.wav",
17 |         "平静": "./参考音频/嗯？博士怎么放炮了几个坏人？我去替博士收拾一下吧。.wav",
18 |         "纠结": "./参考音频/不要吧，那我会很困扰啊。.wav"
19 |     }
20 |     return paths.get(emotion)
21 | 
22 | def change_reference_audio(emotion, prompt_language="zh"):
23 |     """切换参考音频并发送请求"""
24 |     refer_wav_path = get_audio_file_path(emotion)
25 |     filename = os.path.basename(refer_wav_path)
26 |     file_name_without_extension = os.path.splitext(filename)[0]
27 |     prompt_text = file_name_without_extension
28 |     data = {
29 |         "refer_wav_path": refer_wav_path,
30 |         "prompt_text": prompt_text,
31 |         "prompt_language": prompt_language
32 |     }
33 | 
34 |     response = requests.post(f"{base_url}/change_refer", json=data)
35 | def save_audio_from_response(url, data, output_file):
36 |     """执行推理并保存音频"""
37 |     try:
38 |         response = requests.post(url, json=data)
39 | 
40 |         if response.status_code == 200:
41 |             with open(output_file, "wb") as f:
42 |                 f.write(response.content)
43 |             return output_file
44 |         else:
45 |             print(response.status_code)
46 |     except Exception as e:
47 |         print(e)
48 | 
49 | def play_wav_file(wav_file):
50 |     # 打开WAV文件
51 |     wf = wave.open(wav_file, 'rb')
52 | 
53 |     # 创建PyAudio对象
54 |     p = pyaudio.PyAudio()
55 | 
56 |     # 打开音频流
57 |     stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
58 |                     channels=wf.getnchannels(),
59 |                     rate=wf.getframerate(),
60 |                     output=True)
61 | 
62 |     # 读取数据并播放
63 |     data = wf.readframes(1024)
64 |     while len(data) > 0:
65 |         stream.write(data)
66 |         data = wf.readframes(1024)
67 | 
68 |     # 停止和关闭音频流
69 |     stream.stop_stream()
70 |     stream.close()
71 | 
72 |     # 关闭PyAudio
73 |     p.terminate()
74 | 
75 | def gpt_sovits(temp_text="长时间没和我交流，已待机", emotion="高兴", output_file=f"temp_12.wav"):
76 |     t1 = time.time()
77 |     base_url = "http://127.0.0.1:9880"
78 |     url = f"{base_url}/"
79 |     data = {
80 |         "text": f"{temp_text}",
81 |         "text_language": "zh",
82 |         "cut_punc": "，。！？!、：；？.，、—‘’“”《》【】()[]{}「」『』‖｜…‥・﹏﹋﹌·･~－−—―「」『』〝〞",
83 |         # "cut_punc": "。",
84 |         "top_k": 20,
85 |         "top_p": 1.0,
86 |         "temperature": 1,
87 |         "speed": 1.0
88 |     }
89 |     change_reference_audio(emotion, "zh")
90 |     wav_file = save_audio_from_response(url, data, output_file)
91 |     t2 = time.time()
92 |     print(f"TTS耗时： {t2 - t1} seconds")
93 |     # 播放音频
94 |     play_wav_file(wav_file)
95 | 
96 | """
97 | $$ http://127.0.0.1:9880?text=晚上好，博士！&refer_wav_path=E:\AI_tools\resperpy\参考音频\虽然可惜依旧不能算正式干员的样子.wav&prompt_text=虽然可惜依旧不能算正式干员的样子&prompt_language=zh&text_language=zh&cut_punc=，。！？!、：；？.，、—‘’“”《》【】()[]{}「」『』‖｜…‥・﹏﹋﹌·･~－−—―「」『』〝〞&top_k=20&top_p=1.0&temperature=1&speed=1.0
98 | """


--------------------------------------------------------------------------------
/main_ollama.py:
--------------------------------------------------------------------------------
  1 | """
  2 | ASR LLM TTS 部分需要的模型和文件需要自己配置，详情参考README.md
  3 | """
  4 | import pvporcupine
  5 | import pyaudio
  6 | import numpy as np
  7 | from funasr_onnx import SenseVoiceSmall
  8 | from funasr_onnx.utils.postprocess_utils import rich_transcription_postprocess
  9 | from scipy.io.wavfile import write
 10 | import sounddevice as sd
 11 | import time
 12 | import webrtcvad
 13 | import os
 14 | import requests
 15 | import json
 16 | from datetime import datetime
 17 | import gpt_sovits_api
 18 | import soundfile as sf
 19 | 
 20 | def record_audio_vad(filename, sample_rate, vad):
 21 |     print("Recording started with VAD...")
 22 | 
 23 |     audio = []
 24 |     silence_frames = 0
 25 |     speech_frames = 0
 26 |     max_silence_duration = 1.5  # 增加到1.5秒
 27 |     min_speech_duration = 0.5  # 最小语音持续时间为0.5秒
 28 |     speech_started = False
 29 |     valid_speech = False
 30 | 
 31 |     stream = sd.InputStream(samplerate=sample_rate, channels=1, dtype='int16')
 32 |     with stream:
 33 |         while True:
 34 |             frame = stream.read(frame_length)[0]
 35 |             frame = frame.flatten()
 36 | 
 37 |             is_speech = vad.is_speech(frame.tobytes(), sample_rate)
 38 | 
 39 |             if is_speech:
 40 |                 speech_frames += 1
 41 |                 silence_frames = 0
 42 |                 if not speech_started:
 43 |                     speech_started = True
 44 |                     print("Speech detected, recording started.")
 45 |             else:
 46 |                 silence_frames += 1
 47 |                 if speech_started:
 48 |                     audio.append(frame)
 49 | 
 50 |             if speech_started:
 51 |                 audio.append(frame)
 52 | 
 53 |             # 检查是否达到最小语音持续时间
 54 |             if speech_frames * frame_duration / 1000 >= min_speech_duration:
 55 |                 valid_speech = True
 56 | 
 57 |             # 检查是否达到最大静音持续时间
 58 |             if silence_frames * frame_duration / 1000 > max_silence_duration:
 59 |                 if valid_speech:
 60 |                     print("Silence detected after valid speech. Stopping recording.")
 61 |                     break
 62 |                 else:
 63 |                     print("Short noise detected. Resetting.")
 64 |                     audio = []
 65 |                     silence_frames = 0
 66 |                     speech_frames = 0
 67 |                     speech_started = False
 68 |                     valid_speech = False
 69 | 
 70 |     if len(audio) > 0 and valid_speech:
 71 |         audio_data = np.concatenate(audio, axis=0)
 72 |         write(filename, sample_rate, audio_data)
 73 |         print(f"Recording finished. Saved to {filename}")
 74 |         return filename
 75 |     else:
 76 |         print("No valid speech detected. No audio file saved.")
 77 |         return None
 78 | 
 79 | def transcribe_audio(wav_file, model):
 80 |     print(f"Processing audio file {wav_file}...")
 81 |     t1 = time.time()
 82 |     res = model([wav_file], language="zh", use_itn=True)
 83 |     transcription = [rich_transcription_postprocess(i) for i in res]
 84 |     print(f"ASR耗时: {time.time() - t1} seconds")
 85 |     return transcription[0]  # 返回第一个（也是唯一的）转录结果
 86 | 
 87 | def tts(temp_text="你好", emotion="抑郁", output_file=f"temp_12.wav"):
 88 |     gpt_sovits_api.gpt_sovits(temp_text, emotion, output_file)
 89 | 
 90 | def dp_chat(message: str, stream=False):
 91 |     global conversation_history
 92 |     t1 = time.time()
 93 | 
 94 |     conversation_history.append({"role": "user", "content": message})
 95 | 
 96 |     payload = {
 97 |         "model": "qwen2.5",
 98 |         "messages": conversation_history,
 99 |         "stream": stream,
100 |     }
101 |     url = "http://localhost:11434/api/chat"
102 |     response = requests.post(url, json=payload)
103 |     if stream:
104 |         # 逐行读取流式响应内容
105 |         for line in response.iter_lines():
106 |             if line:
107 |                 # 尝试将每一行解析为 JSON
108 |                 try:
109 |                     data = json.loads(line.decode('utf-8'))
110 |                     print(data)  # 打印每一个 JSON 数据块
111 |                 except json.JSONDecodeError as e:
112 |                     print(f"JSON 解析失败: {e}")
113 |     else:
114 |         response_json = response.json()
115 |         assistant_response = response_json.get("message", {}).get("content", "")
116 | 
117 |     t2 = time.time()
118 |     print(f"API response time: {t2 - t1} seconds")
119 | 
120 |     conversation_history.append({"role": "assistant", "content": assistant_response})
121 | 
122 |     tts(assistant_response)
123 | 
124 |     return assistant_response
125 | 
126 | def play_audio(file_path):
127 |     """触发唤醒词后，播放hello文件"""
128 |     data, fs = sf.read(file_path, dtype='float32')  # 读取音频文件
129 |     sd.play(data, fs)  # 播放音频
130 |     sd.wait()  # 等待音频播放结束
131 | 
132 | def continuous_conversation(model, vad, sleep_time=10):
133 |     max_silence_duration = sleep_time  # 最大静音时长为 60 秒
134 |     while True:
135 |         audio_filename = "input_audio.wav"
136 |         start_time = time.time()
137 | 
138 |         # 记录开始录音的时间
139 |         recorded_file = record_audio_vad(audio_filename, sample_rate, vad)
140 | 
141 |         if recorded_file is None:
142 |             print("No valid speech detected. Please try again.")
143 |             continue
144 | 
145 |         # 如果超过了指定时间没有检测到有效语音，退出对话
146 |         if time.time() - start_time > max_silence_duration:
147 |             print("No speech detected for 60 seconds. Conversation ended.")
148 |             play_audio("./sleep.wav")
149 |             break
150 | 
151 |         # 语音转文字
152 |         transcription_result = transcribe_audio(recorded_file, model)
153 |         if transcription_result.lower() in ['退出', '结束对话', 'exit', 'quit']:
154 |             print("对话结束")
155 |             break
156 | 
157 |         # 生成助手回复
158 |         response = dp_chat(transcription_result)
159 |         print("User:", transcription_result)
160 |         print("Assistant:", response)
161 | 
162 | def save_conversation_history():
163 |     if not os.path.exists("conversation_logs"):
164 |         os.makedirs("conversation_logs")
165 | 
166 |     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
167 |     filename = f"conversation_logs/conversation_{timestamp}.json"
168 | 
169 |     with open(filename, 'w', encoding='utf-8') as f:
170 |         json.dump(conversation_history, f, ensure_ascii=False, indent=2)
171 | 
172 |     print(f"Conversation history saved to {filename}")
173 | 
174 | def start_service():
175 |     print("Initializing Porcupine and ASR model...")
176 | 
177 |     # 加载ASR模型
178 |     model = SenseVoiceSmall(model_dir, batch_size=10, quantize=True)
179 | 
180 |     # 检测音频流中是否存在语音，它可以区分音频信号中的语音和背景噪音
181 |     vad = webrtcvad.Vad()
182 |     vad.set_mode(2)  # 参数范围0-3，越大越严格，越能忽略更多的背景噪音
183 | 
184 |     pa = pyaudio.PyAudio()
185 |     stream = pa.open(rate=porcupine.sample_rate,    # 音频流的采样率
186 |                      channels=1,    # 音频流的通道数
187 |                      format=pyaudio.paInt16,    # 音频数据的格式
188 |                      input=True,    # 指定这个音频流是用于输入(采集音频)
189 |                      frames_per_buffer=porcupine.frame_length)  # 缓冲区大小，表示每次从音频输入设备读取多少帧
190 | 
191 |     print("Listening for wake word...")
192 | 
193 |     try:
194 |         while True:
195 |             pcm = stream.read(porcupine.frame_length)   # 从麦克风音频输入流中读取指定长度的音频数据
196 |             pcm = np.frombuffer(pcm, dtype=np.int16)
197 | 
198 |             keyword_index = porcupine.process(pcm)      # 调用process方法检测是否包含唤醒词
199 |             if keyword_index >= 0:
200 |                 print("Wake word detected! Starting conversation...")
201 |                 stream.stop_stream()    # 上文已经获取到唤醒词，则先停止当前音频流，准备之后的对话
202 |                 # 额外功能：播放当前文件夹下的hello.wav
203 |                 play_audio("hello.wav")
204 |                 continuous_conversation(model, vad)
205 |                 stream.start_stream()
206 |                 print("Conversation ended. Listening for wake word again...")
207 |     except KeyboardInterrupt:
208 |         print("Stopping service...")
209 |     finally:
210 |         stream.close()
211 |         pa.terminate()
212 |         save_conversation_history()
213 | 
214 | 
215 | if __name__ == "__main__":
216 |     # 配置Porcupine关键词检测
217 |     porcupine = pvporcupine.create(
218 |         access_key=f"输入你的API",
219 |         keyword_paths=["./hey-bro_en_windows_v3_0_0/hey-bro_en_windows_v3_0_0.ppn"]
220 |     )
221 | 
222 |     # ASR模型路径
223 |     model_dir = "conversation_logs/sensevoice-small-onnx-quant"
224 |     sample_rate = 16000
225 |     frame_duration = 30
226 |     frame_length = int(sample_rate * frame_duration / 1000)
227 | 
228 |     settings = ("你的名字是水月，我是博士，我从事理论与计算化学的工作，需要学习数学物理化学计算机的交叉知识，你是我的助手，你和博士现在都还不够优秀，"
229 |                 "你们会在之后的学习生涯中慢慢变得更加优秀。")
230 |     # 全局变量用于存储对话历史
231 |     conversation_history = [
232 |         {"role": "system", "content": f"你将扮演一个和我用语音聊天的对象，回复就和正常说话一样，得简短。{settings}"},
233 |         {"role": "user", "content": "我是谁？你又是谁？"},
234 |         {"role": "assistant", "content": "你是博士，我是水月"}
235 |     ]
236 | 
237 |     start_service()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | requests
 2 | playsound
 3 | pyaudio
 4 | numpy
 5 | funasr-onnx
 6 | scipy
 7 | sounddevice
 8 | webrtcvad
 9 | soundfile
10 | 


--------------------------------------------------------------------------------