├── .gitignore
├── .python-version
├── README.md
├── docs
│   └── vibevoice-demo-caption.gif
├── pyproject.toml
├── requirements.txt
└── src
    └── vibevoice
        ├── __init__.py
        ├── cli.py
        ├── loading_indicator.py
        └── server.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # Python-generated files
2 | __pycache__/
3 | *.py[oc]
4 | build/
5 | dist/
6 | wheels/
7 | *.egg-info
8 | 
9 | # Virtual environments
10 | .venv
11 | 
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
1 | 3.12
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Vibevoice 🎙️
2 | 
3 | Hi, I'm [Marc Päpper](https://x.com/mpaepper) and I wanted to vibe code like [Karpathy](https://x.com/karpathy/status/1886192184808149383) ;D, so I looked around and found the cool work of [Vlad](https://github.com/vlad-ds/whisper-keyboard). I extended it to run with a local Whisper model, so I don't need to pay for OpenAI tokens.
4 | I hope you have fun with it!
5 | 
6 | ## What it does 🚀
7 | 
8 | ![Demo Video](docs/vibevoice-demo-caption.gif)
9 | 
10 | Simply run `cli.py` and start dictating text anywhere in your system:
11 | 1. Hold down the right Control key (`ctrl_r`)
12 | 2. Speak your text
13 | 3. Release the key
14 | 4. Watch as your spoken words are transcribed and automatically typed!
15 | 
16 | Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
17 | 
18 | NEW: LLM voice command mode:
19 | 
20 | 1. Hold down the Scroll Lock key (`scroll_lock`) - I chose it because it is rarely used these days
21 | 2. Speak what you want the LLM to do
22 | 3. The LLM receives your transcribed text and a screenshot of your current view
23 | 4. The LLM's answer is typed out at your cursor as it streams in
24 | 
25 | Works everywhere on your system, and the LLM always has your screen as context.
26 | 
27 | ## Installation 🛠️
28 | 
29 | ```bash
30 | git clone https://github.com/mpaepper/vibevoice.git
31 | cd vibevoice
32 | pip install -r requirements.txt
33 | python src/vibevoice/cli.py
34 | ```
35 | 
36 | ## Requirements 📋
37 | 
38 | ### Python Dependencies
39 | - Python 3.12 or higher
40 | 
41 | ### System Requirements
42 | - CUDA-capable GPU (recommended) -> you can switch to CPU mode in `server.py`
43 | - CUDA 12.x
44 | - cuBLAS
45 | - cuDNN 9.x
46 | - If you get the error `OSError: PortAudio library not found`, run `sudo apt install libportaudio2`
47 | - [Ollama](https://ollama.com) for AI command mode (with multimodal models for screenshot support)
48 | 
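If you want to sanity-check the GPU stack before the first run, a quick script like the one below helps. It is a minimal sketch that only assumes the dependencies from `requirements.txt` are installed:

```python
# Sanity check for the CUDA stack that faster-whisper / CTranslate2 will use.
import torch
import ctranslate2

print("torch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
# CTranslate2 is the inference backend of faster-whisper; it should see at least one CUDA device.
print("CTranslate2 CUDA devices:", ctranslate2.get_cuda_device_count())
```

If either check reports no GPU, fix the driver/CUDA installation first (see below) - otherwise the transcription server will fail when it tries to load the model on `cuda`.
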
49 | #### Setting up Ollama
50 | 1. Install Ollama by following the instructions at [ollama.com](https://ollama.com)
51 | 2. Pull a model that supports both text and images for best results:
52 | ```bash
53 | ollama pull gemma3:27b # Great model which can run on RTX 3090 or similar
54 | ```
55 | 3. Make sure Ollama is running in the background:
56 | ```bash
57 | ollama serve
58 | ```
59 | 
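You can verify that Ollama is reachable and that the configured model has been pulled by querying its local API. This is a small sketch against Ollama's standard local endpoint (`http://localhost:11434`); adjust the model name if you changed `OLLAMA_MODEL`:

```python
# Check that the local Ollama API responds and that the desired model is available.
import os
import requests

model = os.getenv("OLLAMA_MODEL", "gemma3:27b")
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
available = [m["name"] for m in tags.get("models", [])]
print("Ollama is reachable, pulled models:", available)
print("Configured model present:", model in available)
```
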
60 | #### Handling the CUDA requirements
61 | 
62 | * Make sure that you have CUDA >= 12.4 and cuDNN >= 9.x
63 | * I had some trouble at first with Ubuntu 24.04, so I did the following:
64 | 
65 | ```bash
66 | sudo apt update && sudo apt upgrade
67 | sudo apt autoremove nvidia* --purge
68 | ubuntu-drivers devices
69 | sudo ubuntu-drivers autoinstall
70 | wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
71 | sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
72 | sudo apt install cuda-toolkit-12-8
73 | ```
74 | or alternatively:
75 | 
76 | ```bash
77 | wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
78 | sudo dpkg -i cuda-keyring_1.1-1_all.deb
79 | sudo apt update
80 | sudo apt install cudnn9-cuda-12
81 | ```
82 | 
83 | * Then after rebooting, it worked well.
84 | 
85 | ## Usage 💡
86 | 
87 | 1. Start the application:
88 | ```bash
89 | python src/vibevoice/cli.py
90 | ```
91 | 
92 | 2. Hold down the right Control key (`ctrl_r`) while speaking
93 | 3. Release to transcribe
94 | 4. Your text appears wherever your cursor is!
95 | 
96 | ### Configuration
97 | 
98 | You can customize various aspects of VibeVoice with the following environment variables:
99 | 
100 | #### Keyboard Controls
101 | - `VOICEKEY`: Change the dictation activation key (default: "ctrl_r")
102 | ```bash
103 | export VOICEKEY="ctrl_l" # Use left control instead
104 | ```
105 | - `VOICEKEY_CMD`: Set the key for AI command mode (default: "scroll_lock")
106 | ```bash
107 | export VOICEKEY_CMD="ctrl_l" # Use left Control instead of the Scroll Lock key
108 | ```
109 | 
110 | #### AI and Screenshot Features
111 | - `OLLAMA_MODEL`: Specify which Ollama model to use (default: "gemma3:27b")
112 | ```bash
113 | export OLLAMA_MODEL="gemma3:4b" # Use a smaller VLM if you have less GPU RAM
114 | ```
115 | - `INCLUDE_SCREENSHOT`: Enable or disable screenshots in AI command mode (default: "true")
116 | ```bash
117 | export INCLUDE_SCREENSHOT="false" # Disable screenshots (they stay on your machine either way)
118 | ```
119 | - `SCREENSHOT_MAX_WIDTH`: Set the maximum width for screenshots (default: "1024")
120 | ```bash
121 | export SCREENSHOT_MAX_WIDTH="800" # Smaller screenshots
122 | ```
123 | 
124 | #### Screenshot Dependencies
125 | To use the screenshot functionality:
126 | ```bash
127 | sudo apt install gnome-screenshot
128 | ```
129 | 
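The values of `VOICEKEY` and `VOICEKEY_CMD` are looked up in `pynput.keyboard.Key` by `cli.py`, so they have to match one of pynput's key names. If you are not sure what a key is called, you can list the available names with pynput (already installed as a dependency); this is just a helper sketch, not part of the project:

```python
# List every special-key name that VOICEKEY / VOICEKEY_CMD can be set to.
from pynput.keyboard import Key

print(sorted(key.name for key in Key))
# Typical entries include 'ctrl_l', 'ctrl_r', 'scroll_lock', 'alt_gr', 'f13', ...
```
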
130 | ## Usage Modes 💡
131 | 
132 | VibeVoice supports two modes:
133 | 
134 | ### 1. Dictation Mode
135 | 1. Hold down the dictation key (default: right Control)
136 | 2. Speak your text
137 | 3. Release to transcribe
138 | 4. Your text appears wherever your cursor is!
139 | 
140 | ### 2. AI Command Mode
141 | 1. Hold down the command key (default: Scroll Lock)
142 | 2. Ask a question or give a command
143 | 3. Release the key
144 | 4. The AI will analyze your request (and current screen if enabled) and type a response
145 | 
146 | ## Credits 🙏
147 | 
148 | - Original inspiration: [whisper-keyboard](https://github.com/vlad-ds/whisper-keyboard) by Vlad
149 | - [Faster Whisper](https://github.com/guillaumekln/faster-whisper) for the optimized Whisper implementation
150 | - Built by [Marc Päpper](https://www.paepper.com)
151 | 
--------------------------------------------------------------------------------
/docs/vibevoice-demo-caption.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpaepper/vibevoice/1137e9629beebb6c3a56202cfadd56e20f4d1ee8/docs/vibevoice-demo-caption.gif
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["hatchling"]
3 | build-backend = "hatchling.build"
4 | 
5 | [project]
6 | name = "vibevoice"
7 | version = "0.1.0"
8 | description = "Voice-to-text using local Whisper model with keyboard input"
9 | readme = "README.md"
10 | requires-python = ">=3.12"
11 | authors = [
12 |     { name = "Marc Päpper" }
13 | ]
14 | dependencies = [
15 |     "fastapi==0.104.1",
16 |     "uvicorn==0.23.2",
17 |     "faster-whisper==1.1.1",
18 |     "ctranslate2==4.5.0",
19 |     "torch==2.4.0+cu124",
20 |     "torchaudio==2.4.0+cu124",
21 |     "python-multipart==0.0.7",
22 |     "python-dotenv==1.0.0",
23 |     "sounddevice==0.4.6",
24 |     "numpy>=1.26.0",
25 |     "requests==2.32.3",
26 |     "pynput==1.7.6",
27 |     "scipy==1.11.3"
28 | ]
29 | 
30 | [project.scripts]
31 | vibevoice = "vibevoice.cli:main"
32 | 
33 | [tool.hatch.build.targets.wheel]
34 | packages = ["src/vibevoice"]
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi==0.104.1
2 | uvicorn==0.23.2
3 | faster-whisper==1.1.1
4 | --extra-index-url https://download.pytorch.org/whl/cu124
5 | ctranslate2==4.5.0
6 | torch==2.4.0+cu124
7 | torchaudio==2.4.0+cu124
8 | python-multipart==0.0.7
9 | python-dotenv==1.0.0
10 | sounddevice==0.4.6
11 | numpy>=1.26.0
12 | requests==2.32.3
13 | pynput==1.7.6
14 | scipy==1.11.3
15 | pyautogui==0.9.54
16 | Pillow==11.1.0
--------------------------------------------------------------------------------
/src/vibevoice/__init__.py:
--------------------------------------------------------------------------------
1 | """Voice-to-text using local Whisper model with keyboard input"""
2 | 
3 | __version__ = "0.1.0"
--------------------------------------------------------------------------------
/src/vibevoice/cli.py:
--------------------------------------------------------------------------------
1 | """Command-line interface for vibevoice"""
2 | 
3 | import os
4 | import subprocess
5 | import time
6 | import json
7 | import sounddevice as sd
8 | import numpy as np
9 | import requests
10 | import sys
11 | import base64
12 | 
13 | SCREENSHOT_AVAILABLE = False
14 | try:
15 |     import pyautogui
16 |     from PIL import Image
17 |     SCREENSHOT_AVAILABLE = True
18 | except ImportError as e:
19 |     print(f"Screenshot functionality not available: {e}")
20 |     print("Install the optional dependencies with: pip install pyautogui Pillow")
21 | 
22 | from pynput.keyboard import Controller as KeyboardController, Key, Listener, KeyCode
23 | from scipy.io import wavfile
24 | from dotenv import load_dotenv
25 | 
26 | from loading_indicator import LoadingIndicator
27 | 
28 | loading_indicator = LoadingIndicator()
29 | 
30 | def start_whisper_server():
31 |     server_script = os.path.join(os.path.dirname(__file__), 'server.py')
32 |     process = subprocess.Popen([sys.executable, server_script])  # use the same interpreter/venv as the CLI
33 |     return process
34 | 
35 | def wait_for_server(timeout=1800, interval=0.5):
36 |     start_time = time.time()
37 |     while time.time() - start_time < timeout:
38 |         try:
39 |             response = requests.get('http://localhost:4242/health')
40 |             if response.status_code == 200:
41 |                 return True
42 |         except requests.exceptions.RequestException:
43 |             pass
44 |         time.sleep(interval)
45 |     raise TimeoutError("Server failed to start within timeout")
46 | 
47 | def capture_screenshot():
48 |     """Capture a screenshot, save it, and return the path and base64 data."""
49 |     if not SCREENSHOT_AVAILABLE:
50 |         print("Screenshot functionality not available. Install it with: pip install pyautogui Pillow")
51 |         return None, None
52 | 
53 |     try:
54 |         screenshot_path = os.path.abspath('screenshot.png')
55 |         print(f"Capturing screenshot to: {screenshot_path}")
56 | 
57 |         screenshot = pyautogui.screenshot()
58 | 
59 |         max_width = int(os.getenv('SCREENSHOT_MAX_WIDTH', '1024'))
60 |         width, height = screenshot.size
61 | 
62 |         if width > max_width:
63 |             ratio = max_width / width
64 |             new_width = max_width
65 |             new_height = int(height * ratio)
66 |             screenshot = screenshot.resize((new_width, new_height))
67 | 
68 |         screenshot.save(screenshot_path)
69 | 
70 |         with open(screenshot_path, "rb") as image_file:
71 |             base64_data = base64.b64encode(image_file.read()).decode('utf-8')
72 | 
73 |         return screenshot_path, base64_data
74 |     except Exception as e:
75 |         print(f"Error capturing screenshot: {e}")
76 |         return None, None
77 | 
78 | def _process_llm_cmd(keyboard_controller, transcript):
79 |     """Process transcript with Ollama and type the response."""
80 | 
81 |     try:
82 |         loading_indicator.show(message=f"Processing: {transcript}")
83 | 
84 |         model = os.getenv('OLLAMA_MODEL', 'gemma3:27b')
85 |         include_screenshot = os.getenv('INCLUDE_SCREENSHOT', 'true').lower() == 'true'
86 | 
87 |         screenshot_path, screenshot_base64 = (None, None)
88 |         if include_screenshot and SCREENSHOT_AVAILABLE:
89 |             screenshot_path, screenshot_base64 = capture_screenshot()
90 | 
91 |         user_prompt = transcript.strip()
92 | 
93 |         system_prompt = """You are a voice-controlled AI assistant. The user is talking to their computer using voice commands.
94 | Your responses will be directly typed into the user's keyboard at their cursor position, so:
95 | 1. Be concise and to the point, but friendly and engaging - prefer shorter answers
96 | 2. Focus on answering the specific question or request
97 | 3. Don't use introductory phrases like "Here's..." or "Based on the screenshot..."
98 | 4. Don't include formatting like bullet points, which might look strange when typed
99 | 5. If you see a screenshot, analyze it and use it to inform your response
100 | 6. Never apologize for limitations or explain what you're doing"""
101 | 
102 |         if screenshot_base64:
103 |             url = "http://localhost:11434/api/generate"
104 |             payload = {
105 |                 "model": model,
106 |                 "prompt": user_prompt,
107 |                 "system": system_prompt,
108 |                 "stream": True,
109 |                 "images": [screenshot_base64]  # Pass base64 data directly without data URI prefix
110 |             }
111 |             print(f"Sending request with screenshot to model: {model}")
112 |         else:
113 |             url = "http://localhost:11434/api/generate"
114 |             payload = {
115 |                 "model": model,
116 |                 "prompt": user_prompt,
117 |                 "system": system_prompt,
118 |                 "stream": True
119 |             }
120 |             print("Sending text-only request")
121 | 
122 |         response = requests.post(url, json=payload, stream=True)
123 |         response.raise_for_status()
124 | 
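        # The streaming reply from Ollama arrives as newline-delimited JSON; each line is one
        # chunk, roughly of this shape (illustrative example, trimmed to the fields used here):
        #   {"model": "gemma3:27b", "response": " next piece of text", "done": false}
        # The loop below pulls the "response" field out of every chunk and types it immediately,
        # so the answer shows up at the cursor while the model is still generating.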
125 |         for line in response.iter_lines():
126 |             if line:
127 |                 data = line.decode('utf-8')
128 |                 if data.startswith('{'):
129 |                     chunk = json.loads(data)
130 |                     if 'response' in chunk:
131 |                         chunk_text = chunk['response']
132 |                         print(f"Debug - received chunk: {repr(chunk_text)}")
133 | 
134 |                         # Replace smart/curly quotes with standard apostrophes
135 |                         # U+2018 and U+2019 are both replaced with the standard apostrophe (')
136 |                         normalized_text = chunk_text.replace('\u2019', "'").replace('\u2018', "'")
137 | 
138 |                         keyboard_controller.type(normalized_text)
139 |                         loading_indicator.hide()
140 | 
141 |         return "Successfully processed with Ollama"
142 |     except requests.exceptions.RequestException as e:
143 |         print(f"Error calling Ollama: {e}")
144 |     finally:
145 |         loading_indicator.hide()
146 | 
147 | def main():
148 |     load_dotenv()
149 |     key_label = os.environ.get("VOICEKEY", "ctrl_r")
150 |     cmd_label = os.environ.get("VOICEKEY_CMD", "scroll_lock")
151 |     RECORD_KEY = Key[key_label]
152 |     CMD_KEY = Key[cmd_label]
153 |     # CMD_KEY = KeyCode(vk=65027)  # This is how you can use non-standard keys, this is AltGr for me
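    # If your preferred key has no name in pynput.keyboard.Key, you can find its virtual key
    # code with a tiny probe like the one below (run it separately, press the key, note the
    # number that is printed) and then plug that number into KeyCode(vk=...) above.
    #
    #   from pynput import keyboard
    #   def show(key):
    #       print(key, getattr(key, "vk", None))
    #   with keyboard.Listener(on_press=show) as probe:
    #       probe.join()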
154 | 
155 |     recording = False
156 |     audio_data = []
157 |     sample_rate = 16000
158 |     keyboard_controller = KeyboardController()
159 | 
160 |     def on_press(key):
161 |         nonlocal recording, audio_data
162 |         if (key == RECORD_KEY or key == CMD_KEY) and not recording:
163 |             recording = True
164 |             audio_data = []
165 |             print("Listening...")
166 | 
167 |     def on_release(key):
168 |         nonlocal recording, audio_data
169 |         if key == RECORD_KEY or key == CMD_KEY:
170 |             recording = False
171 |             print("Transcribing...")
172 | 
173 |             try:
174 |                 audio_data_np = np.concatenate(audio_data, axis=0)
175 |             except ValueError as e:
176 |                 print(e)
177 |                 return
178 | 
179 |             recording_path = os.path.abspath('recording.wav')
180 |             audio_data_int16 = (audio_data_np * np.iinfo(np.int16).max).astype(np.int16)
181 |             wavfile.write(recording_path, sample_rate, audio_data_int16)
182 | 
183 |             try:
184 |                 response = requests.post('http://localhost:4242/transcribe/',
185 |                                          json={'file_path': recording_path})
186 |                 response.raise_for_status()
187 |                 transcript = response.json()['text']
188 | 
189 |                 if transcript and key == RECORD_KEY:
190 |                     processed_transcript = transcript + " "
191 |                     print(processed_transcript)
192 |                     keyboard_controller.type(processed_transcript)
193 |                 elif transcript and key == CMD_KEY:
194 |                     _process_llm_cmd(keyboard_controller, transcript)
195 |             except requests.exceptions.RequestException as e:
196 |                 print(f"Error sending request to local API: {e}")
197 |             except Exception as e:
198 |                 print(f"Error processing transcript: {e}")
199 | 
200 |     def callback(indata, frames, time, status):
201 |         if status:
202 |             print(status)
203 |         if recording:
204 |             audio_data.append(indata.copy())
205 | 
206 |     server_process = start_whisper_server()
207 | 
208 |     try:
209 |         print("Waiting for the server to be ready...")
210 |         wait_for_server()
211 |         print(f"vibevoice is active. Hold down {key_label} to start dictating.")
212 |         with Listener(on_press=on_press, on_release=on_release) as listener:
213 |             with sd.InputStream(callback=callback, channels=1, samplerate=sample_rate):
214 |                 listener.join()
215 |     except TimeoutError as e:
216 |         print(f"Error: {e}")
217 |         server_process.terminate()
218 |         sys.exit(1)
219 |     except KeyboardInterrupt:
220 |         print("\nStopping...")
221 |     finally:
222 |         server_process.terminate()
223 | 
224 | if __name__ == "__main__":
225 |     main()
226 | 
--------------------------------------------------------------------------------
/src/vibevoice/loading_indicator.py:
--------------------------------------------------------------------------------
1 | import subprocess
2 | import threading
3 | import time
4 | import os
5 | import platform
6 | 
7 | class LoadingIndicator:
8 |     def __init__(self):
9 |         self._notification_shown = False
10 |         self._stop_event = threading.Event()
11 |         self._thread = None
12 | 
13 |     def show(self, message="Processing your request..."):
14 |         if self._thread is not None:
15 |             return
16 | 
17 |         self._stop_event.clear()
18 |         self._thread = threading.Thread(target=self._show_notification, args=(message,))
19 |         self._thread.daemon = True
20 |         self._thread.start()
21 | 
22 |     def hide(self):
23 |         if self._thread is not None:
24 |             self._stop_event.set()
25 |             self._thread.join()
26 |             self._thread = None
27 | 
28 |     def _show_notification(self, message):
29 |         try:
30 |             system = platform.system()
31 | 
32 |             if system == "Linux":
33 |                 # Show notification on Linux
34 |                 subprocess.run([
35 |                     "notify-send",
36 |                     "--icon=info",
37 |                     "VibeVoice",
38 |                     message,
39 |                     "--expire-time=10000"  # 10 seconds
40 |                 ])
41 |             elif system == "Darwin":  # macOS
42 |                 # Show notification on macOS
43 |                 subprocess.run([
44 |                     "osascript",
45 |                     "-e",
46 |                     f'display notification "{message}" with title "VibeVoice"'
47 |                 ])
48 |             elif system == "Windows":
49 |                 # Show notification on Windows (requires win10toast or similar)
50 |                 try:
51 |                     from win10toast import ToastNotifier
52 |                     toaster = ToastNotifier()
53 |                     toaster.show_toast("VibeVoice",
54 |                                        message,
55 |                                        duration=10,
56 |                                        threaded=True)
57 |                 except ImportError:
58 |                     # Fallback to print if win10toast not installed
59 |                     print("Processing request (install win10toast for notifications)...")
60 | 
61 |             # Wait until stop event or timeout
62 |             self._stop_event.wait(10)  # Wait up to 10 seconds
63 | 
64 |         except Exception as e:
65 |             print(f"Error showing notification: {e}")
--------------------------------------------------------------------------------
/src/vibevoice/server.py:
--------------------------------------------------------------------------------
1 | """FastAPI server for Whisper transcription"""
2 | 
3 | import uvicorn
4 | from fastapi import FastAPI
5 | from pydantic import BaseModel
6 | from faster_whisper import WhisperModel
7 | 
8 | app = FastAPI()
9 | 
10 | model = WhisperModel("large", device="cuda", compute_type="float16")
11 | # Enable in case you want to run on CPU, but it's much slower
12 | #model = WhisperModel("medium", device="cpu", compute_type="int8")
13 | 
14 | class TranscribeRequest(BaseModel):
15 |     file_path: str
16 | 
17 | @app.get("/health")
18 | def health_check():
19 |     return {"status": "ok"}
20 | 
21 | @app.post("/transcribe/")
22 | async def transcribe(request: TranscribeRequest):
23 |     segments, info = model.transcribe(request.file_path)
24 |     text = " ".join([segment.text.strip() for segment in segments])
25 |     return {"text": text}
26 | 
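# Example of calling the server directly, e.g. to test it without the CLI (a sketch that
# assumes the server is already running on this machine and the WAV file exists):
#
#   import requests
#   r = requests.post("http://localhost:4242/transcribe/",
#                     json={"file_path": "/absolute/path/to/recording.wav"})
#   print(r.json()["text"])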
27 | def run_server():
28 |     uvicorn.run(app, host="0.0.0.0", port=4242)
29 | 
30 | if __name__ == "__main__":
31 |     run_server()
32 | 
--------------------------------------------------------------------------------