├── .gitignore
├── .python-version
├── README.md
├── docs
│   └── vibevoice-demo-caption.gif
├── pyproject.toml
├── requirements.txt
└── src
    └── vibevoice
        ├── __init__.py
        ├── cli.py
        ├── loading_indicator.py
        └── server.py

/.gitignore:
--------------------------------------------------------------------------------
1 | # Python-generated files
2 | __pycache__/
3 | *.py[oc]
4 | build/
5 | dist/
6 | wheels/
7 | *.egg-info
8 | 
9 | # Virtual environments
10 | .venv
11 | 
--------------------------------------------------------------------------------
/.python-version:
--------------------------------------------------------------------------------
1 | 3.12
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Vibevoice 🎙️
2 | 
3 | Hi, I'm [Marc Päpper](https://x.com/mpaepper) and I wanted to vibe code like [Karpathy](https://x.com/karpathy/status/1886192184808149383) ;D, so I looked around and found the cool work of [Vlad](https://github.com/vlad-ds/whisper-keyboard). I extended it to run with a local Whisper model, so I don't need to pay for OpenAI tokens.
4 | I hope you have fun with it!
5 | 
6 | ## What it does 🚀
7 | 
8 | ![Demo Video](docs/vibevoice-demo-caption.gif)
9 | 
10 | Simply run `cli.py` and start dictating text anywhere in your system:
11 | 1. Hold down the right Control key (`ctrl_r`)
12 | 2. Speak your text
13 | 3. Release the key
14 | 4. Watch as your spoken words are transcribed and automatically typed!
15 | 
16 | Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
17 | 
18 | NEW: LLM voice command mode:
19 | 
20 | 1. Hold down the Scroll Lock key (`scroll_lock`) - I chose it because it is rarely used these days
21 | 2. Speak what you want the LLM to do
22 | 3. The LLM receives your transcribed text and a screenshot of your current view
23 | 4. The LLM's answer is typed out at your cursor as it streams in
24 | 
25 | Works everywhere on your system, and the LLM always has your screen as context.
26 | 
27 | ## Installation 🛠️
28 | 
29 | ```bash
30 | git clone https://github.com/mpaepper/vibevoice.git
31 | cd vibevoice
32 | pip install -r requirements.txt
33 | python src/vibevoice/cli.py
34 | ```
35 | 
36 | ## Requirements 📋
37 | 
38 | ### Python Dependencies
39 | - Python 3.12 or higher
40 | 
41 | ### System Requirements
42 | - CUDA-capable GPU (recommended) -> you can switch to CPU mode in `server.py`
43 | - CUDA 12.x
44 | - cuBLAS
45 | - cuDNN 9.x
46 | - If you get the error `OSError: PortAudio library not found`, run `sudo apt install libportaudio2`
47 | - [Ollama](https://ollama.com) for AI command mode (with multimodal models for screenshot support)
48 | 
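If you want to sanity-check the GPU stack before the first run, a quick script like the one below helps. It is a minimal sketch that only assumes the dependencies from `requirements.txt` are installed:

```python
# Sanity check for the CUDA stack that faster-whisper / CTranslate2 will use.
import torch
import ctranslate2

print("torch CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
# CTranslate2 is the inference backend of faster-whisper; it should see at least one CUDA device.
print("CTranslate2 CUDA devices:", ctranslate2.get_cuda_device_count())
```

If either check reports no GPU, fix the driver/CUDA installation first (see below) - otherwise the transcription server will fail when it tries to load the model on `cuda`.
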
49 | #### Setting up Ollama
50 | 1. Install Ollama by following the instructions at [ollama.com](https://ollama.com)
51 | 2. Pull a model that supports both text and images for best results:
52 | ```bash
53 | ollama pull gemma3:27b # Great model which can run on RTX 3090 or similar
54 | ```
55 | 3. Make sure Ollama is running in the background:
56 | ```bash
57 | ollama serve
58 | ```
59 | 
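You can verify that Ollama is reachable and that the configured model has been pulled by querying its local API. This is a small sketch against Ollama's standard local endpoint (`http://localhost:11434`); adjust the model name if you changed `OLLAMA_MODEL`:

```python
# Check that the local Ollama API responds and that the desired model is available.
import os
import requests

model = os.getenv("OLLAMA_MODEL", "gemma3:27b")
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
available = [m["name"] for m in tags.get("models", [])]
print("Ollama is reachable, pulled models:", available)
print("Configured model present:", model in available)
```
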
60 | #### Handling the CUDA requirements
61 | 
62 | * Make sure that you have CUDA >= 12.4 and cuDNN >= 9.x
63 | * I had some trouble at first with Ubuntu 24.04, so I did the following:
64 | 
65 | ```bash
66 | sudo apt update && sudo apt upgrade
67 | sudo apt autoremove nvidia* --purge
68 | ubuntu-drivers devices
69 | sudo ubuntu-drivers autoinstall
70 | wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
71 | sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
72 | sudo apt install cuda-toolkit-12-8
73 | ```
74 | or alternatively:
75 | 
76 | ```bash
77 | wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
78 | sudo dpkg -i cuda-keyring_1.1-1_all.deb
79 | sudo apt update
80 | sudo apt install cudnn9-cuda-12
81 | ```
82 | 
83 | * Then after rebooting, it worked well.
84 | 
85 | ## Usage 💡
86 | 
87 | 1. Start the application:
88 | ```bash
89 | python src/vibevoice/cli.py
90 | ```
91 | 
92 | 2. Hold down the right Control key (`ctrl_r`) while speaking
93 | 3. Release to transcribe
94 | 4. Your text appears wherever your cursor is!
95 | 
96 | ### Configuration
97 | 
98 | You can customize various aspects of VibeVoice with the following environment variables:
99 | 
100 | #### Keyboard Controls
101 | - `VOICEKEY`: Change the dictation activation key (default: "ctrl_r")
102 | ```bash
103 | export VOICEKEY="ctrl_l" # Use left control instead
104 | ```
105 | - `VOICEKEY_CMD`: Set the key for AI command mode (default: "scroll_lock")
106 | ```bash
107 | export VOICEKEY_CMD="ctrl_l" # Use left Control instead of the Scroll Lock key
108 | ```
109 | 
110 | #### AI and Screenshot Features
111 | - `OLLAMA_MODEL`: Specify which Ollama model to use (default: "gemma3:27b")
112 | ```bash
113 | export OLLAMA_MODEL="gemma3:4b" # Use a smaller VLM if you have less GPU RAM
114 | ```
115 | - `INCLUDE_SCREENSHOT`: Enable or disable screenshots in AI command mode (default: "true")
116 | ```bash
117 | export INCLUDE_SCREENSHOT="false" # Disable screenshots (they stay on your machine either way)
118 | ```
119 | - `SCREENSHOT_MAX_WIDTH`: Set the maximum width for screenshots (default: "1024")
120 | ```bash
121 | export SCREENSHOT_MAX_WIDTH="800" # Smaller screenshots
122 | ```
123 | 
124 | #### Screenshot Dependencies
125 | To use the screenshot functionality:
126 | ```bash
127 | sudo apt install gnome-screenshot
128 | ```
129 | 
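The values of `VOICEKEY` and `VOICEKEY_CMD` are looked up in `pynput.keyboard.Key` by `cli.py`, so they have to match one of pynput's key names. If you are not sure what a key is called, you can list the available names with pynput (already installed as a dependency); this is just a helper sketch, not part of the project:

```python
# List every special-key name that VOICEKEY / VOICEKEY_CMD can be set to.
from pynput.keyboard import Key

print(sorted(key.name for key in Key))
# Typical entries include 'ctrl_l', 'ctrl_r', 'scroll_lock', 'alt_gr', 'f13', ...
```
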
130 | ## Usage Modes 💡
131 | 
132 | VibeVoice supports two modes:
133 | 
134 | ### 1. Dictation Mode
135 | 1. Hold down the dictation key (default: right Control)
136 | 2. Speak your text
137 | 3. Release to transcribe
138 | 4. Your text appears wherever your cursor is!
139 | 
140 | ### 2. AI Command Mode
141 | 1. Hold down the command key (default: Scroll Lock)
142 | 2. Ask a question or give a command
143 | 3. Release the key
144 | 4. The AI will analyze your request (and current screen if enabled) and type a response
145 | 
146 | ## Credits 🙏
147 | 
148 | - Original inspiration: [whisper-keyboard](https://github.com/vlad-ds/whisper-keyboard) by Vlad
149 | - [Faster Whisper](https://github.com/guillaumekln/faster-whisper) for the optimized Whisper implementation
150 | - Built by [Marc Päpper](https://www.paepper.com)
151 | 
--------------------------------------------------------------------------------
/docs/vibevoice-demo-caption.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mpaepper/vibevoice/1137e9629beebb6c3a56202cfadd56e20f4d1ee8/docs/vibevoice-demo-caption.gif
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [build-system]
2 | requires = ["hatchling"]
3 | build-backend = "hatchling.build"
4 | 
5 | [project]
6 | name = "vibevoice"
7 | version = "0.1.0"
8 | description = "Voice-to-text using local Whisper model with keyboard input"
9 | readme = "README.md"
10 | requires-python = ">=3.12"
11 | authors = [
12 |     { name = "Marc Päpper" }
13 | ]
14 | dependencies = [
15 |     "fastapi==0.104.1",
16 |     "uvicorn==0.23.2",
17 |     "faster-whisper==1.1.1",
18 |     "ctranslate2==4.5.0",
19 |     "torch==2.4.0+cu124",
20 |     "torchaudio==2.4.0+cu124",
21 |     "python-multipart==0.0.7",
22 |     "python-dotenv==1.0.0",
23 |     "sounddevice==0.4.6",
24 |     "numpy>=1.26.0",
25 |     "requests==2.32.3",
26 |     "pynput==1.7.6",
27 |     "scipy==1.11.3"
28 | ]
29 | 
30 | [project.scripts]
31 | vibevoice = "vibevoice.cli:main"
32 | 
33 | [tool.hatch.build.targets.wheel]
34 | packages = ["src/vibevoice"]
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi==0.104.1
2 | uvicorn==0.23.2
3 | faster-whisper==1.1.1
4 | --extra-index-url https://download.pytorch.org/whl/cu124
5 | ctranslate2==4.5.0
6 | torch==2.4.0+cu124
7 | torchaudio==2.4.0+cu124
8 | python-multipart==0.0.7
9 | python-dotenv==1.0.0
10 | sounddevice==0.4.6
11 | numpy>=1.26.0
12 | requests==2.32.3
13 | pynput==1.7.6
14 | scipy==1.11.3
15 | pyautogui==0.9.54
16 | Pillow==11.1.0
--------------------------------------------------------------------------------
/src/vibevoice/__init__.py:
--------------------------------------------------------------------------------
1 | """Voice-to-text using local Whisper model with keyboard input"""
2 | 
3 | __version__ = "0.1.0"
--------------------------------------------------------------------------------
/src/vibevoice/cli.py:
--------------------------------------------------------------------------------
1 | """Command-line interface for vibevoice"""
2 | 
3 | import os
4 | import subprocess
5 | import time
6 | import json
7 | import sounddevice as sd
8 | import numpy as np
9 | import requests
10 | import sys
11 | import base64
12 | 
13 | SCREENSHOT_AVAILABLE = False
14 | try:
15 |     import pyautogui
16 |     from PIL import Image
17 |     SCREENSHOT_AVAILABLE = True
18 | except ImportError as e:
19 |     print(f"Screenshot functionality not available: {e}")
20 |     print("Install the optional dependencies with: pip install pyautogui Pillow")
21 | 
22 | from pynput.keyboard import Controller as KeyboardController, Key, Listener, KeyCode
23 | from scipy.io import wavfile
24 | from dotenv import load_dotenv
25 | 
26 | from loading_indicator import LoadingIndicator
27 | 
28 | loading_indicator = LoadingIndicator()
29 | 
30 | def start_whisper_server():
31 |     server_script = os.path.join(os.path.dirname(__file__), 'server.py')
32 |     process = subprocess.Popen([sys.executable, server_script])  # use the same interpreter/venv as the CLI
33 |     return process
34 | 
35 | def wait_for_server(timeout=1800, interval=0.5):
36 |     start_time = time.time()
37 |     while time.time() - start_time < timeout:
38 |         try:
39 |             response = requests.get('http://localhost:4242/health')
40 |             if response.status_code == 200:
41 |                 return True
42 |         except requests.exceptions.RequestException:
43 |             pass
44 |         time.sleep(interval)
45 |     raise TimeoutError("Server failed to start within timeout")
46 | 
47 | def capture_screenshot():
48 |     """Capture a screenshot, save it, and return the path and base64 data."""
49 |     if not SCREENSHOT_AVAILABLE:
50 |         print("Screenshot functionality not available. Install it with: pip install pyautogui Pillow")
51 |         return None, None
52 | 
53 |     try:
54 |         screenshot_path = os.path.abspath('screenshot.png')
55 |         print(f"Capturing screenshot to: {screenshot_path}")
56 | 
57 |         screenshot = pyautogui.screenshot()
58 | 
59 |         max_width = int(os.getenv('SCREENSHOT_MAX_WIDTH', '1024'))
60 |         width, height = screenshot.size
61 | 
62 |         if width > max_width:
63 |             ratio = max_width / width
64 |             new_width = max_width
65 |             new_height = int(height * ratio)
66 |             screenshot = screenshot.resize((new_width, new_height))
67 | 
68 |         screenshot.save(screenshot_path)
69 | 
70 |         with open(screenshot_path, "rb") as image_file:
71 |             base64_data = base64.b64encode(image_file.read()).decode('utf-8')
72 | 
73 |         return screenshot_path, base64_data
74 |     except Exception as e:
75 |         print(f"Error capturing screenshot: {e}")
76 |         return None, None
77 | 
78 | def _process_llm_cmd(keyboard_controller, transcript):
79 |     """Process transcript with Ollama and type the response."""
80 | 
81 |     try:
82 |         loading_indicator.show(message=f"Processing: {transcript}")
83 | 
84 |         model = os.getenv('OLLAMA_MODEL', 'gemma3:27b')
85 |         include_screenshot = os.getenv('INCLUDE_SCREENSHOT', 'true').lower() == 'true'
86 | 
87 |         screenshot_path, screenshot_base64 = (None, None)
88 |         if include_screenshot and SCREENSHOT_AVAILABLE:
89 |             screenshot_path, screenshot_base64 = capture_screenshot()
90 | 
91 |         user_prompt = transcript.strip()
92 | 
93 |         system_prompt = """You are a voice-controlled AI assistant. The user is talking to their computer using voice commands.
94 | Your responses will be directly typed into the user's keyboard at their cursor position, so:
95 | 1. Be concise and to the point, but friendly and engaging - prefer shorter answers
96 | 2. Focus on answering the specific question or request
97 | 3. Don't use introductory phrases like "Here's..." or "Based on the screenshot..."
98 | 4. Don't include formatting like bullet points, which might look strange when typed
99 | 5. If you see a screenshot, analyze it and use it to inform your response
100 | 6. Never apologize for limitations or explain what you're doing"""
101 | 
102 |         if screenshot_base64:
103 |             url = "http://localhost:11434/api/generate"
104 |             payload = {
105 |                 "model": model,
106 |                 "prompt": user_prompt,
107 |                 "system": system_prompt,
108 |                 "stream": True,
109 |                 "images": [screenshot_base64]  # Pass base64 data directly without data URI prefix
110 |             }
111 |             print(f"Sending request with screenshot to model: {model}")
112 |         else:
113 |             url = "http://localhost:11434/api/generate"
114 |             payload = {
115 |                 "model": model,
116 |                 "prompt": user_prompt,
117 |                 "system": system_prompt,
118 |                 "stream": True
119 |             }
120 |             print("Sending text-only request")
121 | 
122 |         response = requests.post(url, json=payload, stream=True)
123 |         response.raise_for_status()
124 | 
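        # The streaming reply from Ollama arrives as newline-delimited JSON; each line is one
        # chunk, roughly of this shape (illustrative example, trimmed to the fields used here):
        #   {"model": "gemma3:27b", "response": " next piece of text", "done": false}
        # The loop below pulls the "response" field out of every chunk and types it immediately,
        # so the answer shows up at the cursor while the model is still generating.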
125 |         for line in response.iter_lines():
126 |             if line:
127 |                 data = line.decode('utf-8')
128 |                 if data.startswith('{'):
129 |                     chunk = json.loads(data)
130 |                     if 'response' in chunk:
131 |                         chunk_text = chunk['response']
132 |                         print(f"Debug - received chunk: {repr(chunk_text)}")
133 | 
134 |                         # Replace smart/curly quotes with standard apostrophes
135 |                         # U+2018 and U+2019 are both replaced with the standard apostrophe (')
136 |                         normalized_text = chunk_text.replace('\u2019', "'").replace('\u2018', "'")
137 | 
138 |                         keyboard_controller.type(normalized_text)
139 |                         loading_indicator.hide()
140 | 
141 |         return "Successfully processed with Ollama"
142 |     except requests.exceptions.RequestException as e:
143 |         print(f"Error calling Ollama: {e}")
144 |     finally:
145 |         loading_indicator.hide()
146 | 
147 | def main():
148 |     load_dotenv()
149 |     key_label = os.environ.get("VOICEKEY", "ctrl_r")
150 |     cmd_label = os.environ.get("VOICEKEY_CMD", "scroll_lock")
151 |     RECORD_KEY = Key[key_label]
152 |     CMD_KEY = Key[cmd_label]
153 |     # CMD_KEY = KeyCode(vk=65027)  # This is how you can use non-standard keys, this is AltGr for me
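    # If your preferred key has no name in pynput.keyboard.Key, you can find its virtual key
    # code with a tiny probe like the one below (run it separately, press the key, note the
    # number that is printed) and then plug that number into KeyCode(vk=...) above.
    #
    #   from pynput import keyboard
    #   def show(key):
    #       print(key, getattr(key, "vk", None))
    #   with keyboard.Listener(on_press=show) as probe:
    #       probe.join()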
154 | 
155 |     recording = False
156 |     audio_data = []
157 |     sample_rate = 16000
158 |     keyboard_controller = KeyboardController()
159 | 
160 |     def on_press(key):
161 |         nonlocal recording, audio_data
162 |         if (key == RECORD_KEY or key == CMD_KEY) and not recording:
163 |             recording = True
164 |             audio_data = []
165 |             print("Listening...")
166 | 
167 |     def on_release(key):
168 |         nonlocal recording, audio_data
169 |         if key == RECORD_KEY or key == CMD_KEY:
170 |             recording = False
171 |             print("Transcribing...")
172 | 
173 |             try:
174 |                 audio_data_np = np.concatenate(audio_data, axis=0)
175 |             except ValueError as e:
176 |                 print(e)
177 |                 return
178 | 
179 |             recording_path = os.path.abspath('recording.wav')
180 |             audio_data_int16 = (audio_data_np * np.iinfo(np.int16).max).astype(np.int16)
181 |             wavfile.write(recording_path, sample_rate, audio_data_int16)
182 | 
183 |             try:
184 |                 response = requests.post('http://localhost:4242/transcribe/',
185 |                                          json={'file_path': recording_path})
186 |                 response.raise_for_status()
187 |                 transcript = response.json()['text']
188 | 
189 |                 if transcript and key == RECORD_KEY:
190 |                     processed_transcript = transcript + " "
191 |                     print(processed_transcript)
192 |                     keyboard_controller.type(processed_transcript)
193 |                 elif transcript and key == CMD_KEY:
194 |                     _process_llm_cmd(keyboard_controller, transcript)
195 |             except requests.exceptions.RequestException as e:
196 |                 print(f"Error sending request to local API: {e}")
197 |             except Exception as e:
198 |                 print(f"Error processing transcript: {e}")
199 | 
200 |     def callback(indata, frames, time, status):
201 |         if status:
202 |             print(status)
203 |         if recording:
204 |             audio_data.append(indata.copy())
205 | 
206 |     server_process = start_whisper_server()
207 | 
208 |     try:
209 |         print("Waiting for the server to be ready...")
210 |         wait_for_server()
211 |         print(f"vibevoice is active. Hold down {key_label} to start dictating.")
212 |         with Listener(on_press=on_press, on_release=on_release) as listener:
213 |             with sd.InputStream(callback=callback, channels=1, samplerate=sample_rate):
214 |                 listener.join()
215 |     except TimeoutError as e:
216 |         print(f"Error: {e}")
217 |         server_process.terminate()
218 |         sys.exit(1)
219 |     except KeyboardInterrupt:
220 |         print("\nStopping...")
221 |     finally:
222 |         server_process.terminate()
223 | 
224 | if __name__ == "__main__":
225 |     main()
226 | 
--------------------------------------------------------------------------------
/src/vibevoice/loading_indicator.py:
--------------------------------------------------------------------------------
1 | import subprocess
2 | import threading
3 | import time
4 | import os
5 | import platform
6 | 
7 | class LoadingIndicator:
8 |     def __init__(self):
9 |         self._notification_shown = False
10 |         self._stop_event = threading.Event()
11 |         self._thread = None
12 | 
13 |     def show(self, message="Processing your request..."):
14 |         if self._thread is not None:
15 |             return
16 | 
17 |         self._stop_event.clear()
18 |         self._thread = threading.Thread(target=self._show_notification, args=(message,))
19 |         self._thread.daemon = True
20 |         self._thread.start()
21 | 
22 |     def hide(self):
23 |         if self._thread is not None:
24 |             self._stop_event.set()
25 |             self._thread.join()
26 |             self._thread = None
27 | 
28 |     def _show_notification(self, message):
29 |         try:
30 |             system = platform.system()
31 | 
32 |             if system == "Linux":
33 |                 # Show notification on Linux
34 |                 subprocess.run([
35 |                     "notify-send",
36 |                     "--icon=info",
37 |                     "VibeVoice",
38 |                     message,
39 |                     "--expire-time=10000"  # 10 seconds
40 |                 ])
41 |             elif system == "Darwin":  # macOS
42 |                 # Show notification on macOS
43 |                 subprocess.run([
44 |                     "osascript",
45 |                     "-e",
46 |                     f'display notification "{message}" with title "VibeVoice"'
47 |                 ])
48 |             elif system == "Windows":
49 |                 # Show notification on Windows (requires win10toast or similar)
50 |                 try:
51 |                     from win10toast import ToastNotifier
52 |                     toaster = ToastNotifier()
53 |                     toaster.show_toast("VibeVoice",
54 |                                        message,
55 |                                        duration=10,
56 |                                        threaded=True)
57 |                 except ImportError:
58 |                     # Fallback to print if win10toast not installed
59 |                     print("Processing request (install win10toast for notifications)...")
60 | 
61 |             # Wait until stop event or timeout
62 |             self._stop_event.wait(10)  # Wait up to 10 seconds
63 | 
64 |         except Exception as e:
65 |             print(f"Error showing notification: {e}")
--------------------------------------------------------------------------------
/src/vibevoice/server.py:
--------------------------------------------------------------------------------
1 | """FastAPI server for Whisper transcription"""
2 | 
3 | import uvicorn
4 | from fastapi import FastAPI
5 | from pydantic import BaseModel
6 | from faster_whisper import WhisperModel
7 | 
8 | app = FastAPI()
9 | 
10 | model = WhisperModel("large", device="cuda", compute_type="float16")
11 | # Enable in case you want to run on CPU, but it's much slower
12 | #model = WhisperModel("medium", device="cpu", compute_type="int8")
13 | 
14 | class TranscribeRequest(BaseModel):
15 |     file_path: str
16 | 
17 | @app.get("/health")
18 | def health_check():
19 |     return {"status": "ok"}
20 | 
21 | @app.post("/transcribe/")
22 | async def transcribe(request: TranscribeRequest):
23 |     segments, info = model.transcribe(request.file_path)
24 |     text = " ".join([segment.text.strip() for segment in segments])
25 |     return {"text": text}
26 | 
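# Example of calling the server directly, e.g. to test it without the CLI (a sketch that
# assumes the server is already running on this machine and the WAV file exists):
#
#   import requests
#   r = requests.post("http://localhost:4242/transcribe/",
#                     json={"file_path": "/absolute/path/to/recording.wav"})
#   print(r.json()["text"])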
27 | def run_server():
28 |     uvicorn.run(app, host="0.0.0.0", port=4242)
29 | 
30 | if __name__ == "__main__":
31 |     run_server()
32 | 
--------------------------------------------------------------------------------