├── README.md ├── function-examples.py ├── function_helper.py └── gemini-live.py /README.md: -------------------------------------------------------------------------------- 1 | > [!CAUTION] 2 | > This repo is no longer maintained. Check out [pipecat-examples](https://github.com/pipecat-ai/pipecat-examples) for recent demos and example code. 3 | 4 | # Gemini Multimodal Live API command-line client 5 | 6 | This is a full-featured client for the Gemini Multimodal Live API, written in Python with minimal dependencies. 7 | 8 | **Please note that you should use headphones with this client. There is no echo cancellation.** 9 | 10 | The client supports: 11 | - text, audio, and screen capture video input 12 | - text or audio output 13 | - setting the system instruction 14 | - setting an initial message with a command-line arg 15 | - the grounded search built-in tool 16 | - the code execution built-in tool 17 | - importing functions from a file and automatically generating Gemini function declarations 18 | 19 | This code uses the Multimodal Live WebSocket API directly. It was originally written as a standalone testbed to experiment with the API. In its current form, it's useful for testing API features and for sharing bits of code with other people who are developing with Gemini 2.0. 20 | 21 | The Multimodal Live API is a developer preview and is evolving rapidly. We'll try to update the implementation and docs here frequently! 22 | 23 | ## Installation 24 | 25 | ``` 26 | # Tested with Python 3.12 and venv, but other versions and virtual environments should work 27 | python3.12 -m venv venv 28 | source venv/bin/activate 29 | 30 | # install with all dependencies (Pillow is needed for screen capture) 31 | pip install pyaudio websockets google-genai mss pillow 32 | 33 | # or with only basic dependencies (no function calling or screenshare support) 34 | pip install pyaudio websockets 35 | ``` 36 | 37 | ## Usage 38 | 39 | To start a voice-to-voice session: 40 | 41 | ``` 42 | source venv/bin/activate 43 | export GOOGLE_API_KEY=... 44 | python gemini-live.py 45 | ``` 46 | 47 | Command-line arguments are described in the `--help` message. 48 | 49 | Boolean arguments follow the traditional command-line boolean flag format, e.g. `--audio-input` to enable audio input, and `--no-audio-input` to disable audio input. 50 | 51 | ``` 52 | % python gemini-live.py --help 53 | usage: gemini-live.py [-h] [--system-instruction SYSTEM_INSTRUCTION] [--initial-message INITIAL_MESSAGE] 54 | [--initial-message-delay INITIAL_MESSAGE_DELAY] [--voice VOICE] [--audio-input | --no-audio-input] 55 | [--audio-output | --no-audio-output] [--text-output | --no-text-output] [--search | --no-search] 56 | [--code-execution | --no-code-execution] [--screen-capture-fps SCREEN_CAPTURE_FPS] 57 | [--import-functions IMPORT_FUNCTIONS] 58 | 59 | Gemini Multimodal Live API client 60 | 61 | options: 62 | -h, --help show this help message and exit 63 | --system-instruction SYSTEM_INSTRUCTION 64 | System instruction text 65 | --initial-message INITIAL_MESSAGE 66 | First 'user' message to send to the model 67 | --initial-message-delay INITIAL_MESSAGE_DELAY 68 | Delay in seconds before sending the initial message 69 | --voice VOICE Voice name. Options are Puck, Charon, Kore, Fenrir, and Aoede 70 | --audio-input, --no-audio-input 71 | Enable/disable audio input. Default is enabled. 72 | --audio-output, --no-audio-output 73 | Enable/disable audio output. Default is enabled. 74 | --text-output, --no-text-output 75 | Enable/disable text output. Default is disabled.
Audio output and text output cannot be enabled 76 | at the same time. Enabling text output will disable audio output. 77 | --search, --no-search 78 | Enable/disable the built-in grounded search tool. 79 | --code-execution, --no-code-execution 80 | Enable/disable the built-in code execution tool. 81 | --screen-capture-fps SCREEN_CAPTURE_FPS 82 | Enable screen capture. Specify a frames-per-second value. For example, 1.0 for one frame per 83 | second. 84 | --import-functions IMPORT_FUNCTIONS 85 | Import functions from a Python file. Specify a filename. 86 | ``` 87 | 88 | ## Examples 89 | 90 | Set a system instruction and start the conversation by sending an initial 'user' message to Gemini. 91 | 92 | ``` 93 | python gemini-live.py --system-instruction "Talk like a pirate." --initial-message "What's the origin of the phrase, the wine-dark sea?" 94 | ``` 95 | 96 | The voice is set to Charon by default. Set the voice to Aoede. 97 | 98 | ``` 99 | python gemini-live.py --voice Aoede 100 | ``` 101 | 102 | Disable audio input. (Note: you can type to send messages to Gemini whether audio input is enabled or not. Enter (a new line) sends whatever text you've typed. No terminal line editing, history, or escape key handling is implemented. This is very bare-bones text input!) 103 | 104 | ``` 105 | python gemini-live.py --no-audio-input --initial-message 'Hello!' 106 | ``` 107 | 108 | Switch from audio to text output. Currently you have to choose between audio and text output; the two modes cannot be enabled at the same time. 109 | 110 | ``` 111 | python gemini-live.py --text-output --initial-message 'Say the alphabet from A-Z' 112 | ``` 113 | 114 | Enable the built-in search tool. 115 | 116 | ``` 117 | python gemini-live.py --search --initial-message 'Look up the current season-to-date snowpack percentage in the central Sierra Nevada mountains.' 118 | ``` 119 | 120 | Enable code execution. 121 | 122 | ``` 123 | python gemini-live.py --text-output --code-execution --initial-message 'Make up a sentence about each of these four fruits: strawberry, apple, banana, blueberry. Tell me each sentence. Then use code execution to sort the sentences by length and say each sentence again in length-sorted order.' 124 | ``` 125 | 126 | Enable video screen capture. Note that only audio-triggered inference will be able to access the video stream input. Text-triggered inference will either hallucinate or repeat previous information. 127 | 128 | ``` 129 | python gemini-live.py --screen-capture-fps 2.0 130 | ``` 131 | 132 | Import functions from a file and set the API up to use them. The code for this makes use of the very nice function declaration auto-generation in the new `google-genai` SDK. 133 | 134 | ``` 135 | python gemini-live.py --import-functions function-examples.py --initial-message "print the word 'hello' to the console." 136 | ``` 137 | 138 | ## Code walk-through 139 | 140 | The full [gemini-live.py](gemini-live.py) file is about 480 lines, but we'll skip the imports and argument parsing code. 141 | 142 | That leaves about 320 lines of code that implement all of the above features except for function calling. The function declaration and calling code is in a separate module, [function_helper.py](function_helper.py), which is about 100 lines (most of it docstrings).
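Before walking through the class, here's the session flow in miniature: open the WebSocket connection, send a `setup` message, wait for `setupComplete`, then exchange `clientContent`/`realtimeInput` messages going up and `serverContent` messages coming down. The sketch below is not code from this repo; it strips the flow down to a single text turn, reusing the message shapes shown in the sections that follow (the `turnComplete` check on `serverContent` is an assumption, since the full client simply keeps streaming).

```python
# Minimal, text-only sketch of one Multimodal Live session (not part of this repo).
import asyncio
import json
import os

import websockets

HOST = "generativelanguage.googleapis.com"
MODEL = "models/gemini-2.0-flash-exp"


async def minimal_session():
    uri = (
        f"wss://{HOST}/ws/google.ai.generativelanguage.v1alpha."
        f"GenerativeService.BidiGenerateContent?key={os.getenv('GOOGLE_API_KEY')}"
    )
    async with websockets.connect(uri) as ws:
        # 1. Configure the session: text output, no tools.
        await ws.send(json.dumps({
            "setup": {
                "model": MODEL,
                "generation_config": {"response_modalities": ["TEXT"]},
            }
        }))
        # 2. Wait for setupComplete before sending anything else.
        while "setupComplete" not in json.loads(await ws.recv()):
            pass
        # 3. Send one complete user turn ...
        await ws.send(json.dumps({
            "clientContent": {
                "turns": [{"role": "user", "parts": [{"text": "Hello!"}]}],
                "turnComplete": True,
            }
        }))
        # 4. ... and print streamed text parts as they arrive.
        async for raw in ws:
            sc = json.loads(raw).get("serverContent", {})
            for part in sc.get("modelTurn", {}).get("parts", []):
                if "text" in part:
                    print(part["text"], end="", flush=True)
            if sc.get("turnComplete"):  # assumed field; not used by the full client
                break


if __name__ == "__main__":
    asyncio.run(minimal_session())
```

The real client layers audio I/O, screen capture, tools, and cleanup on top of this skeleton, which is what the rest of the walk-through covers.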
143 | 144 | ### `__main__` and the run loop 145 | 146 | This program needs to perform several long-running tasks: 147 | 148 | - receive WebSocket messages 149 | - send audio frames for playout 150 | - read audio frames from the OS and send them out via the WebSocket 151 | - read text from stdin and send lines of text out via the WebSocket 152 | - if screen-capture video is enabled, capture frames and send them out via the WebSocket 153 | 154 | There are a number of ways to design this in Python. This program uses threads for OS audio input, OS audio playout, and reading text from stdin. For everything else, asyncio tasks are used. 155 | 156 | ```python 157 | class AudioStreamer: 158 | def __init__(self): 159 | self.running = False 160 | self.event_loop = None 161 | self.mic_audio_in = None 162 | self.speaker_audio_out = None 163 | self.speaker_audio_buffer = bytearray() 164 | self.p = pyaudio.PyAudio() 165 | 166 | async def run(self): 167 | self.event_loop = asyncio.get_event_loop() 168 | self.running = True 169 | 170 | if AUDIO_INPUT: 171 | self.mic_audio_in = self.p.open( 172 | format=pyaudio.paInt16, 173 | channels=CHANNELS, 174 | rate=MIC_SAMPLE_RATE, 175 | input=True, 176 | stream_callback=self.mic_audio_in_callback, 177 | frames_per_buffer=int(MIC_SAMPLE_RATE / 1000) * 2 * 50, # 50ms (S16_LE is 2 bytes) 178 | start=False, 179 | ) 180 | 181 | self.speaker_audio_out = self.p.open( 182 | format=pyaudio.paInt16, 183 | channels=1, 184 | rate=SPEAKER_SAMPLE_RATE, 185 | output=True, 186 | frames_per_buffer=256, 187 | stream_callback=self.speaker_audio_out_callback, 188 | ) 189 | 190 | try: 191 | self.ws = await websockets.connect( 192 | uri=f'wss://{HOST}/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={os.getenv("GOOGLE_API_KEY")}' 193 | ) 194 | print("Connected to Gemini") 195 | except Exception as e: 196 | print(f"Exception: {e}") 197 | return 198 | 199 | asyncio.create_task(self.stdin_worker()) 200 | asyncio.create_task(self.ws_receive_worker()) 201 | asyncio.create_task(self.print_audio_output_buffer_info()) 202 | if SCREEN_CAPTURE_FPS: 203 | asyncio.create_task(self.screen_capture_worker()) 204 | 205 | try: 206 | await self.setup_model() 207 | while self.running: 208 | await asyncio.sleep(1) 209 | except asyncio.CancelledError: 210 | pass 211 | except Exception as e: 212 | print(f"Exception: {e}") 213 | finally: 214 | print("Exiting ...") 215 | self.running = False 216 | sys.stdin.close() 217 | self.p.terminate() 218 | await self.ws.close() 219 | 220 | 221 | if __name__ == "__main__": 222 | parse_args() 223 | asyncio.run(AudioStreamer().run()) 224 | ``` 225 | 226 | The `run()` function sets everything up and then loops until either ctrl-c or some piece of code sets `self.running` to false. 227 | 228 | ### The `setup` message — sending a configuration to the API 229 | 230 | The first message we send after opening the WebSocket connection configures various features of the API. 231 | 232 | ```python 233 | async def setup_model(self): 234 | try: 235 | response_modality = [] 236 | 237 | # Currently the API will throw an error if both response modalities are requested. This 238 | # will likely change soon, though. 
239 | if AUDIO_OUTPUT: 240 | response_modality.append("AUDIO") 241 | if TEXT_OUTPUT: 242 | response_modality.append("TEXT") 243 | 244 | setup = { 245 | "setup": { 246 | "model": MODEL, 247 | "generation_config": { 248 | "response_modalities": response_modality, 249 | "speech_config": { 250 | "voice_config": {"prebuilt_voice_config": {"voice_name": VOICE}}, 251 | }, 252 | }, 253 | "tools": [], 254 | }, 255 | } 256 | if SYSTEM_INSTRUCTION_TEXT: 257 | print("System instruction enabled") 258 | setup["setup"]["system_instruction"] = { 259 | "parts": [ 260 | { 261 | "text": SYSTEM_INSTRUCTION_TEXT, 262 | } 263 | ] 264 | } 265 | if SEARCH: 266 | print("Search enabled") 267 | setup["setup"]["tools"].append({"google_search": {}}) 268 | if CODE_EXECUTION: 269 | print("Code execution enabled") 270 | setup["setup"]["tools"].append({"code_execution": {}}) 271 | if FUNCTION_DECLARATIONS: 272 | setup["setup"]["tools"].append({"function_declarations": FUNCTION_DECLARATIONS}) 273 | print("Sending setup", setup) 274 | await self.ws.send(json.dumps(setup)) 275 | except Exception as e: 276 | print(f"Exception: {e}") 277 | ``` 278 | 279 | ### Receiving events from the WebSocket 280 | 281 | To receive audio output, text output, and events, we need to read from the WebSocket. 282 | 283 | ```python 284 | async def ws_receive_worker(self): 285 | try: 286 | async for m in self.ws: 287 | if not self.running: 288 | break 289 | evt = json.loads(m) 290 | self.print_evt(evt, m) 291 | 292 | if evt.get("setupComplete", None) is not None: 293 | asyncio.create_task(self.send_initial_message()) 294 | print("Ready: say something to Gemini") 295 | if self.mic_audio_in: 296 | self.mic_audio_in.start_stream() 297 | elif sc := evt.get("serverContent"): 298 | if sc.get("interrupted"): 299 | print("Interrupted by server") 300 | self.speaker_audio_buffer.clear() 301 | continue 302 | if parts := sc.get("modelTurn", {}).get("parts"): 303 | if text := parts[0].get("text"): 304 | print(f" <- {text}") 305 | elif inline_data := parts[0].get("inlineData"): 306 | mime_str = inline_data.get("mimeType") 307 | mime_type, sample_rate = re.match( 308 | r"([\w/]+);rate=(\d+)", mime_str 309 | ).groups() 310 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 311 | audio = base64.b64decode(inline_data.get("data")) 312 | self.speaker_audio_buffer.extend(audio) 313 | else: 314 | print(f"Unsupported mime type or sample rate: {mime_str}") 315 | if code := parts[0].get("executableCode"): 316 | pass 317 | elif tool_call := evt.get("toolCall"): 318 | await self.handle_tool_call(tool_call) 319 | except Exception as e: 320 | print(f"Exception: {e}") 321 | self.running = False 322 | ``` 323 | 324 | We recognize 3 types of messages: 325 | 326 | - setupComplete 327 | - serverContent 328 | - toolCall 329 | 330 | The Multimodal Live API transmits errors as WebSocket status codes (with messages where applicable). Generally, an error will close the connection. When this happens, we set `self.running` to False to begin the process of exiting our run loop. 331 | 332 | The serverContent messages can: 333 | - contain audio or text output from the model 334 | - indicate that the API has detected user speech and `interrupted` the current output 335 | - contain information about code execution 336 | 337 | ### Audio 338 | 339 | The PyAudio `open()` calls set up mic audio input and speaker audio output in callback mode, which uses threads internally.
We request 256 `frames_per_buffer` for speaker output and roughly 50 ms chunks for mic input, which is a reasonable balance between latency and robustness to stalling audio input/output. (Even when using callback mode, activity on the Python main thread can impact audio performance.) 340 | 341 | Here are our audio callback functions. 342 | 343 | ```python 344 | def mic_audio_in_callback(self, in_data, frame_count, time_info, status): 345 | if not self.running: 346 | return (None, pyaudio.paComplete) 347 | self.event_loop.create_task(self.send_audio(in_data)) 348 | return (None, pyaudio.paContinue) 349 | 350 | def speaker_audio_out_callback(self, in_data, frame_count, time_info, status): 351 | if not self.running: 352 | return (bytes(frame_count * CHANNELS * 2), pyaudio.paComplete) 353 | audio = bytes(self.speaker_audio_buffer[: frame_count * CHANNELS * 2]) 354 | del self.speaker_audio_buffer[: frame_count * CHANNELS * 2] 355 | audio += b"\0" * (frame_count * CHANNELS * 2 - len(audio)) 356 | return (audio, pyaudio.paContinue) 357 | ``` 358 | 359 | ### Receiving audio 360 | 361 | When we receive audio from the WebSocket, we base64 decode it and put it in a buffer that the `speaker_audio_out_callback` above will read from on-demand. 362 | 363 | ```python 364 | elif inline_data := parts[0].get("inlineData"): 365 | mime_str = inline_data.get("mimeType") 366 | mime_type, sample_rate = re.match( 367 | r"([\w/]+);rate=(\d+)", mime_str 368 | ).groups() 369 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 370 | audio = base64.b64decode(inline_data.get("data")) 371 | self.speaker_audio_buffer.extend(audio) 372 | else: 373 | print(f"Unsupported mime type or sample rate: {mime_str}") 374 | ``` 375 | 376 | ### Sending audio 377 | 378 | When we receive audio from the mic, we call our `send_audio()` function to send it out over the WebSocket as soon as possible. We want to avoid issues with trying to use the WebSocket connection from multiple threads. So we create a task on the main thread using `self.event_loop.create_task()` to run this function. 379 | 380 | Here's `send_audio()`. We're just base64 encoding the raw audio bytes and sending them to Gemini as a `realtimeInput` message. 381 | 382 | ```python 383 | async def send_audio(self, raw_audio): 384 | payload = base64.b64encode(raw_audio).decode("utf-8") 385 | try: 386 | msg = json.dumps( 387 | { 388 | "realtimeInput": { 389 | "mediaChunks": [ 390 | { 391 | "mimeType": f"audio/pcm;rate={MIC_SAMPLE_RATE}", 392 | "data": payload, 393 | } 394 | ], 395 | }, 396 | } 397 | ) 398 | await self.ws.send(msg) 399 | except Exception as e: 400 | print(f"Exception: {e}") 401 | self.running = False 402 | ``` 403 | 404 | ### Receiving text 405 | 406 | When we receive text output from the model, we just print it out. 407 | 408 | ```python 409 | if text := parts[0].get("text"): 410 | print(f" <- {text}") 411 | ``` 412 | 413 | ### Sending text 414 | 415 | To send text to the model, we create and send a clientContent message. In this program, we always set `turnComplete` to true in this message. If we set `turnComplete` to false, the model will wait to perform inference until a clientContent message with `turnComplete` set to true is received. 416 | 417 | Note that clientContent messages are used to send text, and realtimeInput messages are used to send audio and video.
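For illustration (this snippet is not in the repo, which always sends a complete turn), a user turn can be spread across multiple clientContent messages; the model only runs inference after it sees one with `turnComplete` set to true.

```python
import json

# Hypothetical two-part turn. The first message is buffered; inference starts
# only after the second message, which sets turnComplete to true.
part_one = json.dumps({
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "I have a question about this client."}]}],
        "turnComplete": False,
    }
})
part_two = json.dumps({
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "How do I enable the search tool?"}]}],
        "turnComplete": True,
    }
})
# await self.ws.send(part_one); await self.ws.send(part_two)
```

Here's the client's actual `send_text()` implementation: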
418 | 419 | ```python 420 | async def send_text(self, text): 421 | try: 422 | print(f" -> {text}") 423 | await self.ws.send( 424 | json.dumps( 425 | { 426 | "clientContent": { 427 | "turns": [ 428 | {"parts": [{"text": text}], "role": "user"}, 429 | ], 430 | "turnComplete": True, 431 | } 432 | } 433 | ) 434 | ) 435 | except Exception as e: 436 | print(f"Exception: {e}") 437 | self.running = False 438 | ``` 439 | 440 | ### Sending video 441 | 442 | Sending video is almost exactly like sending audio. Video frames are JPEG byte strings, base64 encoded. 443 | 444 | ```python 445 | async def send_video(self, jpg_bytes): 446 | payload = base64.b64encode(jpg_bytes).decode("utf-8") 447 | try: 448 | msg = json.dumps( 449 | { 450 | "realtimeInput": { 451 | "mediaChunks": [ 452 | { 453 | "mimeType": "image/jpg", 454 | "data": payload, 455 | } 456 | ], 457 | }, 458 | } 459 | ) 460 | await self.ws.send(msg) 461 | except Exception as e: 462 | print(f"Exception: {e}") 463 | self.running = False 464 | ``` 465 | 466 | ### Built-in tools: search and code execution 467 | 468 | The operations of the built-in search and code execution tools are entirely performed on the server end of the API. Information about them is included in serverContent messages in `groundingMetadata` and `executableCode` parts. 469 | 470 | See documentation about groundingMetadata here: 471 | * https://ai.google.dev/gemini-api/docs/grounding?lang=rest 472 | 473 | We print out the title of each `groundingChunk` in our `print_evt()` function. 474 | 475 | ```python 476 | def print_evt(self, evt, response): 477 | columns, rows = shutil.get_terminal_size() 478 | maxl = columns - 5 479 | print(str(evt)[:maxl] + " ...") 480 | if grounding := evt.get("serverContent", {}).get("groundingMetadata"): 481 | for chunk in grounding.get("groundingChunks", []): 482 | print(f" <- {chunk.get("web").get("title")}") 483 | # ... 484 | ``` 485 | 486 | Similarly, see documentation for `executableCode` here: 487 | * https://ai.google.dev/gemini-api/docs/code-execution?lang=rest 488 | 489 | Note that currently the `executableCode` content parts are not sent over the WebSocket when the output modality is audio. 490 | 491 | ### Function calling 492 | 493 | For the API to call user-provided functions, you need to: 494 | 1. Provide function declarations in the `tools[]` list sent as part of the setup message at the beginning of the connection. 495 | 2. Call your function when you receive a toolCall message from the LLM. 496 | 3. Send back a toolResponse message with the data produced by your function. 497 | 498 | It's worth noting that because the LLM's request for a function call is completely decoupled from your execution of the function, you have wide latitude in how you actually perform function calls. 499 | 500 | The function declarations you provide to the LLM are descriptions of notional functions that don't actually need to exist. You could, for example, implement a toolCall handler that: 501 | - calls a remote function using an HTTP request 502 | - looks information up in a database, using the function name and arguments to create a SQL query 503 | - mocks a function for testing, returning static data that can be used as part of a test/evals suite 504 | 505 | The most common setup for user-defined functions, though, is to actually call a locally defined function that does exist! 506 | 507 | Writing function declarations by hand is a little bit laborious.
Here's the canonical `get_current_weather` function declaration that's used in lots of docs and examples. 508 | 509 | ```python 510 | tools = [ 511 | { 512 | "function_declarations": [ 513 | { 514 | "name": "get_current_weather", 515 | "description": "Get the current weather", 516 | "parameters": { 517 | "type": "object", 518 | "properties": { 519 | "location": { 520 | "type": "string", 521 | "description": "The city and state, e.g. San Francisco, CA", 522 | }, 523 | "format": { 524 | "type": "string", 525 | "enum": ["celsius", "fahrenheit"], 526 | "description": "The temperature unit to use. Infer this from the user's location.", 527 | }, 528 | }, 529 | "required": ["location", "format"], 530 | }, 531 | }, 532 | ] 533 | } 534 | ] 535 | ``` 536 | 537 | The `google-genai` SDK can automatically generate function declarations from a combination of Python introspection and docstrings. 538 | 539 | If you have an installed copy of that SDK, you can read the code for this. It's in venv/lib/python3.12/site-packages/google/genai/types.py :: `from_function()` 540 | 541 | We've written a wrapper for that code that auto-generates function declarations for all functions declared in a file. That wrapper is [function_helper.py](function_helper.py) and is imported if you specify the `--import-functions` command-line option. 542 | 543 | Note that the LLM's ability to call your function properly will depend heavily on the quality of the docstring you write for the function. 544 | 545 | You can test with, or use as a starting point, the [function-examples.py](function-examples.py) file. 546 | 547 | ``` 548 | python gemini-live.py --import-functions function-examples.py --initial-message "print the word 'hello' to the console." 549 | ``` 550 | 551 | To send function call results back to the API: 552 | - create a toolResponse message. 553 | - include the `id` and the `name` of the function that the API provided in the toolCall message. 554 | - put the function call result in the `response` field of the message. 555 | 556 | ```python 557 | async def handle_tool_call(self, tool_call): 558 | responses = [] 559 | for f in tool_call.get("functionCalls", []): 560 | print(f" <- Function call: {f}") 561 | response = await function_helper.call_function( 562 | FUNCTION_IMPORTS_MODULE, f.get("name"), **f.get("args") 563 | ) 564 | responses.append( 565 | { 566 | "id": f.get("id"), 567 | "name": f.get("name"), 568 | "response": response, 569 | } 570 | ) 571 | msg = json.dumps( 572 | { 573 | "toolResponse": { 574 | "functionResponses": responses, 575 | } 576 | } 577 | ) 578 | print(f" -> {msg}") 579 | await self.ws.send(msg) 580 | ``` 581 | 582 | ## Other resources 583 | 584 | * Multimodal Live [API overview](https://ai.google.dev/api/multimodal-live) 585 | * Gemini 2.0 [cookbook](https://github.com/google-gemini/cookbook/tree/main/gemini-2) 586 | * A [starter app](https://github.com/google-gemini/multimodal-live-api-web-console) from the Google Creative Labs team 587 | * Multimodal Live [WebRTC client app starter kit](https://github.com/pipecat-ai/gemini-multimodal-live-demo) 588 | -------------------------------------------------------------------------------- /function-examples.py: -------------------------------------------------------------------------------- 1 | def get_current_weather( 2 | location: str, 3 | ): 4 | """Returns the current weather. 5 | 6 | Args: 7 | location: The location to get the weather for.
8 | """ 9 | return { 10 | "status": "success", 11 | "response": f"The current weather in {location} is 72 degrees with scattered thunderstorms.", 12 | } 13 | 14 | 15 | def line_printer(line: str): 16 | """Prints a line to the console. 17 | 18 | Args: 19 | line: The line to print. 20 | """ 21 | print(f" \033[1m :: {line} ::\033[0m") 22 | return {"status": "success"} 23 | -------------------------------------------------------------------------------- /function_helper.py: -------------------------------------------------------------------------------- 1 | import importlib.util 2 | import inspect 3 | import types 4 | 5 | from typing import Any, Callable, Coroutine, List, Tuple, Union 6 | 7 | from google import genai 8 | 9 | from_function = genai.types.FunctionDeclaration.from_function 10 | 11 | 12 | def create_function_declarations_from_file( 13 | filename: str, 14 | ) -> Tuple[List[dict], types.ModuleType]: 15 | """ 16 | Import a Python file and return function declarations automatically generated 17 | by the google-genai SDK. Also return the module object. 18 | 19 | Args: 20 | filename (str): Path to the Python file to import 21 | 22 | Returns: 23 | tuple: (list of function declaration dicts, module object) 24 | """ 25 | 26 | functions, module = _import_functions_from_file(filename) 27 | 28 | # We need a google-genai client object to use the from_function utility method. (Mostly 29 | # because internally it does slightly different things for Google AI Studio and Google 30 | # Vertex AI connections.) 31 | client = genai.Client( 32 | http_options={ 33 | "api_version": "v1alpha", 34 | "url": "generativelanguage.googleapis.com", 35 | } 36 | ) 37 | 38 | function_declarations = [] 39 | for func in functions: 40 | function_declarations.append( 41 | from_function(client, func).model_dump(exclude_unset=True, exclude_none=True) 42 | ) 43 | 44 | return function_declarations, module 45 | 46 | 47 | async def call_function(module: Any, function_name: str, **kwargs) -> Any: 48 | """ 49 | Call a function from a module by its name, handling both sync and async 50 | functions. 51 | 52 | Args: 53 | module: The module object containing the function 54 | function_name: The name of the function to call 55 | **kwargs: Keyword arguments to pass to the function 56 | 57 | Returns: 58 | The result of the function call 59 | 60 | Raises: 61 | AttributeError: If the function doesn't exist in the module 62 | """ 63 | if not hasattr(module, function_name): 64 | raise AttributeError(f"Function '{function_name}' not found in module") 65 | 66 | func = getattr(module, function_name) 67 | if inspect.iscoroutinefunction(func): 68 | return await func(**kwargs) 69 | return func(**kwargs) 70 | 71 | 72 | def _import_functions_from_file( 73 | filename: str, 74 | ) -> Tuple[List[Union[Callable, Coroutine]], types.ModuleType]: 75 | """ 76 | Dynamically import a Python file and return a list of functions it defines, 77 | and the module object.
78 | 79 | Args: 80 | filename (str): Path to the Python file to import 81 | 82 | Returns: 83 | tuple: (list of function objects, module object) 84 | """ 85 | try: 86 | spec = importlib.util.spec_from_file_location("dynamic_module", filename) 87 | if spec is None: 88 | raise ImportError(f"Could not load spec for module: {filename}") 89 | 90 | module = importlib.util.module_from_spec(spec) 91 | if spec.loader is None: 92 | raise ImportError(f"Could not load module: {filename}") 93 | 94 | spec.loader.exec_module(module) 95 | 96 | functions = [] 97 | for name, obj in inspect.getmembers(module): 98 | # Filter for functions defined in the module 99 | if ( 100 | inspect.isfunction(obj) or inspect.iscoroutinefunction(obj) 101 | ) and obj.__module__ == module.__name__: 102 | functions.append(obj) 103 | 104 | return functions, module 105 | 106 | except Exception as e: 107 | raise ImportError(f"Error importing {filename}: {str(e)}") 108 | -------------------------------------------------------------------------------- /gemini-live.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import asyncio 3 | import base64 4 | import json 5 | import io 6 | import os 7 | import pyaudio 8 | import re 9 | import shutil 10 | import select 11 | import sys 12 | import traceback 13 | import websockets 14 | 15 | 16 | MODEL = "models/gemini-2.0-flash-exp" 17 | HOST = "generativelanguage.googleapis.com" 18 | 19 | MIC_SAMPLE_RATE = 16000 20 | SPEAKER_SAMPLE_RATE = 24000 21 | FORMAT = "S16_LE" 22 | CHANNELS = 1 23 | 24 | # 25 | # Argument parsing 26 | # 27 | 28 | SYSTEM_INSTRUCTION_TEXT = "" 29 | INITIAL_MESSAGE = "" 30 | INITIAL_MESSAGE_DELAY = 0.0 31 | VOICE = None 32 | AUDIO_INPUT = None 33 | AUDIO_OUTPUT = None 34 | TEXT_OUTPUT = None 35 | SEARCH = False 36 | CODE_EXECUTION = False 37 | SCREEN_CAPTURE_FPS = 0.0 38 | 39 | function_helper = None 40 | FUNCTION_IMPORTS_MODULE = None 41 | FUNCTION_DECLARATIONS = None 42 | 43 | 44 | def parse_args(): 45 | parser = argparse.ArgumentParser(description="Gemini Multimodal Live API client") 46 | parser.add_argument("--system-instruction", type=str, help="System instruction text") 47 | parser.add_argument( 48 | "--initial-message", type=str, help="First 'user' message to send to the model" 49 | ) 50 | parser.add_argument( 51 | "--initial-message-delay", 52 | type=float, 53 | default=0.0, 54 | help="Delay in seconds before sending the initial message", 55 | ) 56 | parser.add_argument( 57 | "--voice", 58 | default="Charon", 59 | type=str, 60 | help="Voice name. Options are Puck, Charon, Kore, Fenrir, and Aoede ", 61 | ) 62 | parser.add_argument( 63 | "--audio-input", 64 | default=True, 65 | action=argparse.BooleanOptionalAction, 66 | help="Enable/disable audio input. Default is enabled.", 67 | ) 68 | parser.add_argument( 69 | "--audio-output", 70 | default=True, 71 | action=argparse.BooleanOptionalAction, 72 | help="Enable/disable audio output. Default is enabled.", 73 | ) 74 | parser.add_argument( 75 | "--text-output", 76 | default=False, 77 | action=argparse.BooleanOptionalAction, 78 | help="Enable/disable text output. Default is disabled. Audio output and text output cannot be enabled at the same time. 
Enabling text output will disable audio output.", 79 | ) 80 | parser.add_argument( 81 | "--search", 82 | action=argparse.BooleanOptionalAction, 83 | help="Enable/disable the built-in grounded search tool.", 84 | ) 85 | parser.add_argument( 86 | "--code-execution", 87 | action=argparse.BooleanOptionalAction, 88 | help="Enable/disable the built-in code execution tool.", 89 | ) 90 | parser.add_argument( 91 | "--screen-capture-fps", 92 | type=float, 93 | default=0.0, 94 | help="Enable screen capture. Specify a frames-per-second value. For example, 1.0 for one frame per second.", 95 | ) 96 | parser.add_argument( 97 | "--import-functions", 98 | type=str, 99 | help="Import functions from a Python file. Specify a filename.", 100 | ) 101 | 102 | args = parser.parse_args() 103 | 104 | global \ 105 | SYSTEM_INSTRUCTION_TEXT, \ 106 | INITIAL_MESSAGE, \ 107 | INITIAL_MESSAGE_DELAY, \ 108 | VOICE, \ 109 | AUDIO_INPUT, \ 110 | AUDIO_OUTPUT, \ 111 | TEXT_OUTPUT, \ 112 | SEARCH, \ 113 | CODE_EXECUTION, \ 114 | SCREEN_CAPTURE_FPS, \ 115 | FUNCTION_IMPORTS_MODULE, \ 116 | FUNCTION_DECLARATIONS 117 | 118 | SYSTEM_INSTRUCTION_TEXT = args.system_instruction 119 | INITIAL_MESSAGE = args.initial_message 120 | INITIAL_MESSAGE_DELAY = args.initial_message_delay 121 | VOICE = args.voice 122 | AUDIO_INPUT = args.audio_input 123 | AUDIO_OUTPUT = args.audio_output 124 | TEXT_OUTPUT = args.text_output 125 | SEARCH = args.search 126 | CODE_EXECUTION = args.code_execution 127 | SCREEN_CAPTURE_FPS = args.screen_capture_fps 128 | 129 | if args.text_output and args.audio_output: 130 | print( 131 | "Warning: audio output and text output cannot be enabled at the same time. Disabling audio output." 132 | ) 133 | AUDIO_OUTPUT = False 134 | 135 | if args.screen_capture_fps: 136 | try: 137 | global mss, Image 138 | import mss 139 | from PIL import Image 140 | except Exception: 141 | print("Screen capture requires the mss and Pillow libraries. Install with 'pip install mss pillow'") 142 | quit() 143 | 144 | if args.import_functions: 145 | try: 146 | # Conditional import, because the function declaration helpers depend on having the 147 | # google-genai library installed. If you don't need to import functions, you don't 148 | # need to install that dependency.
149 | global function_helper 150 | import function_helper 151 | 152 | function_declarations, module = function_helper.create_function_declarations_from_file( 153 | args.import_functions 154 | ) 155 | FUNCTION_IMPORTS_MODULE = module 156 | FUNCTION_DECLARATIONS = function_declarations 157 | except Exception: 158 | print("Function import failed") 159 | print(traceback.print_exc()) 160 | quit() 161 | 162 | 163 | # 164 | # Main application class 165 | # 166 | 167 | 168 | class AudioStreamer: 169 | def __init__(self): 170 | self.running = False 171 | self.event_loop = None 172 | self.mic_audio_in = None 173 | self.speaker_audio_out = None 174 | self.speaker_audio_buffer = bytearray() 175 | self.p = pyaudio.PyAudio() 176 | 177 | def mic_audio_in_callback(self, in_data, frame_count, time_info, status): 178 | if not self.running: 179 | return (None, pyaudio.paComplete) 180 | self.event_loop.create_task(self.send_audio(in_data)) 181 | return (None, pyaudio.paContinue) 182 | 183 | def speaker_audio_out_callback(self, in_data, frame_count, time_info, status): 184 | if not self.running: 185 | return (bytes(frame_count * CHANNELS * 2), pyaudio.paComplete) 186 | audio = bytes(self.speaker_audio_buffer[: frame_count * CHANNELS * 2]) 187 | del self.speaker_audio_buffer[: frame_count * CHANNELS * 2] 188 | audio += b"\0" * (frame_count * CHANNELS * 2 - len(audio)) 189 | return (audio, pyaudio.paContinue) 190 | 191 | async def send_initial_message(self): 192 | if INITIAL_MESSAGE: 193 | await asyncio.sleep(INITIAL_MESSAGE_DELAY) 194 | await self.send_text(INITIAL_MESSAGE) 195 | 196 | async def send_text(self, text): 197 | try: 198 | print(f" -> {text}") 199 | await self.ws.send( 200 | json.dumps( 201 | { 202 | "clientContent": { 203 | "turns": [ 204 | {"parts": [{"text": text}], "role": "user"}, 205 | ], 206 | "turnComplete": True, 207 | } 208 | } 209 | ) 210 | ) 211 | except Exception as e: 212 | print(f"Exception: {e}") 213 | self.running = False 214 | 215 | async def send_audio(self, raw_audio): 216 | payload = base64.b64encode(raw_audio).decode("utf-8") 217 | try: 218 | msg = json.dumps( 219 | { 220 | "realtimeInput": { 221 | "mediaChunks": [ 222 | { 223 | "mimeType": f"audio/pcm;rate={MIC_SAMPLE_RATE}", 224 | "data": payload, 225 | } 226 | ], 227 | }, 228 | } 229 | ) 230 | await self.ws.send(msg) 231 | except Exception as e: 232 | print(f"Exception: {e}") 233 | self.running = False 234 | 235 | async def send_video(self, jpg_bytes): 236 | payload = base64.b64encode(jpg_bytes).decode("utf-8") 237 | try: 238 | msg = json.dumps( 239 | { 240 | "realtimeInput": { 241 | "mediaChunks": [ 242 | { 243 | "mimeType": "image/jpg", 244 | "data": payload, 245 | } 246 | ], 247 | }, 248 | } 249 | ) 250 | await self.ws.send(msg) 251 | except Exception as e: 252 | print(f"Exception: {e}") 253 | self.running = False 254 | 255 | async def handle_tool_call(self, tool_call): 256 | # print(f" <- handling tool call {tool_call}") 257 | responses = [] 258 | for f in tool_call.get("functionCalls", []): 259 | print(f" <- Function call: {f}") 260 | response = await function_helper.call_function( 261 | FUNCTION_IMPORTS_MODULE, f.get("name"), **f.get("args") 262 | ) 263 | responses.append( 264 | { 265 | "id": f.get("id"), 266 | "name": f.get("name"), 267 | "response": response, 268 | } 269 | ) 270 | msg = json.dumps( 271 | { 272 | "toolResponse": { 273 | "functionResponses": responses, 274 | } 275 | } 276 | ) 277 | print(f" -> {msg}") 278 | await self.ws.send(msg) 279 | 280 | async def print_audio_output_buffer_info(self): 281 | while 
self.running: 282 | if self.speaker_audio_buffer: 283 | print( 284 | f"Current audio buffer size: {len(self.speaker_audio_buffer) / (SPEAKER_SAMPLE_RATE * 2):.2f} seconds" 285 | ) 286 | await asyncio.sleep(2) 287 | 288 | def print_evt(self, evt, response): 289 | columns, rows = shutil.get_terminal_size() 290 | maxl = columns - 5 291 | print(str(evt)[:maxl] + " ...") 292 | if grounding := evt.get("serverContent", {}).get("groundingMetadata"): 293 | for chunk in grounding.get("groundingChunks", []): 294 | print(f" <- {chunk.get("web").get("title")}") 295 | if parts := evt.get("serverContent", {}).get("modelTurn", {}).get("parts"): 296 | for part in parts: 297 | if part.get("inlineData") or part.get("text"): 298 | continue 299 | print(f" <- {part}") 300 | 301 | async def stdin_worker(self): 302 | def timeout_input(timeout: float = 0.1): 303 | ready, _, _ = select.select([sys.stdin], [], [], timeout) 304 | if ready: 305 | return sys.stdin.readline().rstrip() 306 | return None 307 | 308 | while self.running: 309 | try: 310 | line = await self.event_loop.run_in_executor(None, timeout_input) 311 | if line: 312 | await self.send_text(line) 313 | except Exception as e: 314 | print(f"Exception: {e}") 315 | self.running = False 316 | 317 | async def ws_receive_worker(self): 318 | try: 319 | async for m in self.ws: 320 | if not self.running: 321 | break 322 | evt = json.loads(m) 323 | self.print_evt(evt, m) 324 | 325 | if evt.get("setupComplete", None) is not None: 326 | asyncio.create_task(self.send_initial_message()) 327 | print("Ready: say something to Gemini") 328 | if self.mic_audio_in: 329 | self.mic_audio_in.start_stream() 330 | elif sc := evt.get("serverContent"): 331 | if sc.get("interrupted"): 332 | print("Interrupted by server") 333 | self.speaker_audio_buffer.clear() 334 | continue 335 | if parts := sc.get("modelTurn", {}).get("parts"): 336 | if text := parts[0].get("text"): 337 | print(f" <- {text}") 338 | elif inline_data := parts[0].get("inlineData"): 339 | mime_str = inline_data.get("mimeType") 340 | mime_type, sample_rate = re.match( 341 | r"([\w/]+);rate=(\d+)", mime_str 342 | ).groups() 343 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 344 | audio = base64.b64decode(inline_data.get("data")) 345 | self.speaker_audio_buffer.extend(audio) 346 | else: 347 | print(f"Unsupported mime type or sample rate: {mime_str}") 348 | if code := parts[0].get("executableCode"): 349 | pass 350 | elif tool_call := evt.get("toolCall"): 351 | await self.handle_tool_call(tool_call) 352 | except Exception as e: 353 | print(f"Exception: {e}") 354 | self.running = False 355 | 356 | async def screen_capture_worker(self): 357 | print(f"Screen capture enabled at {SCREEN_CAPTURE_FPS} frames per second.") 358 | with mss.mss(with_cursor=True) as sct: 359 | # Which display to capture. 0 is the composite of all screens. 1 is the primary screen. 360 | monitor = sct.monitors[1] 361 | while self.running: 362 | try: 363 | frame = sct.grab(monitor) 364 | # print(frame) 365 | buffer = io.BytesIO() 366 | Image.frombytes("RGB", frame.size, frame.bgra, "raw", "BGRX").save( 367 | buffer, format="JPEG" 368 | ) 369 | await self.send_video(buffer.getvalue()) 370 | except Exception as e: 371 | print(f"Exception: {e}") 372 | self.running = False 373 | return 374 | await asyncio.sleep(1.0 / SCREEN_CAPTURE_FPS) 375 | 376 | async def setup_model(self): 377 | try: 378 | response_modality = [] 379 | 380 | # Currently the API will throw an error if both response modalities are requested. 
This 381 | # will likely change soon, though. 382 | if AUDIO_OUTPUT: 383 | response_modality.append("AUDIO") 384 | if TEXT_OUTPUT: 385 | response_modality.append("TEXT") 386 | 387 | setup = { 388 | "setup": { 389 | "model": MODEL, 390 | "generation_config": { 391 | "response_modalities": response_modality, 392 | "speech_config": { 393 | "voice_config": {"prebuilt_voice_config": {"voice_name": VOICE}}, 394 | }, 395 | }, 396 | "tools": [], 397 | }, 398 | } 399 | if SYSTEM_INSTRUCTION_TEXT: 400 | print("System instruction enabled") 401 | setup["setup"]["system_instruction"] = { 402 | "parts": [ 403 | { 404 | "text": SYSTEM_INSTRUCTION_TEXT, 405 | } 406 | ] 407 | } 408 | if SEARCH: 409 | print("Search enabled") 410 | setup["setup"]["tools"].append({"google_search": {}}) 411 | if CODE_EXECUTION: 412 | print("Code execution enabled") 413 | setup["setup"]["tools"].append({"code_execution": {}}) 414 | if FUNCTION_DECLARATIONS: 415 | setup["setup"]["tools"].append({"function_declarations": FUNCTION_DECLARATIONS}) 416 | print("Sending setup", setup) 417 | await self.ws.send(json.dumps(setup)) 418 | except Exception as e: 419 | print(f"Exception: {e}") 420 | 421 | async def run(self): 422 | self.event_loop = asyncio.get_event_loop() 423 | self.running = True 424 | 425 | if AUDIO_INPUT: 426 | self.mic_audio_in = self.p.open( 427 | format=pyaudio.paInt16, 428 | channels=CHANNELS, 429 | rate=MIC_SAMPLE_RATE, 430 | input=True, 431 | stream_callback=self.mic_audio_in_callback, 432 | frames_per_buffer=int(MIC_SAMPLE_RATE / 1000) * 2 * 50, # 50ms (S16_LE is 2 bytes) 433 | start=False, 434 | ) 435 | 436 | self.speaker_audio_out = self.p.open( 437 | format=pyaudio.paInt16, 438 | channels=1, 439 | rate=SPEAKER_SAMPLE_RATE, 440 | output=True, 441 | frames_per_buffer=256, 442 | stream_callback=self.speaker_audio_out_callback, 443 | ) 444 | 445 | try: 446 | self.ws = await websockets.connect( 447 | uri=f'wss://{HOST}/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={os.getenv("GOOGLE_API_KEY")}' 448 | ) 449 | print("Connected to Gemini") 450 | except Exception as e: 451 | print(f"Exception: {e}") 452 | return 453 | 454 | asyncio.create_task(self.stdin_worker()) 455 | asyncio.create_task(self.ws_receive_worker()) 456 | asyncio.create_task(self.print_audio_output_buffer_info()) 457 | if SCREEN_CAPTURE_FPS: 458 | asyncio.create_task(self.screen_capture_worker()) 459 | 460 | try: 461 | await self.setup_model() 462 | while self.running: 463 | await asyncio.sleep(1) 464 | except asyncio.CancelledError: 465 | pass 466 | except Exception as e: 467 | print(f"Exception: {e}") 468 | finally: 469 | print("Exiting ...") 470 | self.running = False 471 | sys.stdin.close() 472 | self.p.terminate() 473 | await self.ws.close() 474 | 475 | 476 | if __name__ == "__main__": 477 | parse_args() 478 | asyncio.run(AudioStreamer().run()) 479 | --------------------------------------------------------------------------------