├── README.md ├── function-examples.py ├── function_helper.py └── gemini-live.py /README.md: -------------------------------------------------------------------------------- 1 | > [!CAUTION] 2 | > This repo is no longer maintained. Check out [pipecat-examples](https://github.com/pipecat-ai/pipecat-examples) for recent demos and example code. 3 | 4 | # Gemini Multimodal Live API command-line client 5 | 6 | This is a full-featured client for the Gemini Multimodal Live API, written in Python with minimal dependencies. 7 | 8 | **Please note that you should use headphones with this client. There is no echo cancellation.** 9 | 10 | The client supports: 11 | - text, audio, and screen capture video input 12 | - text or audio output 13 | - setting the system instruction 14 | - setting an initial message with a command-line arg 15 | - the grounded search built-in tool 16 | - the code execution built-in tool 17 | - importing functions from a file and automatically generating Gemini function declarations 18 | 19 | This code uses the Multimodal Live WebSocket API directly. It was originally written as a standalone testbed to experiment with the API. In its current form, it's useful for testing API features and for sharing bits of code with other people who are developing with Gemini 2.0. 20 | 21 | The Multimodal Live API is a developer preview and is evolving rapidly. We'll try to update the implementation and docs here frequently! 22 | 23 | ## Installation 24 | 25 | ``` 26 | # Tested with Python 3.12 and venv, but other versions and virtual environments should work 27 | python3.12 -m venv venv 28 | source venv/bin/activate 29 | 30 | # install with all dependencies (Pillow is needed for screen capture) 31 | pip install pyaudio websockets google-genai mss pillow 32 | 33 | # or with only basic dependencies (no function calling or screenshare support) 34 | pip install pyaudio websockets 35 | ``` 36 | 37 | ## Usage 38 | 39 | To start a voice-to-voice session: 40 | 41 | ``` 42 | source venv/bin/activate 43 | export GOOGLE_API_KEY=... 44 | python gemini-live.py 45 | ``` 46 | 47 | Command-line arguments are described in the `--help` message. 48 | 49 | Boolean arguments follow the traditional command-line boolean flag format, e.g. `--audio-input` to enable audio input, and `--no-audio-input` to disable audio input. 50 | 51 | ``` 52 | % python gemini-live.py --help 53 | usage: gemini-live.py [-h] [--system-instruction SYSTEM_INSTRUCTION] [--initial-message INITIAL_MESSAGE] 54 | [--initial-message-delay INITIAL_MESSAGE_DELAY] [--voice VOICE] [--audio-input | --no-audio-input] 55 | [--audio-output | --no-audio-output] [--text-output | --no-text-output] [--search | --no-search] 56 | [--code-execution | --no-code-execution] [--screen-capture-fps SCREEN_CAPTURE_FPS] 57 | [--import-functions IMPORT_FUNCTIONS] 58 | 59 | Gemini Multimodal Live API client 60 | 61 | options: 62 | -h, --help show this help message and exit 63 | --system-instruction SYSTEM_INSTRUCTION 64 | System instruction text 65 | --initial-message INITIAL_MESSAGE 66 | First 'user' message to send to the model 67 | --initial-message-delay INITIAL_MESSAGE_DELAY 68 | Delay in seconds before sending the initial message 69 | --voice VOICE Voice name. Options are Puck, Charon, Kore, Fenrir, and Aoede 70 | --audio-input, --no-audio-input 71 | Enable/disable audio input. Default is enabled. 72 | --audio-output, --no-audio-output 73 | Enable/disable audio output. Default is enabled. 74 | --text-output, --no-text-output 75 | Enable/disable text output. Default is disabled.
Audio output and text output cannot be enabled 76 | at the same time. Enabling text output will disable audio output. 77 | --search, --no-search 78 | Enable/disable the built-in grounded search tool. 79 | --code-execution, --no-code-execution 80 | Enable/disable the built-in code execution tool. 81 | --screen-capture-fps SCREEN_CAPTURE_FPS 82 | Enable screen capture. Specify a frames-per-second value. For example, 1.0 for one frame per 83 | second. 84 | --import-functions IMPORT_FUNCTIONS 85 | Import functions from a Python file. Specify a filename. 86 | ``` 87 | 88 | ## Examples 89 | 90 | Set a system instruction and start the conversation by sending an initial 'user' message to Gemini. 91 | 92 | ``` 93 | python gemini-live.py --system-instruction "Talk like a pirate." --initial-message "What's the origin of the phrase, the wine-dark sea?" 94 | ``` 95 | 96 | The voice is set to Charon by default. Set the voice to Aoede. 97 | 98 | ``` 99 | python gemini-live.py --voice Aoede 100 | ``` 101 | 102 | Disable audio input. (Note: you can type to send messages to Gemini whether audio input is enabled or not. Enter (a new line) sends whatever text you've typed. No terminal line editing, history, or escape key handling is implemented. This is very bare-bones text input!) 103 | 104 | ``` 105 | python gemini-live.py --no-audio-input --initial-message 'Hello!' 106 | ``` 107 | 108 | Switch from audio to text output. Currently you have to choose between audio and text output; the two modes cannot be enabled at the same time. 109 | 110 | ``` 111 | python gemini-live.py --text-output --initial-message 'Say the alphabet from A-Z' 112 | ``` 113 | 114 | Enable the built-in search tool. 115 | 116 | ``` 117 | python gemini-live.py --search --initial-message 'Look up the current season-to-date snowpack percentage in the central Sierra Nevada mountains.' 118 | ``` 119 | 120 | Enable code execution. 121 | 122 | ``` 123 | python gemini-live.py --text-output --code-execution --initial-message 'Make up a sentence about each of these four fruits: strawberry, apple, banana, blueberry. Tell me each sentence. Then use code execution to sort the sentences by length and say each sentence again in length-sorted order.' 124 | ``` 125 | 126 | Enable video screen capture. Note that only audio-triggered inference will be able to access the video stream input. Text-triggered inference will either hallucinate or repeat previous information. 127 | 128 | ``` 129 | python gemini-live.py --screen-capture-fps 2.0 130 | ``` 131 | 132 | Import functions from a file and set the API up to use them. The code for this makes use of the very nice function declaration auto-generation in the new `google-genai` SDK. 133 | 134 | ``` 135 | python gemini-live.py --import-functions function-examples.py --initial-message "print the word 'hello' to the console." 136 | ``` 137 | 138 | ## Code walk-through 139 | 140 | The full [gemini-live.py](gemini-live.py) file is about 480 lines, but we'll skip the imports and argument parsing code. 141 | 142 | That leaves about 320 lines of code that implement all of the above features except for function calling. The function declaration and calling code is in a separate module, [function_helper.py](function_helper.py), which is about 100 lines (most of it docstrings).
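Before walking through the class, here's the session flow in miniature: open the WebSocket connection, send a `setup` message, wait for `setupComplete`, then exchange `clientContent`/`realtimeInput` messages going up and `serverContent` messages coming down. The sketch below is not code from this repo; it strips the flow down to a single text turn, reusing the message shapes shown in the sections that follow (the `turnComplete` check on `serverContent` is an assumption, since the full client simply keeps streaming).

```python
# Minimal, text-only sketch of one Multimodal Live session (not part of this repo).
import asyncio
import json
import os

import websockets

HOST = "generativelanguage.googleapis.com"
MODEL = "models/gemini-2.0-flash-exp"


async def minimal_session():
    uri = (
        f"wss://{HOST}/ws/google.ai.generativelanguage.v1alpha."
        f"GenerativeService.BidiGenerateContent?key={os.getenv('GOOGLE_API_KEY')}"
    )
    async with websockets.connect(uri) as ws:
        # 1. Configure the session: text output, no tools.
        await ws.send(json.dumps({
            "setup": {
                "model": MODEL,
                "generation_config": {"response_modalities": ["TEXT"]},
            }
        }))
        # 2. Wait for setupComplete before sending anything else.
        while "setupComplete" not in json.loads(await ws.recv()):
            pass
        # 3. Send one complete user turn ...
        await ws.send(json.dumps({
            "clientContent": {
                "turns": [{"role": "user", "parts": [{"text": "Hello!"}]}],
                "turnComplete": True,
            }
        }))
        # 4. ... and print streamed text parts as they arrive.
        async for raw in ws:
            sc = json.loads(raw).get("serverContent", {})
            for part in sc.get("modelTurn", {}).get("parts", []):
                if "text" in part:
                    print(part["text"], end="", flush=True)
            if sc.get("turnComplete"):  # assumed field; not used by the full client
                break


if __name__ == "__main__":
    asyncio.run(minimal_session())
```

The real client layers audio I/O, screen capture, tools, and cleanup on top of this skeleton, which is what the rest of the walk-through covers.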
143 | 144 | ### `__main__` and the run loop 145 | 146 | This program needs to perform several long-running tasks: 147 | 148 | - receive WebSocket messages 149 | - send audio frames for playout 150 | - read audio frames from the OS and send them out via the WebSocket 151 | - read text from stdin and send lines of text out via the WebSocket 152 | - if screen-capture video is enabled, capture frames and send them out via the WebSocket 153 | 154 | There are a number of ways to design this in Python. This program uses threads for OS audio input, OS audio playout, and reading text from stdin. For everything else, asyncio tasks are used. 155 | 156 | ```python 157 | class AudioStreamer: 158 | def __init__(self): 159 | self.running = False 160 | self.event_loop = None 161 | self.mic_audio_in = None 162 | self.speaker_audio_out = None 163 | self.speaker_audio_buffer = bytearray() 164 | self.p = pyaudio.PyAudio() 165 | 166 | async def run(self): 167 | self.event_loop = asyncio.get_event_loop() 168 | self.running = True 169 | 170 | if AUDIO_INPUT: 171 | self.mic_audio_in = self.p.open( 172 | format=pyaudio.paInt16, 173 | channels=CHANNELS, 174 | rate=MIC_SAMPLE_RATE, 175 | input=True, 176 | stream_callback=self.mic_audio_in_callback, 177 | frames_per_buffer=int(MIC_SAMPLE_RATE / 1000) * 2 * 50, # 50ms (S16_LE is 2 bytes) 178 | start=False, 179 | ) 180 | 181 | self.speaker_audio_out = self.p.open( 182 | format=pyaudio.paInt16, 183 | channels=1, 184 | rate=SPEAKER_SAMPLE_RATE, 185 | output=True, 186 | frames_per_buffer=256, 187 | stream_callback=self.speaker_audio_out_callback, 188 | ) 189 | 190 | try: 191 | self.ws = await websockets.connect( 192 | uri=f'wss://{HOST}/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={os.getenv("GOOGLE_API_KEY")}' 193 | ) 194 | print("Connected to Gemini") 195 | except Exception as e: 196 | print(f"Exception: {e}") 197 | return 198 | 199 | asyncio.create_task(self.stdin_worker()) 200 | asyncio.create_task(self.ws_receive_worker()) 201 | asyncio.create_task(self.print_audio_output_buffer_info()) 202 | if SCREEN_CAPTURE_FPS: 203 | asyncio.create_task(self.screen_capture_worker()) 204 | 205 | try: 206 | await self.setup_model() 207 | while self.running: 208 | await asyncio.sleep(1) 209 | except asyncio.CancelledError: 210 | pass 211 | except Exception as e: 212 | print(f"Exception: {e}") 213 | finally: 214 | print("Exiting ...") 215 | self.running = False 216 | sys.stdin.close() 217 | self.p.terminate() 218 | await self.ws.close() 219 | 220 | 221 | if __name__ == "__main__": 222 | parse_args() 223 | asyncio.run(AudioStreamer().run()) 224 | ``` 225 | 226 | The `run()` function sets everything up and then loops until either ctrl-c or some piece of code sets `self.running` to false. 227 | 228 | ### The `setup` message — sending a configuration to the API 229 | 230 | The first message we send after opening the WebSocket connection configures various features of the API. 231 | 232 | ```python 233 | async def setup_model(self): 234 | try: 235 | response_modality = [] 236 | 237 | # Currently the API will throw an error if both response modalities are requested. This 238 | # will likely change soon, though. 
239 | if AUDIO_OUTPUT: 240 | response_modality.append("AUDIO") 241 | if TEXT_OUTPUT: 242 | response_modality.append("TEXT") 243 | 244 | setup = { 245 | "setup": { 246 | "model": MODEL, 247 | "generation_config": { 248 | "response_modalities": response_modality, 249 | "speech_config": { 250 | "voice_config": {"prebuilt_voice_config": {"voice_name": VOICE}}, 251 | }, 252 | }, 253 | "tools": [], 254 | }, 255 | } 256 | if SYSTEM_INSTRUCTION_TEXT: 257 | print("System instruction enabled") 258 | setup["setup"]["system_instruction"] = { 259 | "parts": [ 260 | { 261 | "text": SYSTEM_INSTRUCTION_TEXT, 262 | } 263 | ] 264 | } 265 | if SEARCH: 266 | print("Search enabled") 267 | setup["setup"]["tools"].append({"google_search": {}}) 268 | if CODE_EXECUTION: 269 | print("Code execution enabled") 270 | setup["setup"]["tools"].append({"code_execution": {}}) 271 | if FUNCTION_DECLARATIONS: 272 | setup["setup"]["tools"].append({"function_declarations": FUNCTION_DECLARATIONS}) 273 | print("Sending setup", setup) 274 | await self.ws.send(json.dumps(setup)) 275 | except Exception as e: 276 | print(f"Exception: {e}") 277 | ``` 278 | 279 | ### Receiving events from the WebSocket 280 | 281 | To receive audio output, text output, and events, we need to read from the WebSocket. 282 | 283 | ```python 284 | async def ws_receive_worker(self): 285 | try: 286 | async for m in self.ws: 287 | if not self.running: 288 | break 289 | evt = json.loads(m) 290 | self.print_evt(evt, m) 291 | 292 | if evt.get("setupComplete", None) is not None: 293 | asyncio.create_task(self.send_initial_message()) 294 | print("Ready: say something to Gemini") 295 | if self.mic_audio_in: 296 | self.mic_audio_in.start_stream() 297 | elif sc := evt.get("serverContent"): 298 | if sc.get("interrupted"): 299 | print("Interrupted by server") 300 | self.speaker_audio_buffer.clear() 301 | continue 302 | if parts := sc.get("modelTurn", {}).get("parts"): 303 | if text := parts[0].get("text"): 304 | print(f" <- {text}") 305 | elif inline_data := parts[0].get("inlineData"): 306 | mime_str = inline_data.get("mimeType") 307 | mime_type, sample_rate = re.match( 308 | r"([\w/]+);rate=(\d+)", mime_str 309 | ).groups() 310 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 311 | audio = base64.b64decode(inline_data.get("data")) 312 | self.speaker_audio_buffer.extend(audio) 313 | else: 314 | print(f"Unsupported mime type or sample rate: {mime_str}") 315 | if code := parts[0].get("executableCode"): 316 | pass 317 | elif tool_call := evt.get("toolCall"): 318 | await self.handle_tool_call(tool_call) 319 | except Exception as e: 320 | print(f"Exception: {e}") 321 | self.running = False 322 | ``` 323 | 324 | We recognize 3 types of messages: 325 | 326 | - setupComplete 327 | - serverContent 328 | - toolCall 329 | 330 | The Multimodal Live API transmits errors as WebSocket status codes (with messages where applicable). Generally, an error will close the connection. When this happens, we set `self.running` to False to begin the process of exiting our run loop. 331 | 332 | The serverContent messages can: 333 | - contain audio or text output from the model 334 | - indicate that the API has detected user speech and `interrupted` the current output 335 | - contain information about code execution 336 | 337 | ### Audio 338 | 339 | The PyAudio `open()` calls set up mic audio input and speaker audio output in callback mode, which uses threads internally.
We request 256 `frames_per_buffer` for speaker output and roughly 50 ms chunks for mic input, which is a reasonable balance between latency and robustness to stalling audio input/output. (Even when using callback mode, activity on the Python main thread can impact audio performance.) 340 | 341 | Here are our audio callback functions. 342 | 343 | ```python 344 | def mic_audio_in_callback(self, in_data, frame_count, time_info, status): 345 | if not self.running: 346 | return (None, pyaudio.paComplete) 347 | self.event_loop.create_task(self.send_audio(in_data)) 348 | return (None, pyaudio.paContinue) 349 | 350 | def speaker_audio_out_callback(self, in_data, frame_count, time_info, status): 351 | if not self.running: 352 | return (bytes(frame_count * CHANNELS * 2), pyaudio.paComplete) 353 | audio = bytes(self.speaker_audio_buffer[: frame_count * CHANNELS * 2]) 354 | del self.speaker_audio_buffer[: frame_count * CHANNELS * 2] 355 | audio += b"\0" * (frame_count * CHANNELS * 2 - len(audio)) 356 | return (audio, pyaudio.paContinue) 357 | ``` 358 | 359 | ### Receiving audio 360 | 361 | When we receive audio from the WebSocket, we base64 decode it and put it in a buffer that the `speaker_audio_out_callback` above will read from on-demand. 362 | 363 | ```python 364 | elif inline_data := parts[0].get("inlineData"): 365 | mime_str = inline_data.get("mimeType") 366 | mime_type, sample_rate = re.match( 367 | r"([\w/]+);rate=(\d+)", mime_str 368 | ).groups() 369 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 370 | audio = base64.b64decode(inline_data.get("data")) 371 | self.speaker_audio_buffer.extend(audio) 372 | else: 373 | print(f"Unsupported mime type or sample rate: {mime_str}") 374 | ``` 375 | 376 | ### Sending audio 377 | 378 | When we receive audio from the mic, we call our `send_audio()` function to send it out over the WebSocket as soon as possible. We want to avoid issues with trying to use the WebSocket connection from multiple threads. So we create a task on the main thread using `self.event_loop.create_task()` to run this function. 379 | 380 | Here's `send_audio()`. We're just base64 encoding the raw audio bytes and sending them to Gemini as a `realtimeInput` message. 381 | 382 | ```python 383 | async def send_audio(self, raw_audio): 384 | payload = base64.b64encode(raw_audio).decode("utf-8") 385 | try: 386 | msg = json.dumps( 387 | { 388 | "realtimeInput": { 389 | "mediaChunks": [ 390 | { 391 | "mimeType": f"audio/pcm;rate={MIC_SAMPLE_RATE}", 392 | "data": payload, 393 | } 394 | ], 395 | }, 396 | } 397 | ) 398 | await self.ws.send(msg) 399 | except Exception as e: 400 | print(f"Exception: {e}") 401 | self.running = False 402 | ``` 403 | 404 | ### Receiving text 405 | 406 | When we receive text output from the model, we just print it out. 407 | 408 | ```python 409 | if text := parts[0].get("text"): 410 | print(f" <- {text}") 411 | ``` 412 | 413 | ### Sending text 414 | 415 | To send text to the model, we create and send a clientContent message. In this program, we always set `turnComplete` to true in this message. If we set `turnComplete` to false, the model will wait to perform inference until a clientContent message with `turnComplete` set to true is received. 416 | 417 | Note that clientContent messages are used to send text, and realtimeInput messages are used to send audio and video.
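For illustration (this snippet is not in the repo, which always sends a complete turn), a user turn can be spread across multiple clientContent messages; the model only runs inference after it sees one with `turnComplete` set to true.

```python
import json

# Hypothetical two-part turn. The first message is buffered; inference starts
# only after the second message, which sets turnComplete to true.
part_one = json.dumps({
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "I have a question about this client."}]}],
        "turnComplete": False,
    }
})
part_two = json.dumps({
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "How do I enable the search tool?"}]}],
        "turnComplete": True,
    }
})
# await self.ws.send(part_one); await self.ws.send(part_two)
```

Here's the client's actual `send_text()` implementation: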
418 | 419 | ```python 420 | async def send_text(self, text): 421 | try: 422 | print(f" -> {text}") 423 | await self.ws.send( 424 | json.dumps( 425 | { 426 | "clientContent": { 427 | "turns": [ 428 | {"parts": [{"text": text}], "role": "user"}, 429 | ], 430 | "turnComplete": True, 431 | } 432 | } 433 | ) 434 | ) 435 | except Exception as e: 436 | print(f"Exception: {e}") 437 | self.running = False 438 | ``` 439 | 440 | ### Sending video 441 | 442 | Sending video is almost exactly like sending audio. Video frames are JPEG byte strings, base64 encoded. 443 | 444 | ```python 445 | async def send_video(self, jpg_bytes): 446 | payload = base64.b64encode(jpg_bytes).decode("utf-8") 447 | try: 448 | msg = json.dumps( 449 | { 450 | "realtimeInput": { 451 | "mediaChunks": [ 452 | { 453 | "mimeType": "image/jpg", 454 | "data": payload, 455 | } 456 | ], 457 | }, 458 | } 459 | ) 460 | await self.ws.send(msg) 461 | except Exception as e: 462 | print(f"Exception: {e}") 463 | self.running = False 464 | ``` 465 | 466 | ### Built-in tools: search and code execution 467 | 468 | The operations of the built-in search and code execution tools are entirely performed on the server end of the API. Information about them is included in serverContent messages in `groundingMetadata` and `executableCode` parts. 469 | 470 | See documentation about groundingMetadata here: 471 | * https://ai.google.dev/gemini-api/docs/grounding?lang=rest 472 | 473 | We print out the title of each `groundingChunk` in our `print_evt()` function. 474 | 475 | ```python 476 | def print_evt(self, evt, response): 477 | columns, rows = shutil.get_terminal_size() 478 | maxl = columns - 5 479 | print(str(evt)[:maxl] + " ...") 480 | if grounding := evt.get("serverContent", {}).get("groundingMetadata"): 481 | for chunk in grounding.get("groundingChunks", []): 482 | print(f" <- {chunk.get("web").get("title")}") 483 | # ... 484 | ``` 485 | 486 | Similarly, see documentation for `executableCode` here: 487 | * https://ai.google.dev/gemini-api/docs/code-execution?lang=rest 488 | 489 | Note that currently the `executableCode` content parts are not sent over the WebSocket when the output modality is audio. 490 | 491 | ### Function calling 492 | 493 | For the API to call user-provided functions, you need to: 494 | 1. Provide function declarations in the `tools[]` list sent as part of the setup message at the beginning of the connection. 495 | 2. Call your function when you receive a toolCall message from the LLM. 496 | 3. Send back a toolResponse message with the data produced by your function. 497 | 498 | It's worth noting that because the LLM's request for a function call is completely decoupled from your execution of the function, you have wide latitude in how you actually perform function calls. 499 | 500 | The function declarations you provide to the LLM are descriptions of notional functions that don't actually need to exist. You could, for example, implement a toolCall handler that: 501 | - calls a remote function using an HTTP request 502 | - looks information up in a database, using the function name and arguments to create a SQL query 503 | - mocks a function for testing, returning static data that can be used as part of a test/evals suite 504 | 505 | The most common setup for user-defined functions, though, is to actually call a locally defined function that does exist! 506 | 507 | Writing function declarations by hand is a little bit laborious.
Here's the canonical `get_current_weather` function declaration that's used in lots of docs and examples. 508 | 509 | ```python 510 | tools = [ 511 | { 512 | "function_declarations": [ 513 | { 514 | "name": "get_current_weather", 515 | "description": "Get the current weather", 516 | "parameters": { 517 | "type": "object", 518 | "properties": { 519 | "location": { 520 | "type": "string", 521 | "description": "The city and state, e.g. San Francisco, CA", 522 | }, 523 | "format": { 524 | "type": "string", 525 | "enum": ["celsius", "fahrenheit"], 526 | "description": "The temperature unit to use. Infer this from the user's location.", 527 | }, 528 | }, 529 | "required": ["location", "format"], 530 | }, 531 | }, 532 | ] 533 | } 534 | ] 535 | ``` 536 | 537 | The `google-genai` SDK can automatically generate function declarations from a combination of Python introspection and docstrings. 538 | 539 | If you have an installed copy of that SDK, you can read the code for this. It's in venv/lib/python3.12/site-packages/google/genai/types.py :: `from_function()` 540 | 541 | We've written a wrapper for that code that auto-generates function declarations for all functions declared in a file. That wrapper is [function_helper.py](function_helper.py) and is imported if you specify the `--import-functions` command-line option. 542 | 543 | Note that the LLM's ability to call your function properly will depend heavily on the quality of the docstring you write for the function. 544 | 545 | You can test with, or use as a starting point, the [function-examples.py](function-examples.py) file. 546 | 547 | ``` 548 | python gemini-live.py --import-functions function-examples.py --initial-message "print the word 'hello' to the console." 549 | ``` 550 | 551 | To send function call results back to the API: 552 | - create a toolResponse message. 553 | - include the `id` and the `name` of the function that the API provided in the toolCall message. 554 | - put the function call result in the `response` field of the message. 555 | 556 | ```python 557 | async def handle_tool_call(self, tool_call): 558 | responses = [] 559 | for f in tool_call.get("functionCalls", []): 560 | print(f" <- Function call: {f}") 561 | response = await function_helper.call_function( 562 | FUNCTION_IMPORTS_MODULE, f.get("name"), **f.get("args") 563 | ) 564 | responses.append( 565 | { 566 | "id": f.get("id"), 567 | "name": f.get("name"), 568 | "response": response, 569 | } 570 | ) 571 | msg = json.dumps( 572 | { 573 | "toolResponse": { 574 | "functionResponses": responses, 575 | } 576 | } 577 | ) 578 | print(f" -> {msg}") 579 | await self.ws.send(msg) 580 | ``` 581 | 582 | ## Other resources 583 | 584 | * Multimodal Live [API overview](https://ai.google.dev/api/multimodal-live) 585 | * Gemini 2.0 [cookbook](https://github.com/google-gemini/cookbook/tree/main/gemini-2) 586 | * A [starter app](https://github.com/google-gemini/multimodal-live-api-web-console) from the Google Creative Labs team 587 | * Multimodal Live [WebRTC client app starter kit](https://github.com/pipecat-ai/gemini-multimodal-live-demo) 588 | -------------------------------------------------------------------------------- /function-examples.py: -------------------------------------------------------------------------------- 1 | def get_current_weather( 2 | location: str, 3 | ): 4 | """Returns the current weather. 5 | 6 | Args: 7 | location: The location to get the weather for.
8 | """ 9 | return { 10 | "status": "success", 11 | "response": f"The current weather in {location} is 72 degrees with scattered thunderstorms.", 12 | } 13 | 14 | 15 | def line_printer(line: str): 16 | """Prints a line to the console. 17 | 18 | Args: 19 | line: The line to print. 20 | """ 21 | print(f" \033[1m :: {line} ::\033[0m") 22 | return {"status": "success"} 23 | -------------------------------------------------------------------------------- /function_helper.py: -------------------------------------------------------------------------------- 1 | import importlib.util 2 | import inspect 3 | import types 4 | 5 | from typing import Any, Callable, Coroutine, List, Tuple, Union 6 | 7 | from google import genai 8 | 9 | from_function = genai.types.FunctionDeclaration.from_function 10 | 11 | 12 | def create_function_declarations_from_file( 13 | filename: str, 14 | ) -> Tuple[List[dict], types.ModuleType]: 15 | """ 16 | Import a Python file and return function declarations automatically generated 17 | by the google-genai SDK. Also return the module object. 18 | 19 | Args: 20 | filename (str): Path to the Python file to import 21 | 22 | Returns: 23 | tuple: (list of function declaration dicts, module object) 24 | """ 25 | 26 | functions, module = _import_functions_from_file(filename) 27 | 28 | # We need a google-genai client object to use the from_function utility method. (Mostly 29 | # because internally it does slightly different things for Google AI Studio and Google 30 | # Vertex AI connections.) 31 | client = genai.Client( 32 | http_options={ 33 | "api_version": "v1alpha", 34 | "url": "generativelanguage.googleapis.com", 35 | } 36 | ) 37 | 38 | function_declarations = [] 39 | for func in functions: 40 | function_declarations.append( 41 | from_function(client, func).model_dump(exclude_unset=True, exclude_none=True) 42 | ) 43 | 44 | return function_declarations, module 45 | 46 | 47 | async def call_function(module: Any, function_name: str, **kwargs) -> Any: 48 | """ 49 | Call a function from a module by its name, handling both sync and async 50 | functions. 51 | 52 | Args: 53 | module: The module object containing the function 54 | function_name: The name of the function to call 55 | **kwargs: Keyword arguments to pass to the function 56 | 57 | Returns: 58 | The result of the function call 59 | 60 | Raises: 61 | AttributeError: If the function doesn't exist in the module 62 | """ 63 | if not hasattr(module, function_name): 64 | raise AttributeError(f"Function '{function_name}' not found in module") 65 | 66 | func = getattr(module, function_name) 67 | if inspect.iscoroutinefunction(func): 68 | return await func(**kwargs) 69 | return func(**kwargs) 70 | 71 | 72 | def _import_functions_from_file( 73 | filename: str, 74 | ) -> Tuple[List[Union[Callable, Coroutine]], types.ModuleType]: 75 | """ 76 | Dynamically import a Python file and return a list of functions it defines, 77 | and the module object.
78 | 79 | Args: 80 | filename (str): Path to the Python file to import 81 | 82 | Returns: 83 | tuple: (list of function objects, module object) 84 | """ 85 | try: 86 | spec = importlib.util.spec_from_file_location("dynamic_module", filename) 87 | if spec is None: 88 | raise ImportError(f"Could not load spec for module: {filename}") 89 | 90 | module = importlib.util.module_from_spec(spec) 91 | if spec.loader is None: 92 | raise ImportError(f"Could not load module: {filename}") 93 | 94 | spec.loader.exec_module(module) 95 | 96 | functions = [] 97 | for name, obj in inspect.getmembers(module): 98 | # Filter for functions defined in the module 99 | if ( 100 | inspect.isfunction(obj) or inspect.iscoroutinefunction(obj) 101 | ) and obj.__module__ == module.__name__: 102 | functions.append(obj) 103 | 104 | return functions, module 105 | 106 | except Exception as e: 107 | raise ImportError(f"Error importing {filename}: {str(e)}") 108 | -------------------------------------------------------------------------------- /gemini-live.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import asyncio 3 | import base64 4 | import json 5 | import io 6 | import os 7 | import pyaudio 8 | import re 9 | import shutil 10 | import select 11 | import sys 12 | import traceback 13 | import websockets 14 | 15 | 16 | MODEL = "models/gemini-2.0-flash-exp" 17 | HOST = "generativelanguage.googleapis.com" 18 | 19 | MIC_SAMPLE_RATE = 16000 20 | SPEAKER_SAMPLE_RATE = 24000 21 | FORMAT = "S16_LE" 22 | CHANNELS = 1 23 | 24 | # 25 | # Argument parsing 26 | # 27 | 28 | SYSTEM_INSTRUCTION_TEXT = "" 29 | INITIAL_MESSAGE = "" 30 | INITIAL_MESSAGE_DELAY = 0.0 31 | VOICE = None 32 | AUDIO_INPUT = None 33 | AUDIO_OUTPUT = None 34 | TEXT_OUTPUT = None 35 | SEARCH = False 36 | CODE_EXECUTION = False 37 | SCREEN_CAPTURE_FPS = 0.0 38 | 39 | function_helper = None 40 | FUNCTION_IMPORTS_MODULE = None 41 | FUNCTION_DECLARATIONS = None 42 | 43 | 44 | def parse_args(): 45 | parser = argparse.ArgumentParser(description="Gemini Multimodal Live API client") 46 | parser.add_argument("--system-instruction", type=str, help="System instruction text") 47 | parser.add_argument( 48 | "--initial-message", type=str, help="First 'user' message to send to the model" 49 | ) 50 | parser.add_argument( 51 | "--initial-message-delay", 52 | type=float, 53 | default=0.0, 54 | help="Delay in seconds before sending the initial message", 55 | ) 56 | parser.add_argument( 57 | "--voice", 58 | default="Charon", 59 | type=str, 60 | help="Voice name. Options are Puck, Charon, Kore, Fenrir, and Aoede ", 61 | ) 62 | parser.add_argument( 63 | "--audio-input", 64 | default=True, 65 | action=argparse.BooleanOptionalAction, 66 | help="Enable/disable audio input. Default is enabled.", 67 | ) 68 | parser.add_argument( 69 | "--audio-output", 70 | default=True, 71 | action=argparse.BooleanOptionalAction, 72 | help="Enable/disable audio output. Default is enabled.", 73 | ) 74 | parser.add_argument( 75 | "--text-output", 76 | default=False, 77 | action=argparse.BooleanOptionalAction, 78 | help="Enable/disable text output. Default is disabled. Audio output and text output cannot be enabled at the same time. 
Enabling text output will disable audio output.", 79 | ) 80 | parser.add_argument( 81 | "--search", 82 | action=argparse.BooleanOptionalAction, 83 | help="Enable/disable the built-in grounded search tool.", 84 | ) 85 | parser.add_argument( 86 | "--code-execution", 87 | action=argparse.BooleanOptionalAction, 88 | help="Enable/disable the built-in code execution tool.", 89 | ) 90 | parser.add_argument( 91 | "--screen-capture-fps", 92 | type=float, 93 | default=0.0, 94 | help="Enable screen capture. Specify a frames-per-second value. For example, 1.0 for one frame per second.", 95 | ) 96 | parser.add_argument( 97 | "--import-functions", 98 | type=str, 99 | help="Import functions from a Python file. Specify a filename.", 100 | ) 101 | 102 | args = parser.parse_args() 103 | 104 | global \ 105 | SYSTEM_INSTRUCTION_TEXT, \ 106 | INITIAL_MESSAGE, \ 107 | INITIAL_MESSAGE_DELAY, \ 108 | VOICE, \ 109 | AUDIO_INPUT, \ 110 | AUDIO_OUTPUT, \ 111 | TEXT_OUTPUT, \ 112 | SEARCH, \ 113 | CODE_EXECUTION, \ 114 | SCREEN_CAPTURE_FPS, \ 115 | FUNCTION_IMPORTS_MODULE, \ 116 | FUNCTION_DECLARATIONS 117 | 118 | SYSTEM_INSTRUCTION_TEXT = args.system_instruction 119 | INITIAL_MESSAGE = args.initial_message 120 | INITIAL_MESSAGE_DELAY = args.initial_message_delay 121 | VOICE = args.voice 122 | AUDIO_INPUT = args.audio_input 123 | AUDIO_OUTPUT = args.audio_output 124 | TEXT_OUTPUT = args.text_output 125 | SEARCH = args.search 126 | CODE_EXECUTION = args.code_execution 127 | SCREEN_CAPTURE_FPS = args.screen_capture_fps 128 | 129 | if args.text_output and args.audio_output: 130 | print( 131 | "Warning: audio output and text output cannot be enabled at the same time. Disabling audio output." 132 | ) 133 | AUDIO_OUTPUT = False 134 | 135 | if args.screen_capture_fps: 136 | try: 137 | global mss, Image 138 | import mss 139 | from PIL import Image 140 | except Exception: 141 | print("Screen capture requires the mss and Pillow libraries. Install with 'pip install mss pillow'") 142 | quit() 143 | 144 | if args.import_functions: 145 | try: 146 | # Conditional import, because the function declaration helpers depend on having the 147 | # google-genai library installed. If you don't need to import functions, you don't 148 | # need to install that dependency.
149 | global function_helper 150 | import function_helper 151 | 152 | function_declarations, module = function_helper.create_function_declarations_from_file( 153 | args.import_functions 154 | ) 155 | FUNCTION_IMPORTS_MODULE = module 156 | FUNCTION_DECLARATIONS = function_declarations 157 | except Exception: 158 | print("Function import failed") 159 | print(traceback.print_exc()) 160 | quit() 161 | 162 | 163 | # 164 | # Main application class 165 | # 166 | 167 | 168 | class AudioStreamer: 169 | def __init__(self): 170 | self.running = False 171 | self.event_loop = None 172 | self.mic_audio_in = None 173 | self.speaker_audio_out = None 174 | self.speaker_audio_buffer = bytearray() 175 | self.p = pyaudio.PyAudio() 176 | 177 | def mic_audio_in_callback(self, in_data, frame_count, time_info, status): 178 | if not self.running: 179 | return (None, pyaudio.paComplete) 180 | self.event_loop.create_task(self.send_audio(in_data)) 181 | return (None, pyaudio.paContinue) 182 | 183 | def speaker_audio_out_callback(self, in_data, frame_count, time_info, status): 184 | if not self.running: 185 | return (bytes(frame_count * CHANNELS * 2), pyaudio.paComplete) 186 | audio = bytes(self.speaker_audio_buffer[: frame_count * CHANNELS * 2]) 187 | del self.speaker_audio_buffer[: frame_count * CHANNELS * 2] 188 | audio += b"\0" * (frame_count * CHANNELS * 2 - len(audio)) 189 | return (audio, pyaudio.paContinue) 190 | 191 | async def send_initial_message(self): 192 | if INITIAL_MESSAGE: 193 | await asyncio.sleep(INITIAL_MESSAGE_DELAY) 194 | await self.send_text(INITIAL_MESSAGE) 195 | 196 | async def send_text(self, text): 197 | try: 198 | print(f" -> {text}") 199 | await self.ws.send( 200 | json.dumps( 201 | { 202 | "clientContent": { 203 | "turns": [ 204 | {"parts": [{"text": text}], "role": "user"}, 205 | ], 206 | "turnComplete": True, 207 | } 208 | } 209 | ) 210 | ) 211 | except Exception as e: 212 | print(f"Exception: {e}") 213 | self.running = False 214 | 215 | async def send_audio(self, raw_audio): 216 | payload = base64.b64encode(raw_audio).decode("utf-8") 217 | try: 218 | msg = json.dumps( 219 | { 220 | "realtimeInput": { 221 | "mediaChunks": [ 222 | { 223 | "mimeType": f"audio/pcm;rate={MIC_SAMPLE_RATE}", 224 | "data": payload, 225 | } 226 | ], 227 | }, 228 | } 229 | ) 230 | await self.ws.send(msg) 231 | except Exception as e: 232 | print(f"Exception: {e}") 233 | self.running = False 234 | 235 | async def send_video(self, jpg_bytes): 236 | payload = base64.b64encode(jpg_bytes).decode("utf-8") 237 | try: 238 | msg = json.dumps( 239 | { 240 | "realtimeInput": { 241 | "mediaChunks": [ 242 | { 243 | "mimeType": "image/jpg", 244 | "data": payload, 245 | } 246 | ], 247 | }, 248 | } 249 | ) 250 | await self.ws.send(msg) 251 | except Exception as e: 252 | print(f"Exception: {e}") 253 | self.running = False 254 | 255 | async def handle_tool_call(self, tool_call): 256 | # print(f" <- handling tool call {tool_call}") 257 | responses = [] 258 | for f in tool_call.get("functionCalls", []): 259 | print(f" <- Function call: {f}") 260 | response = await function_helper.call_function( 261 | FUNCTION_IMPORTS_MODULE, f.get("name"), **f.get("args") 262 | ) 263 | responses.append( 264 | { 265 | "id": f.get("id"), 266 | "name": f.get("name"), 267 | "response": response, 268 | } 269 | ) 270 | msg = json.dumps( 271 | { 272 | "toolResponse": { 273 | "functionResponses": responses, 274 | } 275 | } 276 | ) 277 | print(f" -> {msg}") 278 | await self.ws.send(msg) 279 | 280 | async def print_audio_output_buffer_info(self): 281 | while 
self.running: 282 | if self.speaker_audio_buffer: 283 | print( 284 | f"Current audio buffer size: {len(self.speaker_audio_buffer) / (SPEAKER_SAMPLE_RATE * 2):.2f} seconds" 285 | ) 286 | await asyncio.sleep(2) 287 | 288 | def print_evt(self, evt, response): 289 | columns, rows = shutil.get_terminal_size() 290 | maxl = columns - 5 291 | print(str(evt)[:maxl] + " ...") 292 | if grounding := evt.get("serverContent", {}).get("groundingMetadata"): 293 | for chunk in grounding.get("groundingChunks", []): 294 | print(f" <- {chunk.get("web").get("title")}") 295 | if parts := evt.get("serverContent", {}).get("modelTurn", {}).get("parts"): 296 | for part in parts: 297 | if part.get("inlineData") or part.get("text"): 298 | continue 299 | print(f" <- {part}") 300 | 301 | async def stdin_worker(self): 302 | def timeout_input(timeout: float = 0.1): 303 | ready, _, _ = select.select([sys.stdin], [], [], timeout) 304 | if ready: 305 | return sys.stdin.readline().rstrip() 306 | return None 307 | 308 | while self.running: 309 | try: 310 | line = await self.event_loop.run_in_executor(None, timeout_input) 311 | if line: 312 | await self.send_text(line) 313 | except Exception as e: 314 | print(f"Exception: {e}") 315 | self.running = False 316 | 317 | async def ws_receive_worker(self): 318 | try: 319 | async for m in self.ws: 320 | if not self.running: 321 | break 322 | evt = json.loads(m) 323 | self.print_evt(evt, m) 324 | 325 | if evt.get("setupComplete", None) is not None: 326 | asyncio.create_task(self.send_initial_message()) 327 | print("Ready: say something to Gemini") 328 | if self.mic_audio_in: 329 | self.mic_audio_in.start_stream() 330 | elif sc := evt.get("serverContent"): 331 | if sc.get("interrupted"): 332 | print("Interrupted by server") 333 | self.speaker_audio_buffer.clear() 334 | continue 335 | if parts := sc.get("modelTurn", {}).get("parts"): 336 | if text := parts[0].get("text"): 337 | print(f" <- {text}") 338 | elif inline_data := parts[0].get("inlineData"): 339 | mime_str = inline_data.get("mimeType") 340 | mime_type, sample_rate = re.match( 341 | r"([\w/]+);rate=(\d+)", mime_str 342 | ).groups() 343 | if mime_type == "audio/pcm" and sample_rate == str(SPEAKER_SAMPLE_RATE): 344 | audio = base64.b64decode(inline_data.get("data")) 345 | self.speaker_audio_buffer.extend(audio) 346 | else: 347 | print(f"Unsupported mime type or sample rate: {mime_str}") 348 | if code := parts[0].get("executableCode"): 349 | pass 350 | elif tool_call := evt.get("toolCall"): 351 | await self.handle_tool_call(tool_call) 352 | except Exception as e: 353 | print(f"Exception: {e}") 354 | self.running = False 355 | 356 | async def screen_capture_worker(self): 357 | print(f"Screen capture enabled at {SCREEN_CAPTURE_FPS} frames per second.") 358 | with mss.mss(with_cursor=True) as sct: 359 | # Which display to capture. 0 is the composite of all screens. 1 is the primary screen. 360 | monitor = sct.monitors[1] 361 | while self.running: 362 | try: 363 | frame = sct.grab(monitor) 364 | # print(frame) 365 | buffer = io.BytesIO() 366 | Image.frombytes("RGB", frame.size, frame.bgra, "raw", "BGRX").save( 367 | buffer, format="JPEG" 368 | ) 369 | await self.send_video(buffer.getvalue()) 370 | except Exception as e: 371 | print(f"Exception: {e}") 372 | self.running = False 373 | return 374 | await asyncio.sleep(1.0 / SCREEN_CAPTURE_FPS) 375 | 376 | async def setup_model(self): 377 | try: 378 | response_modality = [] 379 | 380 | # Currently the API will throw an error if both response modalities are requested. 
This 381 | # will likely change soon, though. 382 | if AUDIO_OUTPUT: 383 | response_modality.append("AUDIO") 384 | if TEXT_OUTPUT: 385 | response_modality.append("TEXT") 386 | 387 | setup = { 388 | "setup": { 389 | "model": MODEL, 390 | "generation_config": { 391 | "response_modalities": response_modality, 392 | "speech_config": { 393 | "voice_config": {"prebuilt_voice_config": {"voice_name": VOICE}}, 394 | }, 395 | }, 396 | "tools": [], 397 | }, 398 | } 399 | if SYSTEM_INSTRUCTION_TEXT: 400 | print("System instruction enabled") 401 | setup["setup"]["system_instruction"] = { 402 | "parts": [ 403 | { 404 | "text": SYSTEM_INSTRUCTION_TEXT, 405 | } 406 | ] 407 | } 408 | if SEARCH: 409 | print("Search enabled") 410 | setup["setup"]["tools"].append({"google_search": {}}) 411 | if CODE_EXECUTION: 412 | print("Code execution enabled") 413 | setup["setup"]["tools"].append({"code_execution": {}}) 414 | if FUNCTION_DECLARATIONS: 415 | setup["setup"]["tools"].append({"function_declarations": FUNCTION_DECLARATIONS}) 416 | print("Sending setup", setup) 417 | await self.ws.send(json.dumps(setup)) 418 | except Exception as e: 419 | print(f"Exception: {e}") 420 | 421 | async def run(self): 422 | self.event_loop = asyncio.get_event_loop() 423 | self.running = True 424 | 425 | if AUDIO_INPUT: 426 | self.mic_audio_in = self.p.open( 427 | format=pyaudio.paInt16, 428 | channels=CHANNELS, 429 | rate=MIC_SAMPLE_RATE, 430 | input=True, 431 | stream_callback=self.mic_audio_in_callback, 432 | frames_per_buffer=int(MIC_SAMPLE_RATE / 1000) * 2 * 50, # 50ms (S16_LE is 2 bytes) 433 | start=False, 434 | ) 435 | 436 | self.speaker_audio_out = self.p.open( 437 | format=pyaudio.paInt16, 438 | channels=1, 439 | rate=SPEAKER_SAMPLE_RATE, 440 | output=True, 441 | frames_per_buffer=256, 442 | stream_callback=self.speaker_audio_out_callback, 443 | ) 444 | 445 | try: 446 | self.ws = await websockets.connect( 447 | uri=f'wss://{HOST}/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent?key={os.getenv("GOOGLE_API_KEY")}' 448 | ) 449 | print("Connected to Gemini") 450 | except Exception as e: 451 | print(f"Exception: {e}") 452 | return 453 | 454 | asyncio.create_task(self.stdin_worker()) 455 | asyncio.create_task(self.ws_receive_worker()) 456 | asyncio.create_task(self.print_audio_output_buffer_info()) 457 | if SCREEN_CAPTURE_FPS: 458 | asyncio.create_task(self.screen_capture_worker()) 459 | 460 | try: 461 | await self.setup_model() 462 | while self.running: 463 | await asyncio.sleep(1) 464 | except asyncio.CancelledError: 465 | pass 466 | except Exception as e: 467 | print(f"Exception: {e}") 468 | finally: 469 | print("Exiting ...") 470 | self.running = False 471 | sys.stdin.close() 472 | self.p.terminate() 473 | await self.ws.close() 474 | 475 | 476 | if __name__ == "__main__": 477 | parse_args() 478 | asyncio.run(AudioStreamer().run()) 479 | --------------------------------------------------------------------------------