├── README.md └── Gemini API quickstart.txt /README.md: -------------------------------------------------------------------------------- 1 | # Complete Gemini API Documentation for Offline LLM 2 | > ⚠️ **Disclaimer** 3 | > This documentation is an enhanced, summarized, and reorganized version of the [official Gemini API documentation](https://ai.google.dev). 4 | > It is shared for educational and reference purposes only. Always consult the official docs for the latest updates. 5 | 6 | 7 | This file contains the complete, single-file documentation for the Google Gemini API. 8 | 9 | ## Purpose 10 | 11 | The content herein is structured as a comprehensive knowledge base, specifically formatted to be ingested by an offline Large Language Model (LLM). By providing this entire document as context, an offline LLM can: 12 | 13 | * Answer questions about the Gemini API's features and functionalities. 14 | * Generate code snippets for various API use cases. 15 | * Explain concepts related to the different Gemini models, rate limits, and supported features. 16 | 17 | ## Content 18 | 19 | This document includes detailed information on: 20 | 21 | * **Quickstart**: Initial setup and first API requests. 22 | * **Gemini Models**: An overview of all available model variants and their specific use cases. 23 | * **Core Features**: In-depth explanations and code examples for: 24 | * Text Generation 25 | * Image Generation & Understanding 26 | * Video & Audio Understanding 27 | * Function Calling 28 | * Structured Output 29 | * and much more. 30 | * **Advanced Topics**: Guides on context caching, batch processing, and safety settings. 31 | 32 | ## Usage 33 | 34 | To use this documentation with an offline LLM, simply load the entire content of this `Gemini API quickstart.txt` file as a single text input or document. The LLM will then be able to use the information contained within to respond to your queries about the Gemini API. 35 | -------------------------------------------------------------------------------- /Gemini API quickstart.txt: -------------------------------------------------------------------------------- 1 | Gemini API quickstart 2 | 3 | This quickstart shows you how to install our libraries and make your first Gemini API request. 4 | 5 | Before you begin 6 | You need a Gemini API key. If you don't already have one, you can get it for free in Google AI Studio. 7 | 8 | Install the Google GenAI SDK 9 | 10 | pip install -q -U google-genai 11 | 12 | 13 | 14 | Make your first request 15 | Here is an example that uses the generateContent method to send a request to the Gemini API using the Gemini 2.5 Flash model. 16 | 17 | If you set your API key as the environment variable GEMINI_API_KEY, it will be picked up automatically by the client when using the Gemini API libraries. Otherwise you will need to pass your API key as an argument when initializing the client. 18 | 19 | Note that all code samples in the Gemini API docs assume that you have set the environment variable GEMINI_API_KEY. 20 | 21 | 22 | 23 | from google import genai 24 | 25 | # The client gets the API key from the environment variable `GEMINI_API_KEY`. 
26 | client = genai.Client() 27 | 28 | response = client.models.generate_content( 29 | model="gemini-2.5-flash", contents="Explain how AI works in a few words" 30 | ) 31 | print(response.text) 32 | 33 | 34 | 35 | 36 | "Thinking" is on by default on many of our code samples 37 | Many code samples on this site use the Gemini 2.5 Flash model, which has the "thinking" feature enabled by default to enhance response quality. You should be aware that this may increase response time and token usage. If you prioritize speed or wish to minimize costs, you can disable this feature by setting the thinking budget to zero, as shown in the examples below. For more details, see the thinking guide. 38 | 39 | Note: Thinking is only available on Gemini 2.5 series models and can't be disabled on Gemini 2.5 Pro. 40 | 41 | 42 | 43 | from google import genai 44 | from google.genai import types 45 | 46 | client = genai.Client() 47 | 48 | response = client.models.generate_content( 49 | model="gemini-2.5-flash", 50 | contents="Explain how AI works in a few words", 51 | config=types.GenerateContentConfig( 52 | thinking_config=types.ThinkingConfig(thinking_budget=0) # Disables thinking 53 | ), 54 | ) 55 | print(response.text) 56 | 57 | 58 | 59 | 60 | 61 | Providing API key explicitly 62 | In some cases, you may want to explicitly provide an API key. For example: 63 | 64 | You're doing a simple API call and prefer hard coding the API key. 65 | You want explicit control without having to rely on automatic discovery of environment variables by the Gemini API libraries 66 | You're using an environment where environment variables are not supported (e.g web) or you are making REST calls. 67 | Below are examples for how you can provide an API key explicitly: 68 | 69 | 70 | 71 | from google import genai 72 | 73 | client = genai.Client(api_key="YOUR_API_KEY") 74 | 75 | response = client.models.generate_content( 76 | model="gemini-2.5-flash", contents="Explain how AI works in a few words" 77 | ) 78 | print(response.text) 79 | 80 | 81 | 82 | Gemini models 83 | 84 | 85 | 86 | Model variants 87 | The Gemini API offers different models that are optimized for specific use cases. Here's a brief overview of Gemini variants that are available: 88 | 89 | Model variant Input(s) Output Optimized for 90 | Gemini 2.5 Pro 91 | gemini-2.5-pro Audio, images, videos, text, and PDF Text Enhanced thinking and reasoning, multimodal understanding, advanced coding, and more 92 | Gemini 2.5 Flash 93 | gemini-2.5-flash Audio, images, videos, and text Text Adaptive thinking, cost efficiency 94 | Gemini 2.5 Flash-Lite Preview 95 | gemini-2.5-flash-lite-preview-06-17 Text, image, video, audio Text Most cost-efficient model supporting high throughput 96 | Gemini 2.5 Flash Native Audio 97 | gemini-2.5-flash-preview-native-audio-dialog & 98 | gemini-2.5-flash-exp-native-audio-thinking-dialog Audio, videos, and text Text and audio, interleaved High quality, natural conversational audio outputs, with or without thinking 99 | Gemini 2.5 Flash Preview TTS 100 | gemini-2.5-flash-preview-tts Text Audio Low latency, controllable, single- and multi-speaker text-to-speech audio generation 101 | Gemini 2.5 Pro Preview TTS 102 | gemini-2.5-pro-preview-tts Text Audio Low latency, controllable, single- and multi-speaker text-to-speech audio generation 103 | Gemini 2.0 Flash 104 | gemini-2.0-flash Audio, images, videos, and text Text Next generation features, speed, and realtime streaming. 
105 | Gemini 2.0 Flash Preview Image Generation 106 | gemini-2.0-flash-preview-image-generation Audio, images, videos, and text Text, images Conversational image generation and editing 107 | Gemini 2.0 Flash-Lite 108 | gemini-2.0-flash-lite Audio, images, videos, and text Text Cost efficiency and low latency 109 | Gemini 1.5 Flash 110 | gemini-1.5-flash Audio, images, videos, and text Text Fast and versatile performance across a diverse variety of tasks 111 | Deprecated 112 | Gemini 1.5 Flash-8B 113 | gemini-1.5-flash-8b Audio, images, videos, and text Text High volume and lower intelligence tasks 114 | Deprecated 115 | Gemini 1.5 Pro 116 | gemini-1.5-pro Audio, images, videos, and text Text Complex reasoning tasks requiring more intelligence 117 | Deprecated 118 | Gemini Embedding 119 | gemini-embedding-001 Text Text embeddings Measuring the relatedness of text strings 120 | Imagen 4 121 | imagen-4.0-generate-preview-06-06 122 | imagen-4.0-ultra-generate-preview-06-06 Text Images Our most up-to-date image generation model 123 | Imagen 3 124 | imagen-3.0-generate-002 Text Images High quality image generation model 125 | Veo 2 126 | veo-2.0-generate-001 Text, images Video High quality video generation 127 | Gemini 2.5 Flash Live 128 | gemini-live-2.5-flash-preview Audio, video, and text Text, audio Low-latency bidirectional voice and video interactions 129 | Gemini 2.0 Flash Live 130 | gemini-2.0-flash-live-001 Audio, video, and text Text, audio Low-latency bidirectional voice and video interactions 131 | 132 | 133 | 134 | 135 | 136 | Rate limits 137 | 138 | Rate limits regulate the number of requests you can make to the Gemini API within a given timeframe. These limits help maintain fair usage, protect against abuse, and help maintain system performance for all users. 139 | 140 | How rate limits work 141 | Rate limits are usually measured across three dimensions: 142 | 143 | Requests per minute (RPM) 144 | Requests per day (RPD) 145 | Tokens per minute (input) (TPM) 146 | Your usage is evaluated against each limit, and exceeding any of them will trigger a rate limit error. For example, if your RPM limit is 20, making 21 requests within a minute will result in an error, even if you haven't exceeded your TPM or other limits. 147 | 148 | Rate limits are applied per project, not per API key. 149 | 150 | Limits vary depending on the specific model being used, and some limits only apply to specific models. For example, Images per minute, or IPM, is only calculated for models capable of generating images (Imagen 3), but is conceptually similar to TPM. Other models might have a token per day limit (TPD). 151 | 152 | Rate limits are more restricted for experimental and preview models. 153 | 154 | Usage tiers 155 | Rate limits are tied to the project's usage tier. As your API usage and spending increase, you'll have an option to upgrade to a higher tier with increased rate limits. 
156 | 157 | 158 | Free Tier 159 | Tier 1 160 | Tier 2 161 | Tier 3 162 | Model RPM TPM RPD 163 | Text-out models 164 | Gemini 2.5 Pro 5 250,000 100 165 | Gemini 2.5 Flash 10 250,000 250 166 | Gemini 2.5 Flash-Lite Preview 06-17 15 250,000 1,000 167 | Gemini 2.0 Flash 15 1,000,000 200 168 | Gemini 2.0 Flash-Lite 30 1,000,000 200 169 | Live API 170 | Gemini 2.5 Flash Live 3 sessions 1,000,000 171 | Gemini 2.5 Flash Preview Native Audio Dialog 1 session 25,000 5 172 | Gemini 2.5 Flash Experimental Native Audio Thinking Dialog 1 session 10,000 5 173 | Gemini 2.0 Flash Live 3 sessions 1,000,000 174 | Multi-modal generation models 175 | Gemini 2.5 Flash Preview TTS 3 10,000 15 176 | Gemini 2.5 Pro Preview TTS 177 | Gemini 2.0 Flash Preview Image Generation 10 200,000 100 178 | Imagen 3 179 | Veo 2 180 | Other models 181 | Gemma 3 & 3n 30 15,000 14,400 182 | Gemini Embedding 100 30,000 1,000 183 | Deprecated models 184 | Gemini 1.5 Flash (Deprecated) 15 250,000 50 185 | Gemini 1.5 Flash-8B (Deprecated) 15 250,000 50 186 | Gemini 1.5 Pro (Deprecated) 187 | 188 | 189 | 190 | 191 | 192 | Batch Mode rate limits 193 | Batch Mode requests are subject to their own rate limits, separate from the non-batch mode API calls. 194 | 195 | Concurrent batch requests: 100 196 | Input file size limit: 2GB 197 | File storage limit: 20GB 198 | Enqueued tokens per model: The following table outlines the maximum number of tokens that can be enqueued for batch processing across all your active batch jobs for a given model. 199 | How to upgrade to the next tier 200 | The Gemini API uses Cloud Billing for all billing services. To transition from the Free tier to a paid tier, you must first enable Cloud Billing for your Google Cloud project. 201 | 202 | Once your project meets the specified criteria, it becomes eligible for an upgrade to the next tier. To request an upgrade, follow these steps: 203 | 204 | Navigate to the API keys page in AI Studio. 205 | Locate the project you want to upgrade and click "Upgrade". The "Upgrade" option will only show up for projects that meet next tier qualifications. 206 | After a quick validation, the project will be upgraded to the next tier. 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | Text generation 217 | 218 | The Gemini API can generate text output from various inputs, including text, images, video, and audio, leveraging Gemini models. 219 | 220 | Here's a basic example that takes a single text input: 221 | 222 | from google import genai 223 | 224 | client = genai.Client() 225 | 226 | response = client.models.generate_content( 227 | model="gemini-2.5-flash", 228 | contents="How does AI work?" 229 | ) 230 | print(response.text) 231 | 232 | 233 | 234 | 235 | Thinking with Gemini 2.5 236 | 2.5 Flash and Pro models have "thinking" enabled by default to enhance quality, which may take longer to run and increase token usage. 
237 | 238 | When using 2.5 Flash, you can disable thinking by setting the thinking budget to zero 239 | 240 | 241 | from google import genai 242 | from google.genai import types 243 | 244 | client = genai.Client() 245 | 246 | response = client.models.generate_content( 247 | model="gemini-2.5-flash", 248 | contents="How does AI work?", 249 | config=types.GenerateContentConfig( 250 | thinking_config=types.ThinkingConfig(thinking_budget=0) # Disables thinking 251 | ), 252 | ) 253 | print(response.text) 254 | 255 | 256 | 257 | 258 | 259 | System instructions and other configurations 260 | You can guide the behavior of Gemini models with system instructions. To do so, pass a GenerateContentConfig object. 261 | 262 | 263 | from google import genai 264 | from google.genai import types 265 | 266 | client = genai.Client() 267 | 268 | response = client.models.generate_content( 269 | model="gemini-2.5-flash", 270 | config=types.GenerateContentConfig( 271 | system_instruction="You are a cat. Your name is Neko."), 272 | contents="Hello there" 273 | ) 274 | 275 | print(response.text) 276 | 277 | 278 | The GenerateContentConfig object also lets you override default generation parameters, such as temperature. 279 | 280 | 281 | from google import genai 282 | from google.genai import types 283 | 284 | client = genai.Client() 285 | 286 | response = client.models.generate_content( 287 | model="gemini-2.5-flash", 288 | contents=["Explain how AI works"], 289 | config=types.GenerateContentConfig( 290 | temperature=0.1 291 | ) 292 | ) 293 | print(response.text) 294 | 295 | 296 | 297 | Multimodal inputs 298 | The Gemini API supports multimodal inputs, allowing you to combine text with media files. The following example demonstrates providing an image: 299 | 300 | from PIL import Image 301 | from google import genai 302 | 303 | client = genai.Client() 304 | 305 | image = Image.open("/path/to/organ.png") 306 | response = client.models.generate_content( 307 | model="gemini-2.5-flash", 308 | contents=[image, "Tell me about this instrument"] 309 | ) 310 | print(response.text) 311 | 312 | 313 | 314 | Streaming responses 315 | By default, the model returns a response only after the entire generation process is complete. 316 | 317 | For more fluid interactions, use streaming to receive GenerateContentResponse instances incrementally as they're generated. 318 | 319 | from google import genai 320 | 321 | client = genai.Client() 322 | 323 | response = client.models.generate_content_stream( 324 | model="gemini-2.5-flash", 325 | contents=["Explain how AI works"] 326 | ) 327 | for chunk in response: 328 | print(chunk.text, end="") 329 | 330 | 331 | 332 | Multi-turn conversations (Chat) 333 | Our SDKs provide functionality to collect multiple rounds of prompts and responses into a chat, giving you an easy way to keep track of the conversation history. 334 | 335 | Note: Chat functionality is only implemented as part of the SDKs. Behind the scenes, it still uses the generateContent API. For multi-turn conversations, the full conversation history is sent to the model with each follow-up turn. 
336 | 337 | 338 | from google import genai 339 | 340 | client = genai.Client() 341 | chat = client.chats.create(model="gemini-2.5-flash") 342 | 343 | response = chat.send_message("I have 2 dogs in my house.") 344 | print(response.text) 345 | 346 | response = chat.send_message("How many paws are in my house?") 347 | print(response.text) 348 | 349 | for message in chat.get_history(): 350 | print(f'role - {message.role}',end=": ") 351 | print(message.parts[0].text) 352 | 353 | 354 | 355 | Streaming can also be used for multi-turn conversations. 356 | 357 | 358 | from google import genai 359 | 360 | client = genai.Client() 361 | chat = client.chats.create(model="gemini-2.5-flash") 362 | 363 | response = chat.send_message_stream("I have 2 dogs in my house.") 364 | for chunk in response: 365 | print(chunk.text, end="") 366 | 367 | response = chat.send_message_stream("How many paws are in my house?") 368 | for chunk in response: 369 | print(chunk.text, end="") 370 | 371 | for message in chat.get_history(): 372 | print(f'role - {message.role}', end=": ") 373 | print(message.parts[0].text) 374 | 375 | 376 | 377 | 378 | 379 | Image generation 380 | 381 | You can generate images using the Gemini API with either Gemini's built-in multimodal capabilities or Imagen, Google's specialized image generation models. For most use cases, start with Gemini. Choose Imagen for specialized tasks where image quality is critical. See Choosing the right model section for more guidance. 382 | 383 | All generated images include a SynthID watermark. 384 | 385 | Before you begin 386 | Ensure you use a supported model and version for image generation: 387 | 388 | For Gemini, use Gemini 2.0 Flash Preview Image Generation. 389 | 390 | For Imagen, use one of the Imagen models (Imagen 3, Imagen 4 or Imagen 4 Ultra). 391 | 392 | Note that those models are only available on the Paid tier. 393 | You can access both Gemini and Imagen models using the same libraries. 394 | 395 | Note: Image generation may not be available in all regions and countries, review our Models page for more information. 396 | Generate images using Gemini 397 | Gemini can generate and process images conversationally. You can prompt Gemini with text, images, or a combination of both to achieve various image-related tasks, such as image generation and editing. 398 | 399 | You must include responseModalities: ["TEXT", "IMAGE"] in your configuration. Image-only output is not supported with these models. 
400 | 401 | Image generation (text-to-image) 402 | The following code demonstrates how to generate an image based on a descriptive prompt: 403 | 404 | 405 | from google import genai 406 | from google.genai import types 407 | from PIL import Image 408 | from io import BytesIO 409 | import base64 410 | 411 | client = genai.Client() 412 | 413 | contents = ('Hi, can you create a 3d rendered image of a pig ' 414 | 'with wings and a top hat flying over a happy ' 415 | 'futuristic scifi city with lots of greenery?') 416 | 417 | response = client.models.generate_content( 418 | model="gemini-2.0-flash-preview-image-generation", 419 | contents=contents, 420 | config=types.GenerateContentConfig( 421 | response_modalities=['TEXT', 'IMAGE'] 422 | ) 423 | ) 424 | 425 | for part in response.candidates[0].content.parts: 426 | if part.text is not None: 427 | print(part.text) 428 | elif part.inline_data is not None: 429 | image = Image.open(BytesIO((part.inline_data.data))) 430 | image.save('gemini-native-image.png') 431 | image.show() 432 | 433 | 434 | 435 | 436 | Speech generation (text-to-speech) 437 | 438 | The Gemini API can transform text input into single speaker or multi-speaker audio using native text-to-speech (TTS) generation capabilities. Text-to-speech (TTS) generation is controllable, meaning you can use natural language to structure interactions and guide the style, accent, pace, and tone of the audio. 439 | 440 | The TTS capability differs from speech generation provided through the Live API, which is designed for interactive, unstructured audio, and multimodal inputs and outputs. While the Live API excels in dynamic conversational contexts, TTS through the Gemini API is tailored for scenarios that require exact text recitation with fine-grained control over style and sound, such as podcast or audiobook generation. 441 | 442 | This guide shows you how to generate single-speaker and multi-speaker audio from text. 443 | 444 | Preview: Native text-to-speech (TTS) is in Preview. 445 | Before you begin 446 | Ensure you use a Gemini 2.5 model variant with native text-to-speech (TTS) capabilities, as listed in the Supported models section. For optimal results, consider which model best fits your specific use case. 447 | 448 | You may find it useful to test the Gemini 2.5 TTS models in AI Studio before you start building. 449 | 450 | Note: TTS models accept text-only inputs and produce audio-only outputs. For a complete list of restrictions specific to TTS models, review the Limitations section. 451 | Single-speaker text-to-speech 452 | To convert text to single-speaker audio, set the response modality to "audio", and pass a SpeechConfig object with VoiceConfig set. You'll need to choose a voice name from the prebuilt output voices. 
453 | 454 | This example saves the output audio from the model in a wave file: 455 | 456 | from google import genai 457 | from google.genai import types 458 | import wave 459 | 460 | # Set up the wave file to save the output: 461 | def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2): 462 | with wave.open(filename, "wb") as wf: 463 | wf.setnchannels(channels) 464 | wf.setsampwidth(sample_width) 465 | wf.setframerate(rate) 466 | wf.writeframes(pcm) 467 | 468 | client = genai.Client() 469 | 470 | response = client.models.generate_content( 471 | model="gemini-2.5-flash-preview-tts", 472 | contents="Say cheerfully: Have a wonderful day!", 473 | config=types.GenerateContentConfig( 474 | response_modalities=["AUDIO"], 475 | speech_config=types.SpeechConfig( 476 | voice_config=types.VoiceConfig( 477 | prebuilt_voice_config=types.PrebuiltVoiceConfig( 478 | voice_name='Kore', 479 | ) 480 | ) 481 | ), 482 | ) 483 | ) 484 | 485 | data = response.candidates[0].content.parts[0].inline_data.data 486 | 487 | file_name='out.wav' 488 | wave_file(file_name, data) # Saves the file to current directory 489 | 490 | 491 | 492 | Multi-speaker text-to-speech 493 | For multi-speaker audio, you'll need a MultiSpeakerVoiceConfig object with each speaker (up to 2) configured as a SpeakerVoiceConfig. You'll need to define each speaker with the same names used in the prompt: 494 | 495 | 496 | from google import genai 497 | from google.genai import types 498 | import wave 499 | 500 | # Set up the wave file to save the output: 501 | def wave_file(filename, pcm, channels=1, rate=24000, sample_width=2): 502 | with wave.open(filename, "wb") as wf: 503 | wf.setnchannels(channels) 504 | wf.setsampwidth(sample_width) 505 | wf.setframerate(rate) 506 | wf.writeframes(pcm) 507 | 508 | client = genai.Client() 509 | 510 | prompt = """TTS the following conversation between Joe and Jane: 511 | Joe: How's it going today Jane? 512 | Jane: Not too bad, how about you?""" 513 | 514 | response = client.models.generate_content( 515 | model="gemini-2.5-flash-preview-tts", 516 | contents=prompt, 517 | config=types.GenerateContentConfig( 518 | response_modalities=["AUDIO"], 519 | speech_config=types.SpeechConfig( 520 | multi_speaker_voice_config=types.MultiSpeakerVoiceConfig( 521 | speaker_voice_configs=[ 522 | types.SpeakerVoiceConfig( 523 | speaker='Joe', 524 | voice_config=types.VoiceConfig( 525 | prebuilt_voice_config=types.PrebuiltVoiceConfig( 526 | voice_name='Kore', 527 | ) 528 | ) 529 | ), 530 | types.SpeakerVoiceConfig( 531 | speaker='Jane', 532 | voice_config=types.VoiceConfig( 533 | prebuilt_voice_config=types.PrebuiltVoiceConfig( 534 | voice_name='Puck', 535 | ) 536 | ) 537 | ), 538 | ] 539 | ) 540 | ) 541 | ) 542 | ) 543 | 544 | data = response.candidates[0].content.parts[0].inline_data.data 545 | 546 | file_name='out.wav' 547 | wave_file(file_name, data) # Saves the file to current directory 548 | 549 | 550 | 551 | 552 | Controlling speech style with prompts 553 | You can control style, tone, accent, and pace using natural language prompts for both single- and multi-speaker TTS. For example, in a single-speaker prompt, you can say: 554 | 555 | 556 | Say in an spooky whisper: 557 | "By the pricking of my thumbs... 558 | Something wicked this way comes" 559 | In a multi-speaker prompt, provide the model with each speaker's name and corresponding transcript. 
You can also provide guidance for each speaker individually: 560 | 561 | 562 | Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy: 563 | 564 | Speaker1: So... what's on the agenda today? 565 | Speaker2: You're never going to guess! 566 | Try using a voice option that corresponds to the style or emotion you want to convey, to emphasize it even more. In the previous prompt, for example, Enceladus's breathiness might emphasize "tired" and "bored", while Puck's upbeat tone could complement "excited" and "happy". 567 | 568 | Generating a prompt to convert to audio 569 | The TTS models only output audio, but you can use other models to generate a transcript first, then pass that transcript to the TTS model to read aloud. 570 | 571 | 572 | from google import genai 573 | from google.genai import types 574 | 575 | client = genai.Client() 576 | 577 | transcript = client.models.generate_content( 578 | model="gemini-2.0-flash", 579 | contents="""Generate a short transcript around 100 words that reads 580 | like it was clipped from a podcast by excited herpetologists. 581 | The hosts names are Dr. Anya and Liam.""").text 582 | 583 | response = client.models.generate_content( 584 | model="gemini-2.5-flash-preview-tts", 585 | contents=transcript, 586 | config=types.GenerateContentConfig( 587 | response_modalities=["AUDIO"], 588 | speech_config=types.SpeechConfig( 589 | multi_speaker_voice_config=types.MultiSpeakerVoiceConfig( 590 | speaker_voice_configs=[ 591 | types.SpeakerVoiceConfig( 592 | speaker='Dr. Anya', 593 | voice_config=types.VoiceConfig( 594 | prebuilt_voice_config=types.PrebuiltVoiceConfig( 595 | voice_name='Kore', 596 | ) 597 | ) 598 | ), 599 | types.SpeakerVoiceConfig( 600 | speaker='Liam', 601 | voice_config=types.VoiceConfig( 602 | prebuilt_voice_config=types.PrebuiltVoiceConfig( 603 | voice_name='Puck', 604 | ) 605 | ) 606 | ), 607 | ] 608 | ) 609 | ) 610 | ) 611 | ) 612 | 613 | # ...Code to stream or save the output 614 | 615 | 616 | Voice options 617 | TTS models support the following 30 voice options in the voice_name field: 618 | 619 | Zephyr -- Bright Puck -- Upbeat Charon -- Informative 620 | Kore -- Firm Fenrir -- Excitable Leda -- Youthful 621 | Orus -- Firm Aoede -- Breezy Callirrhoe -- Easy-going 622 | Autonoe -- Bright Enceladus -- Breathy Iapetus -- Clear 623 | Umbriel -- Easy-going Algieba -- Smooth Despina -- Smooth 624 | Erinome -- Clear Algenib -- Gravelly Rasalgethi -- Informative 625 | Laomedeia -- Upbeat Achernar -- Soft Alnilam -- Firm 626 | Schedar -- Even Gacrux -- Mature Pulcherrima -- Forward 627 | Achird -- Friendly Zubenelgenubi -- Casual Vindemiatrix -- Gentle 628 | Sadachbia -- Lively Sadaltager -- Knowledgeable Sulafat -- Warm 629 | You can hear all the voice options in AI Studio. 630 | 631 | Supported languages 632 | The TTS models detect the input language automatically. 
They support the following 24 languages: 633 | 634 | Language BCP-47 Code Language BCP-47 Code 635 | Arabic (Egyptian) ar-EG German (Germany) de-DE 636 | English (US) en-US Spanish (US) es-US 637 | French (France) fr-FR Hindi (India) hi-IN 638 | Indonesian (Indonesia) id-ID Italian (Italy) it-IT 639 | Japanese (Japan) ja-JP Korean (Korea) ko-KR 640 | Portuguese (Brazil) pt-BR Russian (Russia) ru-RU 641 | Dutch (Netherlands) nl-NL Polish (Poland) pl-PL 642 | Thai (Thailand) th-TH Turkish (Turkey) tr-TR 643 | Vietnamese (Vietnam) vi-VN Romanian (Romania) ro-RO 644 | Ukrainian (Ukraine) uk-UA Bengali (Bangladesh) bn-BD 645 | English (India) en-IN & hi-IN bundle Marathi (India) mr-IN 646 | Tamil (India) ta-IN Telugu (India) te-IN 647 | Supported models 648 | Model Single speaker Multispeaker 649 | Gemini 2.5 Flash Preview TTS✔️ ✔️ 650 | Gemini 2.5 Pro Preview TTS ✔️ ✔️ 651 | Limitations 652 | TTS models can only receive text inputs and generate audio outputs. 653 | A TTS session has a context window limit of 32k tokens. 654 | Review Languages section for language support. 655 | 656 | 657 | 658 | 659 | Getting started with long context 660 | Earlier versions of generative models were only able to process 8,000 tokens at a time. Newer models pushed this further by accepting 32,000 or even 128,000 tokens. Gemini is the first model capable of accepting 1 million tokens. 661 | 662 | In practice, 1 million tokens would look like: 663 | 664 | 50,000 lines of code (with the standard 80 characters per line) 665 | All the text messages you have sent in the last 5 years 666 | 8 average length English novels 667 | Transcripts of over 200 average length podcast episodes 668 | 669 | 670 | Structured output 671 | 672 | You can configure Gemini for structured output instead of unstructured text, allowing precise extraction and standardization of information for further processing. For example, you can use structured output to extract information from resumes, standardize them to build a structured database. 673 | 674 | Gemini can generate either JSON or enum values as structured output. 675 | 676 | Generating JSON 677 | There are two ways to generate JSON using the Gemini API: 678 | 679 | Configure a schema on the model 680 | Provide a schema in a text prompt 681 | Configuring a schema on the model is the recommended way to generate JSON, because it constrains the model to output JSON. 682 | 683 | Configuring a schema (recommended) 684 | To constrain the model to generate JSON, configure a responseSchema. The model will then respond to any prompt with JSON-formatted output. 685 | 686 | 687 | from google import genai 688 | from pydantic import BaseModel 689 | 690 | class Recipe(BaseModel): 691 | recipe_name: str 692 | ingredients: list[str] 693 | 694 | client = genai.Client() 695 | response = client.models.generate_content( 696 | model="gemini-2.5-flash", 697 | contents="List a few popular cookie recipes, and include the amounts of ingredients.", 698 | config={ 699 | "response_mime_type": "application/json", 700 | "response_schema": list[Recipe], 701 | }, 702 | ) 703 | # Use the response as a JSON string. 704 | print(response.text) 705 | 706 | # Use instantiated objects. 
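# response.parsed deserializes the JSON response into instances of the Pydantic
# model given in response_schema (here, a list of Recipe objects).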
707 | my_recipes: list[Recipe] = response.parsed
708 | 
709 | The output might look like this:
710 | 
711 | [
712 |   {
713 |     "recipeName": "Chocolate Chip Cookies",
714 |     "ingredients": [
715 |       "1 cup (2 sticks) unsalted butter, softened",
716 |       "3/4 cup granulated sugar",
717 |       "3/4 cup packed brown sugar",
718 |       "1 teaspoon vanilla extract",
719 |       "2 large eggs",
720 |       "2 1/4 cups all-purpose flour",
721 |       "1 teaspoon baking soda",
722 |       "1 teaspoon salt",
723 |       "2 cups chocolate chips"
724 |     ]
725 |   },
726 |   ...
727 | ]
728 | 
729 | 
730 | 
731 | 
732 | Providing a schema in a text prompt
733 | Instead of configuring a schema, you can supply a schema as natural language or pseudo-code in a text prompt. This method is not recommended, because it might produce lower quality output, and because the model is not constrained to follow the schema.
734 | 
735 | Warning: Don't provide a schema in a text prompt if you're configuring a responseSchema. This can produce unexpected or low quality results.
736 | Here's a generic example of a schema provided in a text prompt:
737 | 
738 | 
739 | List a few popular cookie recipes, and include the amounts of ingredients.
740 | 
741 | Produce JSON matching this specification:
742 | 
743 | Recipe = { "recipeName": string, "ingredients": array }
744 | Return: array
745 | Since the model gets the schema from text in the prompt, you might have some flexibility in how you represent the schema. But when you supply a schema inline like this, the model is not actually constrained to return JSON. For a more deterministic, higher quality response, configure a schema on the model, and don't duplicate the schema in the text prompt.
746 | 
747 | Generating enum values
748 | In some cases you might want the model to choose a single option from a list of options. To implement this behavior, you can pass an enum in your schema. You can use an enum option anywhere you could use a string in the responseSchema, because an enum is an array of strings. Like a JSON schema, an enum lets you constrain model output to meet the requirements of your application.
749 | 
750 | For example, assume that you're developing an application to classify musical instruments into one of five categories: "Percussion", "String", "Woodwind", "Brass", or "Keyboard". You could create an enum to help with this task.
751 | 
752 | In the following example, you pass an enum as the responseSchema, constraining the model to choose the most appropriate option.
753 | 754 | from google import genai 755 | import enum 756 | 757 | class Instrument(enum.Enum): 758 | PERCUSSION = "Percussion" 759 | STRING = "String" 760 | WOODWIND = "Woodwind" 761 | BRASS = "Brass" 762 | KEYBOARD = "Keyboard" 763 | 764 | client = genai.Client() 765 | response = client.models.generate_content( 766 | model='gemini-2.5-flash', 767 | contents='What type of instrument is an oboe?', 768 | config={ 769 | 'response_mime_type': 'text/x.enum', 770 | 'response_schema': Instrument, 771 | }, 772 | ) 773 | 774 | print(response.text) 775 | # Woodwind 776 | 777 | 778 | And you can also pass the schema as JSON: 779 | 780 | from google import genai 781 | 782 | client = genai.Client() 783 | response = client.models.generate_content( 784 | model='gemini-2.5-flash', 785 | contents='What type of instrument is an oboe?', 786 | config={ 787 | 'response_mime_type': 'text/x.enum', 788 | 'response_schema': { 789 | "type": "STRING", 790 | "enum": ["Percussion", "String", "Woodwind", "Brass", "Keyboard"], 791 | }, 792 | }, 793 | ) 794 | 795 | print(response.text) 796 | # Woodwind 797 | 798 | 799 | 800 | 801 | Gemini thinking 802 | 803 | The Gemini 2.5 series models use an internal "thinking process" that significantly improves their reasoning and multi-step planning abilities, making them highly effective for complex tasks such as coding, advanced mathematics, and data analysis. 804 | 805 | This guide shows you how to work with Gemini's thinking capabilities using the Gemini API. 806 | 807 | Generating content with thinking 808 | Initiating a request with a thinking model is similar to any other content generation request. The key difference lies in specifying one of the models with thinking support in the model field, as demonstrated in the following text generation example: 809 | 810 | from google import genai 811 | 812 | client = genai.Client() 813 | prompt = "Explain the concept of Occam's Razor and provide a simple, everyday example." 814 | response = client.models.generate_content( 815 | model="gemini-2.5-pro", 816 | contents=prompt 817 | ) 818 | 819 | print(response.text) 820 | 821 | 822 | Thinking budgets 823 | The thinkingBudget parameter guides the model on the number of thinking tokens to use when generating a response. A higher token count generally allows for more detailed reasoning, which can be beneficial for tackling more complex tasks. If latency is more important, use a lower budget or disable thinking by setting thinkingBudget to 0. Setting the thinkingBudget to -1 turns on dynamic thinking, meaning the model will adjust the budget based on the complexity of the request. 824 | 825 | The thinkingBudget is only supported in Gemini 2.5 Flash, 2.5 Pro, and 2.5 Flash-Lite. Depending on the prompt, the model might overflow or underflow the token budget. 826 | 827 | The following are thinkingBudget configuration details for each model type. 
828 | 829 | 830 | Model Default setting 831 | (Thinking budget is not set) Range Disable thinking Turn on dynamic thinking 832 | 2.5 Pro Dynamic thinking: Model decides when and how much to think 128 to 32768 N/A: Cannot disable thinking thinkingBudget = -1 833 | 2.5 Flash Dynamic thinking: Model decides when and how much to think 0 to 24576 thinkingBudget = 0 thinkingBudget = -1 834 | 2.5 Flash Lite Model does not think 512 to 24576 thinkingBudget = 0 thinkingBudget = -1 835 | 836 | 837 | 838 | 839 | from google import genai 840 | from google.genai import types 841 | 842 | client = genai.Client() 843 | 844 | response = client.models.generate_content( 845 | model="gemini-2.5-pro", 846 | contents="Provide a list of 3 famous physicists and their key contributions", 847 | config=types.GenerateContentConfig( 848 | thinking_config=types.ThinkingConfig(thinking_budget=1024) 849 | # Turn off thinking: 850 | # thinking_config=types.ThinkingConfig(thinking_budget=0) 851 | # Turn on dynamic thinking: 852 | # thinking_config=types.ThinkingConfig(thinking_budget=-1) 853 | ), 854 | ) 855 | 856 | print(response.text) 857 | 858 | 859 | 860 | Thought summaries 861 | Thought summaries are synthesized versions of the model's raw thoughts and offer insights into the model's internal reasoning process. Note that thinking budgets apply to the model's raw thoughts and not to thought summaries. 862 | 863 | You can enable thought summaries by setting includeThoughts to true in your request configuration. You can then access the summary by iterating through the response parameter's parts, and checking the thought boolean. 864 | 865 | Here's an example demonstrating how to enable and retrieve thought summaries without streaming, which returns a single, final thought summary with the response: 866 | 867 | 868 | from google import genai 869 | from google.genai import types 870 | 871 | client = genai.Client() 872 | prompt = "What is the sum of the first 50 prime numbers?" 873 | response = client.models.generate_content( 874 | model="gemini-2.5-pro", 875 | contents=prompt, 876 | config=types.GenerateContentConfig( 877 | thinking_config=types.ThinkingConfig( 878 | include_thoughts=True 879 | ) 880 | ) 881 | ) 882 | 883 | for part in response.candidates[0].content.parts: 884 | if not part.text: 885 | continue 886 | if part.thought: 887 | print("Thought summary:") 888 | print(part.text) 889 | print() 890 | else: 891 | print("Answer:") 892 | print(part.text) 893 | print() 894 | 895 | 896 | 897 | 898 | Function calling with the Gemini API 899 | 900 | Function calling lets you connect models to external tools and APIs. Instead of generating text responses, the model determines when to call specific functions and provides the necessary parameters to execute real-world actions. This allows the model to act as a bridge between natural language and real-world actions and data. Function calling has 3 primary use cases: 901 | 902 | Augment Knowledge: Access information from external sources like databases, APIs, and knowledge bases. 903 | Extend Capabilities: Use external tools to perform computations and extend the limitations of the model, such as using a calculator or creating charts. 904 | Take Actions: Interact with external systems using APIs, such as scheduling appointments, creating invoices, sending emails, or controlling smart home devices. 
905 | 906 | 907 | from google import genai 908 | from google.genai import types 909 | 910 | # Define the function declaration for the model 911 | schedule_meeting_function = { 912 | "name": "schedule_meeting", 913 | "description": "Schedules a meeting with specified attendees at a given time and date.", 914 | "parameters": { 915 | "type": "object", 916 | "properties": { 917 | "attendees": { 918 | "type": "array", 919 | "items": {"type": "string"}, 920 | "description": "List of people attending the meeting.", 921 | }, 922 | "date": { 923 | "type": "string", 924 | "description": "Date of the meeting (e.g., '2024-07-29')", 925 | }, 926 | "time": { 927 | "type": "string", 928 | "description": "Time of the meeting (e.g., '15:00')", 929 | }, 930 | "topic": { 931 | "type": "string", 932 | "description": "The subject or topic of the meeting.", 933 | }, 934 | }, 935 | "required": ["attendees", "date", "time", "topic"], 936 | }, 937 | } 938 | 939 | # Configure the client and tools 940 | client = genai.Client() 941 | tools = types.Tool(function_declarations=[schedule_meeting_function]) 942 | config = types.GenerateContentConfig(tools=[tools]) 943 | 944 | # Send request with function declarations 945 | response = client.models.generate_content( 946 | model="gemini-2.5-flash", 947 | contents="Schedule a meeting with Bob and Alice for 03/14/2025 at 10:00 AM about the Q3 planning.", 948 | config=config, 949 | ) 950 | 951 | # Check for a function call 952 | if response.candidates[0].content.parts[0].function_call: 953 | function_call = response.candidates[0].content.parts[0].function_call 954 | print(f"Function to call: {function_call.name}") 955 | print(f"Arguments: {function_call.args}") 956 | # In a real app, you would call your function here: 957 | # result = schedule_meeting(**function_call.args) 958 | else: 959 | print("No function call found in the response.") 960 | print(response.text) 961 | 962 | 963 | Function calling involves a structured interaction between your application, the model, and external functions. Here's a breakdown of the process: 964 | 965 | Define Function Declaration: Define the function declaration in your application code. Function Declarations describe the function's name, parameters, and purpose to the model. 966 | Call LLM with function declarations: Send user prompt along with the function declaration(s) to the model. It analyzes the request and determines if a function call would be helpful. If so, it responds with a structured JSON object. 967 | Execute Function Code (Your Responsibility): The Model does not execute the function itself. It's your application's responsibility to process the response and check for Function Call, if 968 | Yes: Extract the name and args of the function and execute the corresponding function in your application. 969 | No: The model has provided a direct text response to the prompt (this flow is less emphasized in the example but is a possible outcome). 970 | Create User friendly response: If a function was executed, capture the result and send it back to the model in a subsequent turn of the conversation. It will use the result to generate a final, user-friendly response that incorporates the information from the function call. 971 | This process can be repeated over multiple turns, allowing for complex interactions and workflows. The model also supports calling multiple functions in a single turn (parallel function calling) and in sequence (compositional function calling). 
972 | 
973 | Step 1: Define a function declaration
974 | Within your application code, define a function that allows users to set light values, along with its declaration. This function could call external services or APIs.
975 | 
976 | 
977 | # Define a function that the model can call to control smart lights
978 | set_light_values_declaration = {
979 |     "name": "set_light_values",
980 |     "description": "Sets the brightness and color temperature of a light.",
981 |     "parameters": {
982 |         "type": "object",
983 |         "properties": {
984 |             "brightness": {
985 |                 "type": "integer",
986 |                 "description": "Light level from 0 to 100. Zero is off and 100 is full brightness",
987 |             },
988 |             "color_temp": {
989 |                 "type": "string",
990 |                 "enum": ["daylight", "cool", "warm"],
991 |                 "description": "Color temperature of the light fixture, which can be `daylight`, `cool` or `warm`.",
992 |             },
993 |         },
994 |         "required": ["brightness", "color_temp"],
995 |     },
996 | }
997 | 
998 | # This is the actual function that would be called based on the model's suggestion
999 | def set_light_values(brightness: int, color_temp: str) -> dict[str, int | str]:
1000 |     """Set the brightness and color temperature of a room light. (mock API).
1001 | 
1002 |     Args:
1003 |         brightness: Light level from 0 to 100. Zero is off and 100 is full brightness
1004 |         color_temp: Color temperature of the light fixture, which can be `daylight`, `cool` or `warm`.
1005 | 
1006 |     Returns:
1007 |         A dictionary containing the set brightness and color temperature.
1008 |     """
1009 |     return {"brightness": brightness, "colorTemperature": color_temp}
1010 | 
1011 | 
1012 | Step 2: Call the model with function declarations
1013 | Once you have defined your function declarations, you can prompt the model to use them. It analyzes the prompt and function declarations and decides whether to respond directly or to call a function. If a function is called, the response object will contain a function call suggestion.
1014 | 
1015 | from google import genai
1016 | from google.genai import types
1017 | 
1018 | # Configure the client and tools
1019 | client = genai.Client()
1020 | tools = types.Tool(function_declarations=[set_light_values_declaration])
1021 | config = types.GenerateContentConfig(tools=[tools])
1022 | 
1023 | # Define user prompt
1024 | contents = [
1025 |     types.Content(
1026 |         role="user", parts=[types.Part(text="Turn the lights down to a romantic level")]
1027 |     )
1028 | ]
1029 | 
1030 | # Send request with function declarations
1031 | response = client.models.generate_content(
1032 |     model="gemini-2.5-flash",
1033 |     contents=contents,
1034 |     config=config,
1035 | )
1036 | 
1037 | print(response.candidates[0].content.parts[0].function_call)
1038 | 
1039 | 
1040 | The model then returns a functionCall object in an OpenAPI compatible schema specifying how to call one or more of the declared functions in order to respond to the user's question.
1041 | 
1042 | 
1043 | id=None args={'color_temp': 'warm', 'brightness': 25} name='set_light_values'
1044 | 
1045 | 
1046 | Step 3: Execute set_light_values function code
1047 | Extract the function call details from the model's response, parse the arguments, and execute the set_light_values function.
1048 | 
1049 | # Extract tool call details, it may not be in the first part.
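# If the function call is not in the first part, scan
# response.candidates[0].content.parts for a part whose function_call is set,
# or use the response.function_calls convenience list shown later in this guide.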
1050 | tool_call = response.candidates[0].content.parts[0].function_call 1051 | 1052 | if tool_call.name == "set_light_values": 1053 | result = set_light_values(**tool_call.args) 1054 | print(f"Function execution result: {result}") 1055 | 1056 | 1057 | Step 4: Create user friendly response with function result and call the model again 1058 | Finally, send the result of the function execution back to the model so it can incorporate this information into its final response to the user. 1059 | 1060 | # Create a function response part 1061 | function_response_part = types.Part.from_function_response( 1062 | name=tool_call.name, 1063 | response={"result": result}, 1064 | ) 1065 | 1066 | # Append function call and result of the function execution to contents 1067 | contents.append(response.candidates[0].content) # Append the content from the model's response. 1068 | contents.append(types.Content(role="user", parts=[function_response_part])) # Append the function response 1069 | 1070 | final_response = client.models.generate_content( 1071 | model="gemini-2.5-flash", 1072 | config=config, 1073 | contents=contents, 1074 | ) 1075 | 1076 | print(final_response.text) 1077 | 1078 | 1079 | 1080 | This completes the function calling flow. The model successfully used the set_light_values function to perform the request action of the user. 1081 | 1082 | Function declarations 1083 | When you implement function calling in a prompt, you create a tools object, which contains one or more function declarations. You define functions using JSON, specifically with a select subset of the OpenAPI schema format. A single function declaration can include the following parameters: 1084 | 1085 | name (string): A unique name for the function (get_weather_forecast, send_email). Use descriptive names without spaces or special characters (use underscores or camelCase). 1086 | description (string): A clear and detailed explanation of the function's purpose and capabilities. This is crucial for the model to understand when to use the function. Be specific and provide examples if helpful ("Finds theaters based on location and optionally movie title which is currently playing in theaters."). 1087 | parameters (object): Defines the input parameters the function expects. 1088 | type (string): Specifies the overall data type, such as object. 1089 | properties (object): Lists individual parameters, each with: 1090 | type (string): The data type of the parameter, such as string, integer, boolean, array. 1091 | description (string): A description of the parameter's purpose and format. Provide examples and constraints ("The city and state, e.g., 'San Francisco, CA' or a zip code e.g., '95616'."). 1092 | enum (array, optional): If the parameter values are from a fixed set, use "enum" to list the allowed values instead of just describing them in the description. This improves accuracy ("enum": ["daylight", "cool", "warm"]). 1093 | required (array): An array of strings listing the parameter names that are mandatory for the function to operate. 1094 | Function calling with thinking 1095 | Enabling "thinking" can improve function call performance by allowing the model to reason through a request before suggesting function calls. 1096 | 1097 | However, because the Gemini API is stateless, this reasoning context is lost between turns, which can reduce the quality of function calls as they require multiple turn requests. 1098 | 1099 | To preserve this context you can use thought signatures. 
A thought signature is an encrypted representation of the model's internal thought process that you pass back to the model on subsequent turns. 1100 | 1101 | To use thought signatures: 1102 | 1103 | Receive the signature: When thinking is enabled, the API response will include a thought_signature field containing an encrypted representation of the model's reasoning. 1104 | Return the signature: When you send the function's execution result back to the server, include the thought_signature you received. 1105 | This allows the model to restore its previous thinking context and will likely result in better function calling performance. 1106 | 1107 | Receiving signatures from the server 1108 | 1109 | Signatures are returned in the part after the model's thinking phase, which typically is a text or function call. 1110 | 1111 | 1112 | Parallel function calling 1113 | In addition to single turn function calling, you can also call multiple functions at once. Parallel function calling lets you execute multiple functions at once and is used when the functions are not dependent on each other. This is useful in scenarios like gathering data from multiple independent sources, such as retrieving customer details from different databases or checking inventory levels across various warehouses or performing multiple actions such as converting your apartment into a disco. 1114 | 1115 | 1116 | power_disco_ball = { 1117 | "name": "power_disco_ball", 1118 | "description": "Powers the spinning disco ball.", 1119 | "parameters": { 1120 | "type": "object", 1121 | "properties": { 1122 | "power": { 1123 | "type": "boolean", 1124 | "description": "Whether to turn the disco ball on or off.", 1125 | } 1126 | }, 1127 | "required": ["power"], 1128 | }, 1129 | } 1130 | 1131 | start_music = { 1132 | "name": "start_music", 1133 | "description": "Play some music matching the specified parameters.", 1134 | "parameters": { 1135 | "type": "object", 1136 | "properties": { 1137 | "energetic": { 1138 | "type": "boolean", 1139 | "description": "Whether the music is energetic or not.", 1140 | }, 1141 | "loud": { 1142 | "type": "boolean", 1143 | "description": "Whether the music is loud or not.", 1144 | }, 1145 | }, 1146 | "required": ["energetic", "loud"], 1147 | }, 1148 | } 1149 | 1150 | dim_lights = { 1151 | "name": "dim_lights", 1152 | "description": "Dim the lights.", 1153 | "parameters": { 1154 | "type": "object", 1155 | "properties": { 1156 | "brightness": { 1157 | "type": "number", 1158 | "description": "The brightness of the lights, 0.0 is off, 1.0 is full.", 1159 | } 1160 | }, 1161 | "required": ["brightness"], 1162 | }, 1163 | } 1164 | 1165 | 1166 | Configure the function calling mode to allow using all of the specified tools. To learn more, you can read about configuring function calling. 1167 | 1168 | from google import genai 1169 | from google.genai import types 1170 | 1171 | # Configure the client and tools 1172 | client = genai.Client() 1173 | house_tools = [ 1174 | types.Tool(function_declarations=[power_disco_ball, start_music, dim_lights]) 1175 | ] 1176 | config = types.GenerateContentConfig( 1177 | tools=house_tools, 1178 | automatic_function_calling=types.AutomaticFunctionCallingConfig( 1179 | disable=True 1180 | ), 1181 | # Force the model to call 'any' function, instead of chatting. 
1182 | tool_config=types.ToolConfig( 1183 | function_calling_config=types.FunctionCallingConfig(mode='ANY') 1184 | ), 1185 | ) 1186 | 1187 | chat = client.chats.create(model="gemini-2.5-flash", config=config) 1188 | response = chat.send_message("Turn this place into a party!") 1189 | 1190 | # Print out each of the function calls requested from this single call 1191 | print("Example 1: Forced function calling") 1192 | for fn in response.function_calls: 1193 | args = ", ".join(f"{key}={val}" for key, val in fn.args.items()) 1194 | print(f"{fn.name}({args})") 1195 | 1196 | 1197 | Compositional function calling 1198 | Compositional or sequential function calling allows Gemini to chain multiple function calls together to fulfill a complex request. For example, to answer "Get the temperature in my current location", the Gemini API might first invoke a get_current_location() function followed by a get_weather() function that takes the location as a parameter. 1199 | 1200 | The following example demonstrates how to implement compositional function calling using the Python SDK and automatic function calling. 1201 | 1202 | 1203 | 1204 | import os 1205 | from google import genai 1206 | from google.genai import types 1207 | 1208 | # Example Functions 1209 | def get_weather_forecast(location: str) -> dict: 1210 | """Gets the current weather temperature for a given location.""" 1211 | print(f"Tool Call: get_weather_forecast(location={location})") 1212 | # TODO: Make API call 1213 | print("Tool Response: {'temperature': 25, 'unit': 'celsius'}") 1214 | return {"temperature": 25, "unit": "celsius"} # Dummy response 1215 | 1216 | def set_thermostat_temperature(temperature: int) -> dict: 1217 | """Sets the thermostat to a desired temperature.""" 1218 | print(f"Tool Call: set_thermostat_temperature(temperature={temperature})") 1219 | # TODO: Interact with a thermostat API 1220 | print("Tool Response: {'status': 'success'}") 1221 | return {"status": "success"} 1222 | 1223 | # Configure the client and model 1224 | client = genai.Client() 1225 | config = types.GenerateContentConfig( 1226 | tools=[get_weather_forecast, set_thermostat_temperature] 1227 | ) 1228 | 1229 | # Make the request 1230 | response = client.models.generate_content( 1231 | model="gemini-2.5-flash", 1232 | contents="If it's warmer than 20°C in London, set the thermostat to 20°C, otherwise set it to 18°C.", 1233 | config=config, 1234 | ) 1235 | 1236 | # Print the final, user-facing response 1237 | print(response.text) 1238 | 1239 | 1240 | Function calling modes 1241 | The Gemini API lets you control how the model uses the provided tools (function declarations). Specifically, you can set the mode within the.function_calling_config. 1242 | 1243 | AUTO (Default): The model decides whether to generate a natural language response or suggest a function call based on the prompt and context. This is the most flexible mode and recommended for most scenarios. 1244 | ANY: The model is constrained to always predict a function call and guarantees function schema adherence. If allowed_function_names is not specified, the model can choose from any of the provided function declarations. If allowed_function_names is provided as a list, the model can only choose from the functions in that list. Use this mode when you require a function call response to every prompt (if applicable). 1245 | NONE: The model is prohibited from making function calls. This is equivalent to sending a request without any function declarations. 
Use this to temporarily disable function calling without removing your tool definitions. 1246 | 1247 | 1248 | from google.genai import types 1249 | 1250 | # Configure function calling mode 1251 | tool_config = types.ToolConfig( 1252 | function_calling_config=types.FunctionCallingConfig( 1253 | mode="ANY", allowed_function_names=["get_current_temperature"] 1254 | ) 1255 | ) 1256 | 1257 | # Create the generation config 1258 | config = types.GenerateContentConfig( 1259 | tools=[tools], # not defined here. 1260 | tool_config=tool_config, 1261 | ) 1262 | 1263 | 1264 | Automatic function calling (Python only) 1265 | When using the Python SDK, you can provide Python functions directly as tools. The SDK automatically converts the Python function to declarations, handles the function call execution and the response cycle for you. The Python SDK then automatically: 1266 | 1267 | Detects function call responses from the model. 1268 | Call the corresponding Python function in your code. 1269 | Sends the function response back to the model. 1270 | Returns the model's final text response. 1271 | To use this, define your function with type hints and a docstring, and then pass the function itself (not a JSON declaration) as a tool: 1272 | 1273 | 1274 | from google import genai 1275 | from google.genai import types 1276 | 1277 | # Define the function with type hints and docstring 1278 | def get_current_temperature(location: str) -> dict: 1279 | """Gets the current temperature for a given location. 1280 | 1281 | Args: 1282 | location: The city and state, e.g. San Francisco, CA 1283 | 1284 | Returns: 1285 | A dictionary containing the temperature and unit. 1286 | """ 1287 | # ... (implementation) ... 1288 | return {"temperature": 25, "unit": "Celsius"} 1289 | 1290 | # Configure the client 1291 | client = genai.Client() 1292 | config = types.GenerateContentConfig( 1293 | tools=[get_current_temperature] 1294 | ) # Pass the function itself 1295 | 1296 | # Make the request 1297 | response = client.models.generate_content( 1298 | model="gemini-2.5-flash", 1299 | contents="What's the temperature in Boston?", 1300 | config=config, 1301 | ) 1302 | 1303 | print(response.text) # The SDK handles the function call and returns the final text 1304 | 1305 | 1306 | 1307 | Model context protocol (MCP) 1308 | Model Context Protocol (MCP) is an open standard for connecting AI applications with external tools and data. MCP provides a common protocol for models to access context, such as functions (tools), data sources (resources), or predefined prompts. 1309 | 1310 | The Gemini SDKs have built-in support for the MCP, reducing boilerplate code and offering automatic tool calling for MCP tools. When the model generates an MCP tool call, the Python and JavaScript client SDK can automatically execute the MCP tool and send the response back to the model in a subsequent request, continuing this loop until no more tool calls are made by the model. 1311 | 1312 | Here, you can find an example of how to use a local MCP server with Gemini and mcp SDK. 
1313 | 1314 | 1315 | pip install mcp 1316 | 1317 | import os 1318 | import asyncio 1319 | from datetime import datetime 1320 | from mcp import ClientSession, StdioServerParameters 1321 | from mcp.client.stdio import stdio_client 1322 | from google import genai 1323 | 1324 | client = genai.Client() 1325 | 1326 | # Create server parameters for stdio connection 1327 | server_params = StdioServerParameters( 1328 | command="npx", # Executable 1329 | args=["-y", "@philschmid/weather-mcp"], # MCP Server 1330 | env=None, # Optional environment variables 1331 | ) 1332 | 1333 | async def run(): 1334 | async with stdio_client(server_params) as (read, write): 1335 | async with ClientSession(read, write) as session: 1336 | # Prompt to get the weather for the current day in London. 1337 | prompt = f"What is the weather in London in {datetime.now().strftime('%Y-%m-%d')}?" 1338 | 1339 | # Initialize the connection between client and server 1340 | await session.initialize() 1341 | 1342 | # Send request to the model with MCP function declarations 1343 | response = await client.aio.models.generate_content( 1344 | model="gemini-2.5-flash", 1345 | contents=prompt, 1346 | config=genai.types.GenerateContentConfig( 1347 | temperature=0, 1348 | tools=[session], # uses the session, will automatically call the tool 1349 | # Uncomment if you **don't** want the SDK to automatically call the tool 1350 | # automatic_function_calling=genai.types.AutomaticFunctionCallingConfig( 1351 | # disable=True 1352 | # ), 1353 | ), 1354 | ) 1355 | print(response.text) 1356 | 1357 | # Start the asyncio event loop and run the main function 1358 | asyncio.run(run()) 1359 | 1360 | 1361 | 1362 | Limitations with built-in MCP support 1363 | Built-in MCP support is an experimental feature in our SDKs and has the following limitations: 1364 | 1365 | Only tools are supported, not resources or prompts. 1366 | It is available for the Python and JavaScript/TypeScript SDKs. 1367 | Breaking changes might occur in future releases. 1368 | Manual integration of MCP servers is always an option if these limitations affect what you're building. 1369 | 1370 | Supported models 1371 | This section lists models and their function calling capabilities. Experimental models are not included. You can find a comprehensive capabilities overview on the model overview page. 1372 | 1373 | Model Function Calling Parallel Function Calling Compositional Function Calling 1374 | Gemini 2.5 Pro ✔️ ✔️ ✔️ 1375 | Gemini 2.5 Flash ✔️ ✔️ ✔️ 1376 | Gemini 2.5 Flash-Lite ✔️ ✔️ ✔️ 1377 | Gemini 2.0 Flash ✔️ ✔️ ✔️ 1378 | Gemini 2.0 Flash-Lite X X X 1379 | Best practices 1380 | Function and Parameter Descriptions: Be extremely clear and specific in your descriptions. The model relies on these to choose the correct function and provide appropriate arguments. 1381 | Naming: Use descriptive function names (without spaces, periods, or dashes). 1382 | Strong Typing: Use specific types (integer, string, enum) for parameters to reduce errors. If a parameter has a limited set of valid values, use an enum. 1383 | Tool Selection: While the model can use an arbitrary number of tools, providing too many can increase the risk of selecting an incorrect or suboptimal tool. For best results, aim to provide only the relevant tools for the context or task, ideally keeping the active set to a maximum of 10-20. Consider dynamic tool selection based on conversation context if you have a large total number of tools.
1384 | Prompt Engineering: 1385 | Provide context: Tell the model its role (e.g., "You are a helpful weather assistant."). 1386 | Give instructions: Specify how and when to use functions (e.g., "Don't guess dates; always use a future date for forecasts."). 1387 | Encourage clarification: Instruct the model to ask clarifying questions if needed. 1388 | Temperature: Use a low temperature (e.g., 0) for more deterministic and reliable function calls. 1389 | Validation: If a function call has significant consequences (e.g., placing an order), validate the call with the user before executing it. 1390 | Error Handling: Implement robust error handling in your functions to gracefully handle unexpected inputs or API failures. Return informative error messages that the model can use to generate helpful responses to the user. 1391 | Security: Be mindful of security when calling external APIs. Use appropriate authentication and authorization mechanisms. Avoid exposing sensitive data in function calls. 1392 | Token Limits: Function descriptions and parameters count towards your input token limit. If you're hitting token limits, consider limiting the number of functions or the length of the descriptions, break down complex tasks into smaller, more focused function sets. 1393 | 1394 | 1395 | 1396 | 1397 | 1398 | 1399 | 1400 | 1401 | Document understanding 1402 | 1403 | Gemini models can process documents in PDF format, using native vision to understand entire document contexts. This goes beyond simple text extraction, allowing Gemini to: 1404 | 1405 | Analyze and interpret content, including text, images, diagrams, charts, and tables, even in long documents up to 1000 pages. 1406 | Extract information into structured output formats. 1407 | Summarize and answer questions based on both the visual and textual elements in a document. 1408 | Transcribe document content (e.g. to HTML), preserving layouts and formatting, for use in downstream applications. 1409 | Passing inline PDF data 1410 | You can pass inline PDF data in the request to generateContent. For PDF payloads under 20MB, you can choose between uploading base64 encoded documents or directly uploading locally stored files. 
1411 | 1412 | The following example shows you how to fetch a PDF from a URL and convert it to bytes for processing: 1413 | 1414 | 1415 | from google import genai 1416 | from google.genai import types 1417 | import httpx 1418 | 1419 | client = genai.Client() 1420 | 1421 | doc_url = "https://discovery.ucl.ac.uk/id/eprint/10089234/1/343019_3_art_0_py4t4l_convrt.pdf" 1422 | 1423 | # Retrieve and encode the PDF byte 1424 | doc_data = httpx.get(doc_url).content 1425 | 1426 | prompt = "Summarize this document" 1427 | response = client.models.generate_content( 1428 | model="gemini-2.5-flash", 1429 | contents=[ 1430 | types.Part.from_bytes( 1431 | data=doc_data, 1432 | mime_type='application/pdf', 1433 | ), 1434 | prompt]) 1435 | print(response.text) 1436 | 1437 | You can also read a PDF from a local file for processing: 1438 | 1439 | from google import genai 1440 | from google.genai import types 1441 | import pathlib 1442 | 1443 | client = genai.Client() 1444 | 1445 | # Retrieve and encode the PDF byte 1446 | filepath = pathlib.Path('file.pdf') 1447 | 1448 | prompt = "Summarize this document" 1449 | response = client.models.generate_content( 1450 | model="gemini-2.5-flash", 1451 | contents=[ 1452 | types.Part.from_bytes( 1453 | data=filepath.read_bytes(), 1454 | mime_type='application/pdf', 1455 | ), 1456 | prompt]) 1457 | print(response.text) 1458 | 1459 | Uploading PDFs using the File API 1460 | You can use the File API to upload larger documents. Always use the File API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20MB. 1461 | 1462 | Note: The File API lets you store up to 50MB of PDF files. Files are stored for 48 hours. You can access them in that period with your API key, but you can't download them from the API. The File API is available at no cost in all regions where the Gemini API is available. 1463 | Call media.upload to upload a file using the File API. The following code uploads a document file and then uses the file in a call to models.generateContent. 
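For reference, here is a minimal sketch of that basic flow (the local file name example.pdf is only a placeholder):

from google import genai

client = genai.Client()

# Upload the document with the File API (placeholder file name)
sample_doc = client.files.upload(file="example.pdf")

# Reference the uploaded file in a generateContent call
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[sample_doc, "Summarize this document"],
)
print(response.text)

The following subsections apply the same pattern to large PDFs fetched from URLs and to large PDFs stored locally.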
1464 | 1465 | 1466 | Large PDFs from URLs 1467 | Use the File API to simplify uploading and processing large PDF files from URLs: 1468 | 1469 | from google import genai 1470 | from google.genai import types 1471 | import io 1472 | import httpx 1473 | 1474 | client = genai.Client() 1475 | 1476 | long_context_pdf_path = "https://www.nasa.gov/wp-content/uploads/static/history/alsj/a17/A17_FlightPlan.pdf" 1477 | 1478 | # Retrieve and upload the PDF using the File API 1479 | doc_io = io.BytesIO(httpx.get(long_context_pdf_path).content) 1480 | 1481 | sample_doc = client.files.upload( 1482 | # You can pass a path or a file-like object here 1483 | file=doc_io, 1484 | config=dict( 1485 | mime_type='application/pdf') 1486 | ) 1487 | 1488 | prompt = "Summarize this document" 1489 | 1490 | response = client.models.generate_content( 1491 | model="gemini-2.5-flash", 1492 | contents=[sample_doc, prompt]) 1493 | print(response.text) 1494 | 1495 | 1496 | Large PDFs stored locally 1497 | 1498 | from google import genai 1499 | from google.genai import types 1500 | import pathlib 1501 | import httpx 1502 | 1503 | client = genai.Client() 1504 | 1505 | # Retrieve and encode the PDF byte 1506 | file_path = pathlib.Path('large_file.pdf') 1507 | 1508 | # Upload the PDF using the File API 1509 | sample_file = client.files.upload( 1510 | file=file_path, 1511 | ) 1512 | 1513 | prompt="Summarize this document" 1514 | 1515 | response = client.models.generate_content( 1516 | model="gemini-2.5-flash", 1517 | contents=[sample_file, "Summarize this document"]) 1518 | print(response.text) 1519 | 1520 | 1521 | Passing multiple PDFs 1522 | The Gemini API is capable of processing multiple PDF documents (up to 1000 pages) in a single request, as long as the combined size of the documents and the text prompt stays within the model's context window. 1523 | 1524 | 1525 | from google import genai 1526 | import io 1527 | import httpx 1528 | 1529 | client = genai.Client() 1530 | 1531 | doc_url_1 = "https://arxiv.org/pdf/2312.11805" 1532 | doc_url_2 = "https://arxiv.org/pdf/2403.05530" 1533 | 1534 | # Retrieve and upload both PDFs using the File API 1535 | doc_data_1 = io.BytesIO(httpx.get(doc_url_1).content) 1536 | doc_data_2 = io.BytesIO(httpx.get(doc_url_2).content) 1537 | 1538 | sample_pdf_1 = client.files.upload( 1539 | file=doc_data_1, 1540 | config=dict(mime_type='application/pdf') 1541 | ) 1542 | sample_pdf_2 = client.files.upload( 1543 | file=doc_data_2, 1544 | config=dict(mime_type='application/pdf') 1545 | ) 1546 | 1547 | prompt = "What is the difference between each of the main benchmarks between these two papers? Output these in a table." 1548 | 1549 | response = client.models.generate_content( 1550 | model="gemini-2.5-flash", 1551 | contents=[sample_pdf_1, sample_pdf_2, prompt]) 1552 | print(response.text) 1553 | 1554 | 1555 | Technical details 1556 | Gemini supports a maximum of 1,000 document pages. Each document page is equivalent to 258 tokens. 1557 | 1558 | While there are no specific limits to the number of pixels in a document besides the model's context window, larger pages are scaled down to a maximum resolution of 3072x3072 while preserving their original aspect ratio, while smaller pages are scaled up to 768x768 pixels. There is no cost reduction for pages at lower sizes, other than bandwidth, or performance improvement for pages at higher resolution. 1559 | 1560 | Document types 1561 | Technically, you can pass other MIME types for document understanding, like TXT, Markdown, HTML, XML, etc. 
However, document vision only meaningfully understands PDFs. Other types will be extracted as pure text, and the model won't be able to interpret what we see in the rendering of those files. Any file-type specifics like charts, diagrams, HTML tags, Markdown formatting, etc., will be lost. 1562 | 1563 | Best practices 1564 | For best results: 1565 | 1566 | Rotate pages to the correct orientation before uploading. 1567 | Avoid blurry pages. 1568 | If using a single page, place the text prompt after the page. 1569 | 1570 | 1571 | 1572 | 1573 | Image understanding 1574 | 1575 | Gemini models are built to be multimodal from the ground up, unlocking a wide range of image processing and computer vision tasks including but not limited to image captioning, classification, and visual question answering without having to train specialized ML models. 1576 | 1577 | Tip: In addition to their general multimodal capabilities, Gemini models (2.0 and newer) offer improved accuracy for specific use cases like object detection and segmentation, through additional training. See the Capabilities section for more details. 1578 | Passing images to Gemini 1579 | You can provide images as input to Gemini using two methods: 1580 | 1581 | Passing inline image data: Ideal for smaller files (total request size less than 20MB, including prompts). 1582 | Uploading images using the File API: Recommended for larger files or for reusing images across multiple requests. 1583 | Passing inline image data 1584 | You can pass inline image data in the request to generateContent. You can provide image data as Base64 encoded strings or by reading local files directly (depending on the language). 1585 | 1586 | The following example shows how to read an image from a local file and pass it to generateContent API for processing. 1587 | 1588 | 1589 | from google.genai import types 1590 | 1591 | with open('path/to/small-sample.jpg', 'rb') as f: 1592 | image_bytes = f.read() 1593 | 1594 | response = client.models.generate_content( 1595 | model='gemini-2.5-flash', 1596 | contents=[ 1597 | types.Part.from_bytes( 1598 | data=image_bytes, 1599 | mime_type='image/jpeg', 1600 | ), 1601 | 'Caption this image.' 1602 | ] 1603 | ) 1604 | 1605 | print(response.text) 1606 | 1607 | You can also fetch an image from a URL, convert it to bytes, and pass it to generateContent as shown in the following examples. 1608 | 1609 | from google import genai 1610 | from google.genai import types 1611 | 1612 | import requests 1613 | 1614 | image_path = "https://goo.gle/instrument-img" 1615 | image_bytes = requests.get(image_path).content 1616 | image = types.Part.from_bytes( 1617 | data=image_bytes, mime_type="image/jpeg" 1618 | ) 1619 | 1620 | client = genai.Client() 1621 | 1622 | response = client.models.generate_content( 1623 | model="gemini-2.5-flash", 1624 | contents=["What is this image?", image], 1625 | ) 1626 | 1627 | print(response.text) 1628 | 1629 | 1630 | Note: Inline image data limits your total request size (text prompts, system instructions, and inline bytes) to 20MB. For larger requests, upload image files using the File API. Files API is also more efficient for scenarios that use the same image repeatedly. 1631 | Uploading images using the File API 1632 | For large files or to be able to use the same image file repeatedly, use the Files API. The following code uploads an image file and then uses the file in a call to generateContent. See the Files API guide for more information and examples. 
1633 | 1634 | from google import genai 1635 | 1636 | client = genai.Client() 1637 | 1638 | my_file = client.files.upload(file="path/to/sample.jpg") 1639 | 1640 | response = client.models.generate_content( 1641 | model="gemini-2.5-flash", 1642 | contents=[my_file, "Caption this image."], 1643 | ) 1644 | 1645 | print(response.text) 1646 | 1647 | 1648 | Prompting with multiple images 1649 | You can provide multiple images in a single prompt by including multiple image Part objects in the contents array. These can be a mix of inline data (local files or URLs) and File API references. 1650 | 1651 | 1652 | from google import genai 1653 | from google.genai import types 1654 | 1655 | client = genai.Client() 1656 | 1657 | # Upload the first image 1658 | image1_path = "path/to/image1.jpg" 1659 | uploaded_file = client.files.upload(file=image1_path) 1660 | 1661 | # Prepare the second image as inline data 1662 | image2_path = "path/to/image2.png" 1663 | with open(image2_path, 'rb') as f: 1664 | img2_bytes = f.read() 1665 | 1666 | # Create the prompt with text and multiple images 1667 | response = client.models.generate_content( 1668 | 1669 | model="gemini-2.5-flash", 1670 | contents=[ 1671 | "What is different between these two images?", 1672 | uploaded_file, # Use the uploaded file reference 1673 | types.Part.from_bytes( 1674 | data=img2_bytes, 1675 | mime_type='image/png' 1676 | ) 1677 | ] 1678 | ) 1679 | 1680 | print(response.text) 1681 | 1682 | 1683 | Object detection 1684 | From Gemini 2.0 onwards, models are further trained to detect objects in an image and get their bounding box coordinates. The coordinates, relative to image dimensions, scale to [0, 1000]. You need to descale these coordinates based on your original image size. 1685 | 1686 | 1687 | from google import genai 1688 | from google.genai import types 1689 | from PIL import Image 1690 | import json 1691 | 1692 | client = genai.Client() 1693 | prompt = "Detect the all of the prominent items in the image. The box_2d should be [ymin, xmin, ymax, xmax] normalized to 0-1000." 1694 | 1695 | image = Image.open("/path/to/image.png") 1696 | 1697 | config = types.GenerateContentConfig( 1698 | response_mime_type="application/json" 1699 | ) 1700 | 1701 | response = client.models.generate_content(model="gemini-2.5-flash", 1702 | contents=[image, prompt], 1703 | config=config 1704 | ) 1705 | 1706 | width, height = image.size 1707 | bounding_boxes = json.loads(response.text) 1708 | 1709 | converted_bounding_boxes = [] 1710 | for bounding_box in bounding_boxes: 1711 | abs_y1 = int(bounding_box["box_2d"][0]/1000 * height) 1712 | abs_x1 = int(bounding_box["box_2d"][1]/1000 * width) 1713 | abs_y2 = int(bounding_box["box_2d"][2]/1000 * height) 1714 | abs_x2 = int(bounding_box["box_2d"][3]/1000 * width) 1715 | converted_bounding_boxes.append([abs_x1, abs_y1, abs_x2, abs_y2]) 1716 | 1717 | print("Image size: ", width, height) 1718 | print("Bounding boxes:", converted_bounding_boxes) 1719 | 1720 | 1721 | 1722 | 1723 | Segmentation 1724 | Starting with Gemini 2.5, models not only detect items but also segment them and provide their contour masks. 1725 | 1726 | The model predicts a JSON list, where each item represents a segmentation mask. Each item has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with normalized coordinates between 0 and 1000, a label ("label") that identifies the object, and finally the segmentation mask inside the bounding box, as base64 encoded png that is a probability map with values between 0 and 255. 
The mask needs to be resized to match the bounding box dimensions, then binarized at your confidence threshold (127 for the midpoint). 1727 | 1728 | Note: For better results, disable thinking by setting the thinking budget to 0. See code sample below for an example 1729 | 1730 | 1731 | from google import genai 1732 | from google.genai import types 1733 | from PIL import Image, ImageDraw 1734 | import io 1735 | import base64 1736 | import json 1737 | import numpy as np 1738 | import os 1739 | 1740 | client = genai.Client() 1741 | 1742 | def parse_json(json_output: str): 1743 | # Parsing out the markdown fencing 1744 | lines = json_output.splitlines() 1745 | for i, line in enumerate(lines): 1746 | if line == "```json": 1747 | json_output = "\n".join(lines[i+1:]) # Remove everything before "```json" 1748 | output = json_output.split("```")[0] # Remove everything after the closing "```" 1749 | break # Exit the loop once "```json" is found 1750 | return json_output 1751 | 1752 | def extract_segmentation_masks(image_path: str, output_dir: str = "segmentation_outputs"): 1753 | # Load and resize image 1754 | im = Image.open(image_path) 1755 | im.thumbnail([1024, 1024], Image.Resampling.LANCZOS) 1756 | 1757 | prompt = """ 1758 | Give the segmentation masks for the wooden and glass items. 1759 | Output a JSON list of segmentation masks where each entry contains the 2D 1760 | bounding box in the key "box_2d", the segmentation mask in key "mask", and 1761 | the text label in the key "label". Use descriptive labels. 1762 | """ 1763 | 1764 | config = types.GenerateContentConfig( 1765 | thinking_config=types.ThinkingConfig(thinking_budget=0) # set thinking_budget to 0 for better results in object detection 1766 | ) 1767 | 1768 | response = client.models.generate_content( 1769 | model="gemini-2.5-flash", 1770 | contents=[prompt, im], # Pillow images can be directly passed as inputs (which will be converted by the SDK) 1771 | config=config 1772 | ) 1773 | 1774 | # Parse JSON response 1775 | items = json.loads(parse_json(response.text)) 1776 | 1777 | # Create output directory 1778 | os.makedirs(output_dir, exist_ok=True) 1779 | 1780 | # Process each mask 1781 | for i, item in enumerate(items): 1782 | # Get bounding box coordinates 1783 | box = item["box_2d"] 1784 | y0 = int(box[0] / 1000 * im.size[1]) 1785 | x0 = int(box[1] / 1000 * im.size[0]) 1786 | y1 = int(box[2] / 1000 * im.size[1]) 1787 | x1 = int(box[3] / 1000 * im.size[0]) 1788 | 1789 | # Skip invalid boxes 1790 | if y0 >= y1 or x0 >= x1: 1791 | continue 1792 | 1793 | # Process mask 1794 | png_str = item["mask"] 1795 | if not png_str.startswith("data:image/png;base64,"): 1796 | continue 1797 | 1798 | # Remove prefix 1799 | png_str = png_str.removeprefix("data:image/png;base64,") 1800 | mask_data = base64.b64decode(png_str) 1801 | mask = Image.open(io.BytesIO(mask_data)) 1802 | 1803 | # Resize mask to match bounding box 1804 | mask = mask.resize((x1 - x0, y1 - y0), Image.Resampling.BILINEAR) 1805 | 1806 | # Convert mask to numpy array for processing 1807 | mask_array = np.array(mask) 1808 | 1809 | # Create overlay for this mask 1810 | overlay = Image.new('RGBA', im.size, (0, 0, 0, 0)) 1811 | overlay_draw = ImageDraw.Draw(overlay) 1812 | 1813 | # Create overlay for the mask 1814 | color = (255, 255, 255, 200) 1815 | for y in range(y0, y1): 1816 | for x in range(x0, x1): 1817 | if mask_array[y - y0, x - x0] > 128: # Threshold for mask 1818 | overlay_draw.point((x, y), fill=color) 1819 | 1820 | # Save individual mask and its overlay 1821 | mask_filename 
= f"{item['label']}_{i}_mask.png" 1822 | overlay_filename = f"{item['label']}_{i}_overlay.png" 1823 | 1824 | mask.save(os.path.join(output_dir, mask_filename)) 1825 | 1826 | # Create and save overlay 1827 | composite = Image.alpha_composite(im.convert('RGBA'), overlay) 1828 | composite.save(os.path.join(output_dir, overlay_filename)) 1829 | print(f"Saved mask and overlay for {item['label']} to {output_dir}") 1830 | 1831 | # Example usage 1832 | if __name__ == "__main__": 1833 | extract_segmentation_masks("path/to/image.png") 1834 | 1835 | 1836 | 1837 | 1838 | Video understanding 1839 | 1840 | Gemini models can process videos, enabling many frontier developer use cases that would have historically required domain specific models. Some of Gemini's vision capabilities include the ability to: 1841 | 1842 | Describe, segment, and extract information from videos 1843 | Answer questions about video content 1844 | Refer to specific timestamps within a video 1845 | Gemini was built to be multimodal from the ground up and we continue to push the frontier of what is possible. This guide shows how to use the Gemini API to generate text responses based on video inputs. 1846 | 1847 | Video input 1848 | You can provide videos as input to Gemini in the following ways: 1849 | 1850 | Upload a video file using the File API before making a request to generateContent. Use this method for files larger than 20MB, videos longer than approximately 1 minute, or when you want to reuse the file across multiple requests. 1851 | Pass inline video data with the request to generateContent. Use this method for smaller files (<20MB) and shorter durations. 1852 | Include a YouTube URL directly in the prompt. 1853 | Upload a video file 1854 | You can use the Files API to upload a video file. Always use the Files API when the total request size (including the file, text prompt, system instructions, etc.) is larger than 20 MB, the video duration is significant, or if you intend to use the same video in multiple prompts. The File API accepts video file formats directly. 1855 | 1856 | The following code downloads the sample video, uploads it using the File API, waits for it to be processed, and then uses the file reference in a generateContent request. 1857 | 1858 | 1859 | from google import genai 1860 | 1861 | client = genai.Client() 1862 | 1863 | myfile = client.files.upload(file="path/to/sample.mp4") 1864 | 1865 | response = client.models.generate_content( 1866 | model="gemini-2.5-flash", contents=[myfile, "Summarize this video. Then create a quiz with an answer key based on the information in this video."] 1867 | ) 1868 | 1869 | print(response.text) 1870 | 1871 | 1872 | Pass video data inline 1873 | Instead of uploading a video file using the File API, you can pass smaller videos directly in the request to generateContent. This is suitable for shorter videos under 20MB total request size. 
1874 | 1875 | Here's an example of providing inline video data: 1876 | 1877 | 1878 | # Only for videos of size <20Mb 1879 | video_file_name = "/path/to/your/video.mp4" 1880 | video_bytes = open(video_file_name, 'rb').read() 1881 | 1882 | response = client.models.generate_content( 1883 | model='models/gemini-2.5-flash', 1884 | contents=types.Content( 1885 | parts=[ 1886 | types.Part( 1887 | inline_data=types.Blob(data=video_bytes, mime_type='video/mp4') 1888 | ), 1889 | types.Part(text='Please summarize the video in 3 sentences.') 1890 | ] 1891 | ) 1892 | ) 1893 | 1894 | 1895 | Include a YouTube URL 1896 | Preview: The YouTube URL feature is in preview and is available at no charge. Pricing and rate limits are likely to change. 1897 | The Gemini API and AI Studio support YouTube URLs as a file data Part. You can include a YouTube URL with a prompt asking the model to summarize, translate, or otherwise interact with the video content. 1898 | 1899 | Limitations: 1900 | 1901 | For the free tier, you can't upload more than 8 hours of YouTube video per day. 1902 | For the paid tier, there is no limit based on video length. 1903 | For models before 2.5, you can upload only 1 video per request. For models after 2.5, you can upload a maximum of 10 videos per request. 1904 | You can only upload public videos (not private or unlisted videos). 1905 | The following example shows how to include a YouTube URL with a prompt: 1906 | 1907 | 1908 | response = client.models.generate_content( 1909 | model='models/gemini-2.5-flash', 1910 | contents=types.Content( 1911 | parts=[ 1912 | types.Part( 1913 | file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=9hE5-98ZeCg') 1914 | ), 1915 | types.Part(text='Please summarize the video in 3 sentences.') 1916 | ] 1917 | ) 1918 | ) 1919 | 1920 | Refer to timestamps in the content 1921 | You can ask questions about specific points in time within the video using timestamps of the form MM:SS. 1922 | 1923 | prompt = "What are the examples given at 00:05 and 00:10 supposed to show us?" # Adjusted timestamps for the NASA video 1924 | 1925 | 1926 | Transcribe video and provide visual descriptions 1927 | The Gemini models can transcribe and provide visual descriptions of video content by processing both the audio track and visual frames. For visual descriptions, the model samples the video at a rate of 1 frame per second. This sampling rate may affect the level of detail in the descriptions, particularly for videos with rapidly changing visuals. 1928 | 1929 | 1930 | Customize video processing 1931 | You can customize video processing in the Gemini API by setting clipping intervals or providing custom frame rate sampling. 1932 | 1933 | Tip: Video clipping and frames per second (FPS) are supported by all models, but the quality is significantly higher from 2.5 series models. 1934 | Set clipping intervals 1935 | You can clip video by specifying videoMetadata with start and end offsets. 
1936 | 1937 | 1938 | response = client.models.generate_content( 1939 | model='models/gemini-2.5-flash', 1940 | contents=types.Content( 1941 | parts=[ 1942 | types.Part( 1943 | file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=XEzRZ35urlk'), 1944 | video_metadata=types.VideoMetadata( 1945 | start_offset='1250s', 1946 | end_offset='1570s' 1947 | ) 1948 | ), 1949 | types.Part(text='Please summarize the video in 3 sentences.') 1950 | ] 1951 | ) 1952 | ) 1953 | 1954 | 1955 | Set a custom frame rate 1956 | You can set custom frame rate sampling by passing an fps argument to videoMetadata. 1957 | 1958 | # Only for videos of size <20Mb 1959 | video_file_name = "/path/to/your/video.mp4" 1960 | video_bytes = open(video_file_name, 'rb').read() 1961 | 1962 | response = client.models.generate_content( 1963 | model='models/gemini-2.5-flash', 1964 | contents=types.Content( 1965 | parts=[ 1966 | types.Part( 1967 | inline_data=types.Blob( 1968 | data=video_bytes, 1969 | mime_type='video/mp4'), 1970 | video_metadata=types.VideoMetadata(fps=5) 1971 | ), 1972 | types.Part(text='Please summarize the video in 3 sentences.') 1973 | ] 1974 | ) 1975 | ) 1976 | 1977 | 1978 | Audio understanding 1979 | 1980 | Gemini can analyze and understand audio input, enabling use cases like the following: 1981 | 1982 | Describe, summarize, or answer questions about audio content. 1983 | Provide a transcription of the audio. 1984 | Analyze specific segments of the audio. 1985 | This guide shows you how to use the Gemini API to generate a text response to audio input. 1986 | 1987 | Before you begin 1988 | Before calling the Gemini API, ensure you have your SDK of choice installed, and a Gemini API key configured and ready to use. 1989 | 1990 | Input audio 1991 | You can provide audio data to Gemini in the following ways: 1992 | 1993 | Upload an audio file before making a request to generateContent. 1994 | Pass inline audio data with the request to generateContent. 1995 | Upload an audio file 1996 | You can use the Files API to upload an audio file. Always use the Files API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB. 1997 | 1998 | The following code uploads an audio file and then uses the file in a call to generateContent. 1999 | 2000 | 2001 | from google import genai 2002 | 2003 | client = genai.Client() 2004 | 2005 | myfile = client.files.upload(file="path/to/sample.mp3") 2006 | 2007 | response = client.models.generate_content( 2008 | model="gemini-2.5-flash", contents=["Describe this audio clip", myfile] 2009 | ) 2010 | 2011 | print(response.text) 2012 | 2013 | 2014 | 2015 | Code execution 2016 | 2017 | The Gemini API provides a code execution tool that enables the model to generate and run Python code. The model can then learn iteratively from the code execution results until it arrives at a final output. You can use code execution to build applications that benefit from code-based reasoning. For example, you can use code execution to solve equations or process text. You can also use the libraries included in the code execution environment to perform more specialized tasks. 2018 | 2019 | Gemini is only able to execute code in Python. You can still ask Gemini to generate code in another language, but the model can't use the code execution tool to run it. 2020 | 2021 | Enable code execution 2022 | To enable code execution, configure the code execution tool on the model. This allows the model to generate and run code. 
2023 | 2024 | from google import genai 2025 | from google.genai import types 2026 | 2027 | client = genai.Client() 2028 | 2029 | response = client.models.generate_content( 2030 | model="gemini-2.5-flash", 2031 | contents="What is the sum of the first 50 prime numbers? " 2032 | "Generate and run code for the calculation, and make sure you get all 50.", 2033 | config=types.GenerateContentConfig( 2034 | tools=[types.Tool(code_execution=types.ToolCodeExecution)] 2035 | ), 2036 | ) 2037 | 2038 | for part in response.candidates[0].content.parts: 2039 | if part.text is not None: 2040 | print(part.text) 2041 | if part.executable_code is not None: 2042 | print(part.executable_code.code) 2043 | if part.code_execution_result is not None: 2044 | print(part.code_execution_result.output) 2045 | 2046 | 2047 | Use code execution in chat 2048 | You can also use code execution as part of a chat. 2049 | 2050 | 2051 | from google import genai 2052 | from google.genai import types 2053 | 2054 | client = genai.Client() 2055 | 2056 | chat = client.chats.create( 2057 | model="gemini-2.5-flash", 2058 | config=types.GenerateContentConfig( 2059 | tools=[types.Tool(code_execution=types.ToolCodeExecution)] 2060 | ), 2061 | ) 2062 | 2063 | response = chat.send_message("I have a math question for you.") 2064 | print(response.text) 2065 | 2066 | response = chat.send_message( 2067 | "What is the sum of the first 50 prime numbers? " 2068 | "Generate and run code for the calculation, and make sure you get all 50." 2069 | ) 2070 | 2071 | for part in response.candidates[0].content.parts: 2072 | if part.text is not None: 2073 | print(part.text) 2074 | if part.executable_code is not None: 2075 | print(part.executable_code.code) 2076 | if part.code_execution_result is not None: 2077 | print(part.code_execution_result.output) 2078 | 2079 | 2080 | 2081 | 2082 | Supported libraries 2083 | The code execution environment includes the following libraries: 2084 | 2085 | attrs 2086 | chess 2087 | contourpy 2088 | fpdf 2089 | geopandas 2090 | imageio 2091 | jinja2 2092 | joblib 2093 | jsonschema 2094 | jsonschema-specifications 2095 | lxml 2096 | matplotlib 2097 | mpmath 2098 | numpy 2099 | opencv-python 2100 | openpyxl 2101 | packaging 2102 | pandas 2103 | pillow 2104 | protobuf 2105 | pylatex 2106 | pyparsing 2107 | PyPDF2 2108 | python-dateutil 2109 | python-docx 2110 | python-pptx 2111 | reportlab 2112 | scikit-learn 2113 | scipy 2114 | seaborn 2115 | six 2116 | striprtf 2117 | sympy 2118 | tabulate 2119 | tensorflow 2120 | toolz 2121 | xlrd 2122 | 2123 | 2124 | 2125 | URL context 2126 | 2127 | Experimental: The URL context tool is an experimental feature. 2128 | Using the URL context tool, you can provide Gemini with URLs as additional context for your prompt. The model can then retrieve content from the URLs and use that content to inform and shape its response. 2129 | 2130 | This tool is useful for tasks like the following: 2131 | 2132 | Extracting key data points or talking points from articles 2133 | Comparing information across multiple links 2134 | Synthesizing data from several sources 2135 | Answering questions based on the content of a specific page or pages 2136 | Analyzing content for specific purposes (like writing a job description or creating test questions) 2137 | This guide explains how to use the URL context tool in the Gemini API. 2138 | 2139 | Use URL context 2140 | You can use the URL context tool in two main ways, by itself or in conjunction with Grounding with Google Search. 
2141 | 2142 | URL Context Only 2143 | 2144 | You provide specific URLs that you want the model to analyze directly in your prompt. 2145 | 2146 | Example prompts: 2147 | 2148 | 2149 | Summarize this document: YOUR_URLs 2150 | 2151 | Extract the key features from the product description on this page: YOUR_URLs 2152 | Grounding with Google Search + URL Context 2153 | 2154 | You can also enable both URL context and Grounding with Google Search together. You can enter a prompt with or without URLs. The model may first search for relevant information and then use the URL context tool to read the content of the search results for a more in-depth understanding. 2155 | 2156 | Example prompts: 2157 | 2158 | 2159 | Give me a three-day event schedule based on YOUR_URL. Also let me know what needs to be taken care of considering weather and commute. 2160 | 2161 | Recommend 3 books for beginners to read to learn more about the latest YOUR_subject. 2162 | Code examples with URL context only 2163 | 2164 | 2165 | from google import genai 2166 | from google.genai.types import Tool, GenerateContentConfig, UrlContext 2167 | 2168 | client = genai.Client() 2169 | model_id = "gemini-2.5-flash" 2170 | 2171 | url_context_tool = Tool( 2172 | url_context=UrlContext() 2173 | ) 2174 | 2175 | response = client.models.generate_content( 2176 | model=model_id, 2177 | contents="Compare recipes from YOUR_URL1 and YOUR_URL2", 2178 | config=GenerateContentConfig( 2179 | tools=[url_context_tool], 2180 | response_modalities=["TEXT"], 2181 | ) 2182 | ) 2183 | 2184 | for each in response.candidates[0].content.parts: 2185 | print(each.text) 2186 | # get URLs retrieved for context 2187 | print(response.candidates[0].url_context_metadata) 2188 | 2189 | 2190 | Code examples with Grounding with Google Search 2191 | 2192 | 2193 | from google import genai 2194 | from google.genai.types import Tool, GenerateContentConfig, GoogleSearch, UrlContext 2195 | 2196 | client = genai.Client() 2197 | model_id = "gemini-2.5-flash" 2198 | 2199 | tools = [] 2200 | tools.append(Tool(url_context=UrlContext())) 2201 | tools.append(Tool(google_search=GoogleSearch())) 2202 | 2203 | response = client.models.generate_content( 2204 | model=model_id, 2205 | contents="Give me a three-day event schedule based on YOUR_URL. Also let me know what needs to be taken care of considering weather and commute.", 2206 | config=GenerateContentConfig( 2207 | tools=tools, 2208 | response_modalities=["TEXT"], 2209 | ) 2210 | ) 2211 | 2212 | for each in response.candidates[0].content.parts: 2213 | print(each.text) 2214 | # get URLs retrieved for context 2215 | print(response.candidates[0].url_context_metadata) 2216 | 2217 | 2218 | 2219 | Limitations 2220 | The tool will consume up to 20 URLs per request for analysis. 2221 | For best results during the experimental phase, use the tool on standard web pages rather than multimedia content such as YouTube videos. 2222 | During the experimental phase, the tool is free to use. Billing will come later. 2223 | The experimental release has the following quotas: 2224 | 2225 | 1500 queries per day per project for requests made through the Gemini API 2226 | 100 queries per day per user in Google AI Studio 2227 | 2228 | 2229 | 2230 | 2231 | 2232 | Grounding with Google Search 2233 | 2234 | Grounding with Google Search connects the Gemini model to real-time web content and works with all available languages. This allows Gemini to provide more accurate answers and cite verifiable sources beyond its knowledge cutoff.
2235 | 2236 | Grounding helps you build applications that can: 2237 | 2238 | Increase factual accuracy: Reduce model hallucinations by basing responses on real-world information. 2239 | Access real-time information: Answer questions about recent events and topics. 2240 | Provide citations: Build user trust by showing the sources for the model's claims. 2241 | 2242 | 2243 | 2244 | from google import genai 2245 | from google.genai import types 2246 | 2247 | # Configure the client 2248 | client = genai.Client() 2249 | 2250 | # Define the grounding tool 2251 | grounding_tool = types.Tool( 2252 | google_search=types.GoogleSearch() 2253 | ) 2254 | 2255 | # Configure generation settings 2256 | config = types.GenerateContentConfig( 2257 | tools=[grounding_tool] 2258 | ) 2259 | 2260 | # Make the request 2261 | response = client.models.generate_content( 2262 | model="gemini-2.5-flash", 2263 | contents="Who won the euro 2024?", 2264 | config=config, 2265 | ) 2266 | 2267 | # Print the grounded response 2268 | print(response.text) 2269 | 2270 | 2271 | 2272 | How grounding with Google Search works 2273 | When you enable the google_search tool, the model handles the entire workflow of searching, processing, and citing information automatically. 2274 | 2275 | grounding-overview 2276 | 2277 | User Prompt: Your application sends a user's prompt to the Gemini API with the google_search tool enabled. 2278 | Prompt Analysis: The model analyzes the prompt and determines if a Google Search can improve the answer. 2279 | Google Search: If needed, the model automatically generates one or multiple search queries and executes them. 2280 | Search Results Processing: The model processes the search results, synthesizes the information, and formulates a response. 2281 | Grounded Response: The API returns a final, user-friendly response that is grounded in the search results. This response includes the model's text answer and groundingMetadata with the search queries, web results, and citations. 2282 | Understanding the Grounding Response 2283 | When a response is successfully grounded, the response includes a groundingMetadata field. This structured data is essential for verifying claims and building a rich citation experience in your application. 2284 | 2285 | 2286 | { 2287 | "candidates": [ 2288 | { 2289 | "content": { 2290 | "parts": [ 2291 | { 2292 | "text": "Spain won Euro 2024, defeating England 2-1 in the final. This victory marks Spain's record fourth European Championship title." 2293 | } 2294 | ], 2295 | "role": "model" 2296 | }, 2297 | "groundingMetadata": { 2298 | "webSearchQueries": [ 2299 | "UEFA Euro 2024 winner", 2300 | "who won euro 2024" 2301 | ], 2302 | "searchEntryPoint": { 2303 | "renderedContent": "" 2304 | }, 2305 | "groundingChunks": [ 2306 | {"web": {"uri": "https://vertexaisearch.cloud.google.com.....", "title": "aljazeera.com"}}, 2307 | {"web": {"uri": "https://vertexaisearch.cloud.google.com.....", "title": "uefa.com"}} 2308 | ], 2309 | "groundingSupports": [ 2310 | { 2311 | "segment": {"startIndex": 0, "endIndex": 85, "text": "Spain won Euro 2024, defeatin..."}, 2312 | "groundingChunkIndices": [0] 2313 | }, 2314 | { 2315 | "segment": {"startIndex": 86, "endIndex": 210, "text": "This victory marks Spain's..."}, 2316 | "groundingChunkIndices": [0, 1] 2317 | } 2318 | ] 2319 | } 2320 | } 2321 | ] 2322 | } 2323 | 2324 | 2325 | 2326 | The Gemini API returns the following information with the groundingMetadata: 2327 | 2328 | webSearchQueries : Array of the search queries used. 
This is useful for debugging and understanding the model's reasoning process. 2329 | searchEntryPoint : Contains the HTML and CSS to render the required Search Suggestions. Full usage requirements are detailed in the Terms of Service. 2330 | groundingChunks : Array of objects containing the web sources (uri and title). 2331 | groundingSupports : Array of chunks to connect model response text to the sources in groundingChunks. Each chunk links a text segment (defined by startIndex and endIndex) to one or more groundingChunkIndices. This is the key to building inline citations. 2332 | Grounding with Google Search can also be used in combination with the URL context tool to ground responses in both public web data and the specific URLs you provide. 2333 | 2334 | Attributing Sources with inline Citations 2335 | The API returns structured citation data, giving you complete control over how you display sources in your user interface. You can use the groundingSupports and groundingChunks fields to link the model's statements directly to their sources. Here is a common pattern for processing the metadata to create a response with inline, clickable citations. 2336 | 2337 | 2338 | def add_citations(response): 2339 | text = response.text 2340 | supports = response.candidates[0].grounding_metadata.grounding_supports 2341 | chunks = response.candidates[0].grounding_metadata.grounding_chunks 2342 | 2343 | # Sort supports by end_index in descending order to avoid shifting issues when inserting. 2344 | sorted_supports = sorted(supports, key=lambda s: s.segment.end_index, reverse=True) 2345 | 2346 | for support in sorted_supports: 2347 | end_index = support.segment.end_index 2348 | if support.grounding_chunk_indices: 2349 | # Create citation string like [1](link1)[2](link2) 2350 | citation_links = [] 2351 | for i in support.grounding_chunk_indices: 2352 | if i < len(chunks): 2353 | uri = chunks[i].web.uri 2354 | citation_links.append(f"[{i + 1}]({uri})") 2355 | 2356 | citation_string = ", ".join(citation_links) 2357 | text = text[:end_index] + citation_string + text[end_index:] 2358 | 2359 | return text 2360 | 2361 | # Assuming response with grounding metadata 2362 | text_with_citations = add_citations(response) 2363 | print(text_with_citations) 2364 | 2365 | 2366 | 2367 | 2368 | 2369 | 2370 | 2371 | Get started with Live API 2372 | 2373 | Preview: The Live API is in preview. 2374 | The Live API enables low-latency, real-time voice and video interactions with Gemini. It processes continuous streams of audio, video, or text to deliver immediate, human-like spoken responses, creating a natural conversational experience for your users. 2375 | 2376 | Live API Overview 2377 | 2378 | Live API offers a comprehensive set of features such as Voice Activity Detection, tool use and function calling, session management (for managing long running conversations) and ephemeral tokens (for secure client-sided authentication). 2379 | 2380 | This page gets you up and running with examples and basic code samples. 2381 | 2382 | Example applications 2383 | Check out the following example applications that illustrate how to use Live API for end-to-end use cases: 2384 | 2385 | Live audio starter app on AI Studio, using JavaScript libraries to connect to Live API and stream bidirectional audio through your microphone and speakers. 2386 | Live API Python cookbook using Pyaudio that connects to Live API. 2387 | Partner integrations 2388 | If you prefer a simpler development process, you can use Daily or LiveKit. 
These are third-party partner platforms that have already integrated the Gemini Live API over the WebRTC protocol to streamline the development of real-time audio and video applications. 2389 | 2390 | Before you begin building 2391 | There are two important decisions to make before you begin building with the Live API: choosing a model and choosing an implementation approach. 2392 | 2393 | Choose an audio generation architecture 2394 | If you're building an audio-based use case, your choice of model determines the audio generation architecture used to create the audio response: 2395 | 2396 | Native audio: This option provides the most natural and realistic-sounding speech and better multilingual performance. It also enables advanced features like affective (emotion-aware) dialogue, proactive audio (where the model can decide to ignore or respond to certain inputs), and "thinking". Native audio is supported by the following native audio models: 2397 | gemini-2.5-flash-preview-native-audio-dialog 2398 | gemini-2.5-flash-exp-native-audio-thinking-dialog 2399 | Half-cascade audio: This option uses a cascaded model architecture (native audio input and text-to-speech output). It offers better performance and reliability in production environments, especially with tool use. Half-cascaded audio is supported by the following models: 2400 | gemini-live-2.5-flash-preview 2401 | gemini-2.0-flash-live-001 2402 | Choose an implementation approach 2403 | When integrating with Live API, you'll need to choose one of the following implementation approaches: 2404 | 2405 | Server-to-server: Your backend connects to the Live API using WebSockets. Typically, your client sends stream data (audio, video, text) to your server, which then forwards it to the Live API. 2406 | Client-to-server: Your frontend code connects directly to the Live API using WebSockets to stream data, bypassing your backend. 2407 | Note: Client-to-server generally offers better performance for streaming audio and video, since it bypasses the need to send the stream to your backend first. It's also easier to set up since you don't need to implement a proxy that sends data from your client to your server and then your server to the API. However, for production environments, in order to mitigate security risks, we recommend using ephemeral tokens instead of standard API keys. 2408 | Get started 2409 | This example reads a WAV file, sends it in the correct format, and saves the received data as WAV file. 2410 | 2411 | You can send audio by converting it to 16-bit PCM, 16kHz, mono format, and you can receive audio by setting AUDIO as response modality. The output uses a sample rate of 24kHz. 
2412 | 2413 | 2414 | 2415 | # Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav 2416 | # Install helpers for converting files: pip install librosa soundfile 2417 | import asyncio 2418 | import io 2419 | from pathlib import Path 2420 | import wave 2421 | from google import genai 2422 | from google.genai import types 2423 | import soundfile as sf 2424 | import librosa 2425 | 2426 | client = genai.Client() 2427 | 2428 | # Half cascade model: 2429 | # model = "gemini-live-2.5-flash-preview" 2430 | 2431 | # Native audio output model: 2432 | model = "gemini-2.5-flash-preview-native-audio-dialog" 2433 | 2434 | config = { 2435 | "response_modalities": ["AUDIO"], 2436 | "system_instruction": "You are a helpful assistant and answer in a friendly tone.", 2437 | } 2438 | 2439 | async def main(): 2440 | async with client.aio.live.connect(model=model, config=config) as session: 2441 | 2442 | buffer = io.BytesIO() 2443 | y, sr = librosa.load("sample.wav", sr=16000) 2444 | sf.write(buffer, y, sr, format='RAW', subtype='PCM_16') 2445 | buffer.seek(0) 2446 | audio_bytes = buffer.read() 2447 | 2448 | # If already in correct format, you can use this: 2449 | # audio_bytes = Path("sample.pcm").read_bytes() 2450 | 2451 | await session.send_realtime_input( 2452 | audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000") 2453 | ) 2454 | 2455 | wf = wave.open("audio.wav", "wb") 2456 | wf.setnchannels(1) 2457 | wf.setsampwidth(2) 2458 | wf.setframerate(24000) # Output is 24kHz 2459 | 2460 | async for response in session.receive(): 2461 | if response.data is not None: 2462 | wf.writeframes(response.data) 2463 | 2464 | # Un-comment this code to print audio data info 2465 | # if response.server_content.model_turn is not None: 2466 | # print(response.server_content.model_turn.parts[0].inline_data.mime_type) 2467 | 2468 | wf.close() 2469 | 2470 | if __name__ == "__main__": 2471 | asyncio.run(main()) 2472 | 2473 | 2474 | 2475 | Establishing a connection 2476 | The following example shows how to create a connection with an API key: 2477 | 2478 | 2479 | import asyncio 2480 | from google import genai 2481 | 2482 | client = genai.Client() 2483 | 2484 | model = "gemini-live-2.5-flash-preview" 2485 | config = {"response_modalities": ["TEXT"]} 2486 | 2487 | async def main(): 2488 | async with client.aio.live.connect(model=model, config=config) as session: 2489 | print("Session started") 2490 | 2491 | if __name__ == "__main__": 2492 | asyncio.run(main()) 2493 | 2494 | 2495 | 2496 | 2497 | Sending and receiving audio 2498 | The most common audio example, audio-to-audio, is covered in the Getting started guide. 
2499 | 2500 | Here's an audio-to-text example that reads a WAV file, sends it in the correct format and receives text output: 2501 | 2502 | 2503 | # Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav 2504 | # Install helpers for converting files: pip install librosa soundfile 2505 | import asyncio 2506 | import io 2507 | from pathlib import Path 2508 | from google import genai 2509 | from google.genai import types 2510 | import soundfile as sf 2511 | import librosa 2512 | 2513 | client = genai.Client() 2514 | model = "gemini-live-2.5-flash-preview" 2515 | 2516 | config = {"response_modalities": ["TEXT"]} 2517 | 2518 | async def main(): 2519 | async with client.aio.live.connect(model=model, config=config) as session: 2520 | 2521 | buffer = io.BytesIO() 2522 | y, sr = librosa.load("sample.wav", sr=16000) 2523 | sf.write(buffer, y, sr, format='RAW', subtype='PCM_16') 2524 | buffer.seek(0) 2525 | audio_bytes = buffer.read() 2526 | 2527 | # If already in correct format, you can use this: 2528 | # audio_bytes = Path("sample.pcm").read_bytes() 2529 | 2530 | await session.send_realtime_input( 2531 | audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000") 2532 | ) 2533 | 2534 | async for response in session.receive(): 2535 | if response.text is not None: 2536 | print(response.text) 2537 | 2538 | if __name__ == "__main__": 2539 | asyncio.run(main()) 2540 | 2541 | 2542 | And here is a text-to-audio example. You can receive audio by setting AUDIO as response modality. This example saves the received data as WAV file: 2543 | 2544 | 2545 | 2546 | import asyncio 2547 | import wave 2548 | from google import genai 2549 | 2550 | client = genai.Client() 2551 | model = "gemini-live-2.5-flash-preview" 2552 | 2553 | config = {"response_modalities": ["AUDIO"]} 2554 | 2555 | async def main(): 2556 | async with client.aio.live.connect(model=model, config=config) as session: 2557 | wf = wave.open("audio.wav", "wb") 2558 | wf.setnchannels(1) 2559 | wf.setsampwidth(2) 2560 | wf.setframerate(24000) 2561 | 2562 | message = "Hello how are you?" 2563 | await session.send_client_content( 2564 | turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True 2565 | ) 2566 | 2567 | async for response in session.receive(): 2568 | if response.data is not None: 2569 | wf.writeframes(response.data) 2570 | 2571 | # Un-comment this code to print audio data info 2572 | # if response.server_content.model_turn is not None: 2573 | # print(response.server_content.model_turn.parts[0].inline_data.mime_type) 2574 | 2575 | wf.close() 2576 | 2577 | if __name__ == "__main__": 2578 | asyncio.run(main()) 2579 | 2580 | 2581 | 2582 | 2583 | 2584 | Context caching 2585 | 2586 | In a typical AI workflow, you might pass the same input tokens over and over to a model. The Gemini API offers two different caching mechanisms: 2587 | 2588 | Implicit caching (automatically enabled on Gemini 2.5 models, no cost saving guarantee) 2589 | Explicit caching (can be manually enabled on most models, cost saving guarantee) 2590 | Explicit caching is useful in cases where you want to guarantee cost savings, but with some added developer work. 2591 | 2592 | Implicit caching 2593 | Implicit caching is enabled by default for all Gemini 2.5 models. We automatically pass on cost savings if your request hits caches. There is nothing you need to do in order to enable this. It is effective as of May 8th, 2025. The minimum input token count for context caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro. 
2594 | 2595 | To increase the chance of an implicit cache hit: 2596 | 2597 | Try putting large and common contents at the beginning of your prompt 2598 | Try to send requests with similar prefix in a short amount of time 2599 | You can see the number of tokens which were cache hits in the response object's usage_metadata field. 2600 | 2601 | Explicit caching 2602 | Using the Gemini API explicit caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens for subsequent requests. At certain volumes, using cached tokens is lower cost than passing in the same corpus of tokens repeatedly. 2603 | 2604 | When you cache a set of tokens, you can choose how long you want the cache to exist before the tokens are automatically deleted. This caching duration is called the time to live (TTL). If not set, the TTL defaults to 1 hour. The cost for caching depends on the input token size and how long you want the tokens to persist. 2605 | 2606 | This section assumes that you've installed a Gemini SDK (or have curl installed) and that you've configured an API key, as shown in the quickstart. 2607 | 2608 | Generate content using a cache 2609 | The following example shows how to generate content using a cached system instruction and video file. 2610 | 2611 | 2612 | import os 2613 | import pathlib 2614 | import requests 2615 | import time 2616 | 2617 | from google import genai 2618 | from google.genai import types 2619 | 2620 | client = genai.Client() 2621 | 2622 | # Download video file 2623 | url = 'https://storage.googleapis.com/generativeai-downloads/data/SherlockJr._10min.mp4' 2624 | path_to_video_file = pathlib.Path('SherlockJr._10min.mp4') 2625 | if not path_to_video_file.exists(): 2626 | with path_to_video_file.open('wb') as wf: 2627 | response = requests.get(url, stream=True) 2628 | for chunk in response.iter_content(chunk_size=32768): 2629 | wf.write(chunk) 2630 | 2631 | # Upload the video using the Files API 2632 | video_file = client.files.upload(file=path_to_video_file) 2633 | 2634 | # Wait for the file to finish processing 2635 | while video_file.state.name == 'PROCESSING': 2636 | print('Waiting for video to be processed.') 2637 | time.sleep(2) 2638 | video_file = client.files.get(name=video_file.name) 2639 | 2640 | print(f'Video processing complete: {video_file.uri}') 2641 | 2642 | # You must use an explicit version suffix: "-flash-001", not just "-flash". 2643 | model='models/gemini-2.0-flash-001' 2644 | 2645 | # Create a cache with a 5 minute TTL 2646 | cache = client.caches.create( 2647 | model=model, 2648 | config=types.CreateCachedContentConfig( 2649 | display_name='sherlock jr movie', # used to identify the cache 2650 | system_instruction=( 2651 | 'You are an expert video analyzer, and your job is to answer ' 2652 | 'the user\'s query based on the video file you have access to.' 2653 | ), 2654 | contents=[video_file], 2655 | ttl="300s", 2656 | ) 2657 | ) 2658 | 2659 | # Construct a GenerativeModel which uses the created cache. 2660 | response = client.models.generate_content( 2661 | model = model, 2662 | contents= ( 2663 | 'Introduce different characters in the movie by describing ' 2664 | 'their personality, looks, and names. 
Also list the timestamps '
        'they were introduced for the first time.'),
    config=types.GenerateContentConfig(cached_content=cache.name)
)

print(response.usage_metadata)

# The output should look something like this:
#
# prompt_token_count: 696219
# cached_content_token_count: 696190
# candidates_token_count: 214
# total_token_count: 696433

print(response.text)



List caches
It's not possible to retrieve or view cached content, but you can retrieve cache metadata (name, model, display_name, usage_metadata, create_time, update_time, and expire_time).

To list metadata for all uploaded caches, use client.caches.list():


for cache in client.caches.list():
    print(cache)

To fetch the metadata for one cache object, if you know its name, use get:


client.caches.get(name=name)

Update a cache
You can set a new ttl or expire_time for a cache. Changing anything else about the cache isn't supported.

The following example shows how to update the ttl of a cache using client.caches.update().


from google import genai
from google.genai import types

client.caches.update(
    name = cache.name,
    config = types.UpdateCachedContentConfig(
        ttl='300s'
    )
)

To set the expiry time, pass expire_time, which accepts either a datetime object or an ISO-formatted datetime string (dt.isoformat(), like 2025-01-27T16:02:36.473528+00:00). Your time must include a time zone (datetime.utcnow() doesn't attach a time zone, while datetime.now(datetime.timezone.utc) does).


from google import genai
from google.genai import types
import datetime

# You must use a time zone-aware time.
in10min = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(minutes=10)

client.caches.update(
    name = cache.name,
    config = types.UpdateCachedContentConfig(
        expire_time=in10min
    )
)

Delete a cache
The caching service provides a delete operation for manually removing content from the cache. The following example shows how to delete a cache:


client.caches.delete(cache.name)

Explicit caching using the OpenAI library
If you're using an OpenAI library, you can enable explicit caching using the cached_content property on extra_body.

When to use explicit caching
Context caching is particularly well suited to scenarios where a substantial initial context is referenced repeatedly by shorter requests. Consider using context caching for use cases such as:

Chatbots with extensive system instructions
Repetitive analysis of lengthy video files
Recurring queries against large document sets
Frequent code repository analysis or bug fixing





Files API

The Gemini family of artificial intelligence (AI) models is built to handle various types of input data, including text, images, and audio. Since these models can handle more than one type or mode of data, the Gemini models are called multimodal models or explained as having multimodal capabilities.

This guide shows you how to work with media files using the Files API.
The basic operations are the same for audio files, images, videos, documents, and other supported file types.

For file prompting guidance, check out the File prompt guide section.

Upload a file
You can use the Files API to upload a media file. Always use the Files API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.

The following code uploads a file and then uses the file in a call to generateContent.


from google import genai

client = genai.Client()

myfile = client.files.upload(file="path/to/sample.mp3")

response = client.models.generate_content(
    model="gemini-2.5-flash", contents=["Describe this audio clip", myfile]
)

print(response.text)


Get metadata for a file
You can verify that the API successfully stored the uploaded file and get its metadata by calling files.get (for example, client.files.get(name=myfile.name)).




Batch Mode

The Gemini API's Batch Mode is designed to process large volumes of requests asynchronously at 50% of the standard cost. The target turnaround time is 24 hours, but in the majority of cases it is much quicker.

Use Batch Mode for large-scale, non-urgent tasks such as data pre-processing or running evaluations where an immediate response is not required.

Note: You can access Batch Mode via our Python libraries or the REST API. Support for other languages (including JavaScript/TypeScript) is coming soon.
Creating a batch job
You have two ways to submit your requests in Batch Mode:

Inline Requests: A list of GenerateContentRequest objects directly included in your batch creation request. This is suitable for smaller batches that keep the total request size under 20 MB. The output returned from the model is a list of inlineResponse objects.
Input File: A JSON Lines (JSONL) file where each line contains a complete GenerateContentRequest object. This method is recommended for larger requests. The output returned from the model is a JSONL file where each line is either a GenerateContentResponse or a status object.
Inline requests
For a small number of requests, you can directly embed the GenerateContentRequest objects within your BatchGenerateContentRequest. The following example calls the BatchGenerateContent method with inline requests:


from google import genai
from google.genai import types

client = genai.Client()

# A list of dictionaries, where each is a GenerateContentRequest
inline_requests = [
    {
        'contents': [{
            'parts': [{'text': 'Tell me a one-sentence joke.'}],
            'role': 'user'
        }]
    },
    {
        'contents': [{
            'parts': [{'text': 'Why is the sky blue?'}],
            'role': 'user'
        }]
    }
]

inline_batch_job = client.batches.create(
    model="models/gemini-2.5-flash",
    src=inline_requests,
    config={
        'display_name': "inlined-requests-job-1",
    },
)

print(f"Created batch job: {inline_batch_job.name}")


Input file
For larger batches, prepare a JSON Lines (JSONL) file in which each line holds a user-defined key and a complete request object. You can upload this file using the File API as shown in the following example. If you are working with multimodal input, you can reference other uploaded files within your JSONL file.
import json

from google import genai
from google.genai import types

client = genai.Client()

# Create a sample JSONL file
with open("my-batch-requests.jsonl", "w") as f:
    requests = [
        {"key": "request-1", "request": {"contents": [{"parts": [{"text": "Describe the process of photosynthesis."}]}]}},
        {"key": "request-2", "request": {"contents": [{"parts": [{"text": "What are the main ingredients in a Margherita pizza?"}]}]}}
    ]
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file to the File API
uploaded_file = client.files.upload(
    file='my-batch-requests.jsonl',
    config=types.UploadFileConfig(display_name='my-batch-requests', mime_type='jsonl')
)

print(f"Uploaded file: {uploaded_file.name}")



The following example calls the BatchGenerateContent method with the input file uploaded using the File API:


# Assumes `uploaded_file` is the file object from the previous step
file_batch_job = client.batches.create(
    model="gemini-2.5-flash",
    src=uploaded_file.name,
    config={
        'display_name': "file-upload-job-1",
    },
)

print(f"Created batch job: {file_batch_job.name}")


Monitoring job status
Use the operation name obtained when creating the batch job to poll its status. The state field of the batch job will indicate its current status. A batch job can be in one of the following states:

JOB_STATE_PENDING: The job has been created and is waiting to be processed by the service.
JOB_STATE_SUCCEEDED: The job completed successfully. You can now retrieve the results.
JOB_STATE_FAILED: The job failed. Check the error details for more information.
JOB_STATE_CANCELLED: The job was cancelled by the user.
You can poll the job status periodically to check for completion.


import time

# Use the name of the job you want to check
# e.g., inline_batch_job.name from the previous step
job_name = "YOUR_BATCH_JOB_NAME"  # (e.g. 'batches/your-batch-id')
batch_job = client.batches.get(name=job_name)

completed_states = set([
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
])

print(f"Polling status for job: {job_name}")
batch_job = client.batches.get(name=job_name)  # Initial get
while batch_job.state.name not in completed_states:
    print(f"Current state: {batch_job.state.name}")
    time.sleep(30)  # Wait for 30 seconds before polling again
    batch_job = client.batches.get(name=job_name)

print(f"Job finished with state: {batch_job.state.name}")
if batch_job.state.name == 'JOB_STATE_FAILED':
    print(f"Error: {batch_job.error}")


Retrieving results
Once the job status indicates your batch job has succeeded, the results are available in the response field.
2910 | 2911 | 2912 | import json 2913 | 2914 | # Use the name of the job you want to check 2915 | # e.g., inline_batch_job.name from the previous step 2916 | job_name = "YOUR_BATCH_JOB_NAME" 2917 | batch_job = client.batches.get(name=job_name) 2918 | 2919 | if batch_job.state.name == 'JOB_STATE_SUCCEEDED': 2920 | 2921 | # If batch job was created with a file 2922 | if batch_job.dest and batch_job.dest.file_name: 2923 | # Results are in a file 2924 | result_file_name = batch_job.dest.file_name 2925 | print(f"Results are in file: {result_file_name}") 2926 | 2927 | print("Downloading result file content...") 2928 | file_content = client.files.download(file=result_file_name) 2929 | # Process file_content (bytes) as needed 2930 | print(file_content.decode('utf-8')) 2931 | 2932 | # If batch job was created with inline request 2933 | elif batch_job.dest and batch_job.dest.inlined_responses: 2934 | # Results are inline 2935 | print("Results are inline:") 2936 | for i, inline_response in enumerate(batch_job.dest.inlined_responses): 2937 | print(f"Response {i+1}:") 2938 | if inline_response.response: 2939 | # Accessing response, structure may vary. 2940 | try: 2941 | print(inline_response.response.text) 2942 | except AttributeError: 2943 | print(inline_response.response) # Fallback 2944 | elif inline_response.error: 2945 | print(f"Error: {inline_response.error}") 2946 | else: 2947 | print("No results found (neither file nor inline).") 2948 | else: 2949 | print(f"Job did not succeed. Final state: {batch_job.state.name}") 2950 | if batch_job.error: 2951 | print(f"Error: {batch_job.error}") 2952 | 2953 | 2954 | 2955 | 2956 | Embeddings 2957 | 2958 | Note: gemini-embedding-001 is our newest text embedding model available in the Gemini API. 2959 | The Gemini API offers text embedding models to generate embeddings for words, phrases, sentences, and code. These foundational embeddings power advanced NLP tasks such as semantic search, classification, and clustering, providing more accurate, context-aware results than keyword-based approaches. 2960 | 2961 | Building Retrieval Augmented Generation (RAG) systems is a common use case for embeddings. Embeddings plays a key role in significantly enhancing model outputs with improved factual accuracy, coherence, and contextual richness. They efficiently retrieve relevant information from knowledge bases, represented by embeddings, which are then passed as additional context in the input prompt to language models, guiding it to generate more informed and accurate responses. 2962 | 2963 | Generating embeddings 2964 | Use the embedContent method to generate text embeddings: 2965 | 2966 | 2967 | from google import genai 2968 | 2969 | client = genai.Client() 2970 | 2971 | result = client.models.embed_content( 2972 | model="gemini-embedding-001", 2973 | contents="What is the meaning of life?") 2974 | 2975 | print(result.embeddings) 2976 | 2977 | 2978 | You can also generate embeddings for multiple chunks at once by passing them in as a list of strings. 2979 | 2980 | 2981 | from google import genai 2982 | 2983 | client = genai.Client() 2984 | 2985 | result = client.models.embed_content( 2986 | model="gemini-embedding-001", 2987 | contents= [ 2988 | "What is the meaning of life?", 2989 | "What is the purpose of existence?", 2990 | "How do I bake a cake?" 
2991 | ]) 2992 | 2993 | for embedding in result.embeddings: 2994 | print(embedding) 2995 | 2996 | 2997 | 2998 | Specify task type to improve performance 2999 | You can use embeddings for a wide range of tasks from classification to document search. Specifying the right task type helps optimize the embeddings for the intended relationships, maximizing accuracy and efficiency. For a complete list of supported task types, see the Supported task types table. 3000 | 3001 | The following example shows how you can use SEMANTIC_SIMILARITY to check how similar in meaning strings of texts are. 3002 | 3003 | 3004 | from google import genai 3005 | from google.genai import types 3006 | import numpy as np 3007 | from sklearn.metrics.pairwise import cosine_similarity 3008 | 3009 | client = genai.Client() 3010 | 3011 | texts = [ 3012 | "What is the meaning of life?", 3013 | "What is the purpose of existence?", 3014 | "How do I bake a cake?"] 3015 | 3016 | result = [ 3017 | np.array(e.values) for e in client.models.embed_content( 3018 | model="gemini-embedding-001", 3019 | contents=texts, 3020 | config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")).embeddings 3021 | ] 3022 | 3023 | # Calculate cosine similarity. Higher scores = greater semantic similarity. 3024 | 3025 | embeddings_matrix = np.array(result) 3026 | similarity_matrix = cosine_similarity(embeddings_matrix) 3027 | 3028 | for i, text1 in enumerate(texts): 3029 | for j in range(i + 1, len(texts)): 3030 | text2 = texts[j] 3031 | similarity = similarity_matrix[i, j] 3032 | print(f"Similarity between '{text1}' and '{text2}': {similarity:.4f}") 3033 | 3034 | 3035 | 3036 | 3037 | Supported task types 3038 | Task type Description Examples 3039 | SEMANTIC_SIMILARITY Embeddings optimized to assess text similarity. Recommendation systems, duplicate detection 3040 | CLASSIFICATION Embeddings optimized to classify texts according to preset labels. Sentiment analysis, spam detection 3041 | CLUSTERING Embeddings optimized to cluster texts based on their similarities. Document organization, market research, anomaly detection 3042 | RETRIEVAL_DOCUMENT Embeddings optimized for document search. Indexing articles, books, or web pages for search. 3043 | RETRIEVAL_QUERY Embeddings optimized for general search queries. Use RETRIEVAL_QUERY for queries; RETRIEVAL_DOCUMENT for documents to be retrieved. Custom search 3044 | CODE_RETRIEVAL_QUERY Embeddings optimized for retrieval of code blocks based on natural language queries. Use CODE_RETRIEVAL_QUERY for queries; RETRIEVAL_DOCUMENT for code blocks to be retrieved. Code suggestions and search 3045 | QUESTION_ANSWERING Embeddings for questions in a question-answering system, optimized for finding documents that answer the question. Use QUESTION_ANSWERING for questions; RETRIEVAL_DOCUMENT for documents to be retrieved. Chatbox 3046 | FACT_VERIFICATION Embeddings for statements that need to be verified, optimized for retrieving documents that contain evidence supporting or refuting the statement. Use FACT_VERIFICATION for the target text; RETRIEVAL_DOCUMENT for documents to be retrieved Automated fact-checking systems 3047 | Controlling Embedding Size 3048 | The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. 
You can choose to use the full 3072-dimensional embedding, or you can truncate it to a smaller size without losing quality to save storage space. For best quality, we recommend using the first 768 and 1536. 3049 | 3050 | By using the output_dimensionality parameter, users can control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. 3051 | 3052 | 3053 | from google import genai 3054 | from google.genai import types 3055 | 3056 | client = genai.Client() 3057 | 3058 | result = client.models.embed_content( 3059 | model="gemini-embedding-001", 3060 | contents="What is the meaning of life?", 3061 | config=types.EmbedContentConfig(output_dimensionality=768) 3062 | ) 3063 | 3064 | [embedding_obj] = result.embeddings 3065 | embedding_length = len(embedding_obj.values) 3066 | 3067 | print(f"Length of embedding: {embedding_length}") 3068 | 3069 | 3070 | Safety settings 3071 | 3072 | The Gemini API provides safety settings that you can adjust during the prototyping stage to determine if your application requires more or less restrictive safety configuration. You can adjust these settings across five filter categories to restrict or allow certain types of content. 3073 | 3074 | This guide covers how the Gemini API handles safety settings and filtering and how you can change the safety settings for your application. 3075 | 3076 | Note: Applications that use less restrictive safety settings may be subject to review. See the Terms of Service for more information. 3077 | Safety filters 3078 | The Gemini API's adjustable safety filters cover the following categories: 3079 | 3080 | Category Description 3081 | Harassment Negative or harmful comments targeting identity and/or protected attributes. 3082 | Hate speech Content that is rude, disrespectful, or profane. 3083 | Sexually explicit Contains references to sexual acts or other lewd content. 3084 | Dangerous Promotes, facilitates, or encourages harmful acts. 3085 | Civic integrity Election-related queries. 3086 | These categories are defined in HarmCategory. The Gemini models only support HARM_CATEGORY_HARASSMENT, HARM_CATEGORY_HATE_SPEECH, HARM_CATEGORY_SEXUALLY_EXPLICIT, HARM_CATEGORY_DANGEROUS_CONTENT, and HARM_CATEGORY_CIVIC_INTEGRITY. All other categories are used only by PaLM 2 (Legacy) models. 3087 | You can use these filters to adjust what's appropriate for your use case. For example, if you're building video game dialogue, you may deem it acceptable to allow more content that's rated as Dangerous due to the nature of the game. 3088 | 3089 | In addition to the adjustable safety filters, the Gemini API has built-in protections against core harms, such as content that endangers child safety. These types of harm are always blocked and cannot be adjusted. 3090 | 3091 | Content safety filtering level 3092 | The Gemini API categorizes the probability level of content being unsafe as HIGH, MEDIUM, LOW, or NEGLIGIBLE. 3093 | 3094 | The Gemini API blocks content based on the probability of content being unsafe and not the severity. This is important to consider because some content can have low probability of being unsafe even though the severity of harm could still be high. For example, comparing the sentences: 3095 | 3096 | The robot punched me. 3097 | The robot slashed me up. 
The first sentence might result in a higher probability of being unsafe, but you might consider the second sentence to be a higher severity in terms of violence. Given this, it is important that you carefully test and consider what level of blocking is appropriate to support your key use cases while minimizing harm to end users.

Safety filtering per request
You can adjust the safety settings for each request you make to the API. When you make a request, the content is analyzed and assigned a safety rating. The safety rating includes the category and the probability of the harm classification. For example, if the content was blocked due to the harassment category having a high probability, the safety rating returned would have category equal to HARASSMENT and harm probability set to HIGH.

By default, safety settings block content (including prompts) with medium or higher probability of being unsafe across any filter. This baseline safety is designed to work for most use cases, so you should only adjust your safety settings if it's consistently required for your application.

The following table describes the block settings you can adjust for each category. For example, if you set the block setting to Block few for the Hate speech category, everything that has a high probability of being hate speech content is blocked, but anything with a lower probability is allowed.

Threshold (Google AI Studio) Threshold (API) Description
Block none BLOCK_NONE Always show regardless of probability of unsafe content
Block few BLOCK_ONLY_HIGH Block when high probability of unsafe content
Block some BLOCK_MEDIUM_AND_ABOVE Block when medium or high probability of unsafe content
Block most BLOCK_LOW_AND_ABOVE Block when low, medium or high probability of unsafe content
N/A HARM_BLOCK_THRESHOLD_UNSPECIFIED Threshold is unspecified, block using default threshold
If the threshold is not set, the default block threshold is Block none (for gemini-1.5-pro-002 and gemini-1.5-flash-002 and all newer stable GA models) or Block some (in all other models) for all categories except the Civic integrity category.

The default block threshold for the Civic integrity category is Block none (for gemini-2.0-flash-001 aliased as gemini-2.0-flash, gemini-2.0-pro-exp-02-05, and gemini-2.0-flash-lite) both for Google AI Studio and the Gemini API, and Block most for all other models in Google AI Studio only.

You can set these settings for each request you make to the generative service; a short code sketch follows at the end of this section. See the HarmBlockThreshold API reference for details.

Safety feedback
generateContent returns a GenerateContentResponse which includes safety feedback.

Prompt feedback is included in promptFeedback. If promptFeedback.blockReason is set, then the content of the prompt was blocked.

Response candidate feedback is included in Candidate.finishReason and Candidate.safetyRatings. If response content was blocked and the finishReason was SAFETY, you can inspect safetyRatings for more details. The content that was blocked is not returned.
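The original guide does not include a code sample here, so the following is a minimal, assumed sketch of how these per-request settings can be passed with the Python SDK used throughout this document. It relaxes the Hate speech filter to BLOCK_ONLY_HIGH, sets Dangerous content to BLOCK_MEDIUM_AND_ABOVE, and then reads the safety feedback fields described above; the prompt text is only a placeholder.


from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Write a short villain monologue for a video game.",  # placeholder prompt
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_HATE_SPEECH,
                threshold=types.HarmBlockThreshold.BLOCK_ONLY_HIGH,
            ),
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
            ),
        ]
    ),
)

# If the prompt itself was blocked, block_reason is set and there is no text.
if response.prompt_feedback and response.prompt_feedback.block_reason:
    print("Prompt blocked:", response.prompt_feedback.block_reason)
else:
    print(response.text)
    # Per-category safety ratings for the first response candidate.
    for rating in (response.candidates[0].safety_ratings or []):
        print(rating.category, rating.probability)





ReAct agent from scratch with Gemini 2.5 and LangGraph

LangGraph is a framework for building stateful LLM applications, making it a good choice for constructing ReAct (Reasoning and Acting) Agents.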
3135 | 3136 | ReAct agents combine LLM reasoning with action execution. They iteratively think, use tools, and act on observations to achieve user goals, dynamically adapting their approach. Introduced in "ReAct: Synergizing Reasoning and Acting in Language Models" (2023), this pattern tries to mirror human-like, flexible problem-solving over rigid workflows. 3137 | 3138 | While LangGraph offers a prebuilt ReAct agent (create_react_agent), it shines when you need more control and customization for your ReAct implementations. 3139 | 3140 | LangGraph models agents as graphs using three key components: 3141 | 3142 | State: Shared data structure (typically TypedDict or Pydantic BaseModel) representing the application's current snapshot. 3143 | Nodes: Encodes logic of your agents. They receive the current State as input, perform some computation or side-effect, and return an updated State, such as LLM calls or tool calls. 3144 | Edges: Define the next Node to execute based on the current State, allowing for conditional logic and fixed transitions. 3145 | If you don't have an API Key yet, you can get one for free at the Google AI Studio. 3146 | 3147 | 3148 | pip install langgraph langchain-google-genai geopy requests 3149 | Set your API key in the environment variable GEMINI_API_KEY. 3150 | 3151 | 3152 | import os 3153 | 3154 | # Read your API key from the environment variable or set it manually 3155 | api_key = os.getenv("GEMINI_API_KEY") 3156 | To better understand how to implement a ReAct agent using LangGraph, let's walk through a practical example. You will create a simple agent whose goal is to use a tool to find the current weather for a specified location. 3157 | 3158 | For this weather agent, its State will need to maintain the ongoing conversation history (as a list of messages) and a counter for the number of steps taken to further illustrate state management. 3159 | 3160 | LangGraph provides a convenient helper, add_messages, for updating message lists in the state. It functions as a reducer, meaning it takes the current list and new messages, then returns a combined list. It smartly handles updates by message ID and defaults to an "append-only" behavior for new, unique messages. 3161 | 3162 | Note: Since having a list of messages in the state is so common, there exists a prebuilt state called MessagesState which makes it easy to use messages. 3163 | 3164 | from typing import Annotated,Sequence, TypedDict 3165 | 3166 | from langchain_core.messages import BaseMessage 3167 | from langgraph.graph.message import add_messages # helper function to add messages to the state 3168 | 3169 | 3170 | class AgentState(TypedDict): 3171 | """The state of the agent.""" 3172 | messages: Annotated[Sequence[BaseMessage], add_messages] 3173 | number_of_steps: int 3174 | Next, you define your weather tool. 
3175 | 3176 | 3177 | from langchain_core.tools import tool 3178 | from geopy.geocoders import Nominatim 3179 | from pydantic import BaseModel, Field 3180 | import requests 3181 | 3182 | geolocator = Nominatim(user_agent="weather-app") 3183 | 3184 | class SearchInput(BaseModel): 3185 | location:str = Field(description="The city and state, e.g., San Francisco") 3186 | date:str = Field(description="the forecasting date for when to get the weather format (yyyy-mm-dd)") 3187 | 3188 | @tool("get_weather_forecast", args_schema=SearchInput, return_direct=True) 3189 | def get_weather_forecast(location: str, date: str): 3190 | """Retrieves the weather using Open-Meteo API for a given location (city) and a date (yyyy-mm-dd). Returns a list dictionary with the time and temperature for each hour.""" 3191 | location = geolocator.geocode(location) 3192 | if location: 3193 | try: 3194 | response = requests.get(f"https://api.open-meteo.com/v1/forecast?latitude={location.latitude}&longitude={location.longitude}&hourly=temperature_2m&start_date={date}&end_date={date}") 3195 | data = response.json() 3196 | return {time: temp for time, temp in zip(data["hourly"]["time"], data["hourly"]["temperature_2m"])} 3197 | except Exception as e: 3198 | return {"error": str(e)} 3199 | else: 3200 | return {"error": "Location not found"} 3201 | 3202 | tools = [get_weather_forecast] 3203 | Next, you initialize your model and bind the tools to the model. 3204 | 3205 | 3206 | from datetime import datetime 3207 | from langchain_google_genai import ChatGoogleGenerativeAI 3208 | 3209 | # Create LLM class 3210 | llm = ChatGoogleGenerativeAI( 3211 | model= "gemini-2.5-pro", 3212 | temperature=1.0, 3213 | max_retries=2, 3214 | google_api_key=api_key, 3215 | ) 3216 | 3217 | # Bind tools to the model 3218 | model = llm.bind_tools([get_weather_forecast]) 3219 | 3220 | # Test the model with tools 3221 | res=model.invoke(f"What is the weather in Berlin on {datetime.today()}?") 3222 | 3223 | print(res) 3224 | The last step before you can run your agent is to define your nodes and edges. In this example, you have two nodes and one edge. - call_tool node that executes your tool method. LangGraph has a prebuilt node for this called ToolNode. - call_model node that uses the model_with_tools to call the model. - should_continue edge that decides whether to call the tool or the model. 3225 | 3226 | The number of nodes and edges is not fixed. You can add as many nodes and edges as you want to your graph. For example, you could add a node for adding structured output or a self-verification/reflection node to check the model output before calling the tool or the model. 
3227 | 3228 | 3229 | from langchain_core.messages import ToolMessage 3230 | from langchain_core.runnables import RunnableConfig 3231 | 3232 | tools_by_name = {tool.name: tool for tool in tools} 3233 | 3234 | # Define our tool node 3235 | def call_tool(state: AgentState): 3236 | outputs = [] 3237 | # Iterate over the tool calls in the last message 3238 | for tool_call in state["messages"][-1].tool_calls: 3239 | # Get the tool by name 3240 | tool_result = tools_by_name[tool_call["name"]].invoke(tool_call["args"]) 3241 | outputs.append( 3242 | ToolMessage( 3243 | content=tool_result, 3244 | name=tool_call["name"], 3245 | tool_call_id=tool_call["id"], 3246 | ) 3247 | ) 3248 | return {"messages": outputs} 3249 | 3250 | def call_model( 3251 | state: AgentState, 3252 | config: RunnableConfig, 3253 | ): 3254 | # Invoke the model with the system prompt and the messages 3255 | response = model.invoke(state["messages"], config) 3256 | # We return a list, because this will get added to the existing messages state using the add_messages reducer 3257 | return {"messages": [response]} 3258 | 3259 | 3260 | # Define the conditional edge that determines whether to continue or not 3261 | def should_continue(state: AgentState): 3262 | messages = state["messages"] 3263 | # If the last message is not a tool call, then we finish 3264 | if not messages[-1].tool_calls: 3265 | return "end" 3266 | # default to continue 3267 | return "continue" 3268 | Now you have all the components to build your agent. Let's put them together. 3269 | 3270 | 3271 | from langgraph.graph import StateGraph, END 3272 | 3273 | # Define a new graph with our state 3274 | workflow = StateGraph(AgentState) 3275 | 3276 | # 1. Add our nodes 3277 | workflow.add_node("llm", call_model) 3278 | workflow.add_node("tools", call_tool) 3279 | # 2. Set the entrypoint as `agent`, this is the first node called 3280 | workflow.set_entry_point("llm") 3281 | # 3. Add a conditional edge after the `llm` node is called. 3282 | workflow.add_conditional_edges( 3283 | # Edge is used after the `llm` node is called. 3284 | "llm", 3285 | # The function that will determine which node is called next. 3286 | should_continue, 3287 | # Mapping for where to go next, keys are strings from the function return, and the values are other nodes. 3288 | # END is a special node marking that the graph is finish. 3289 | { 3290 | # If `tools`, then we call the tool node. 3291 | "continue": "tools", 3292 | # Otherwise we finish. 3293 | "end": END, 3294 | }, 3295 | ) 3296 | # 4. Add a normal edge after `tools` is called, `llm` node is called next. 3297 | workflow.add_edge("tools", "llm") 3298 | 3299 | # Now we can compile and visualize our graph 3300 | graph = workflow.compile() 3301 | You can visualize your graph using the draw_mermaid_png method. 3302 | 3303 | 3304 | from IPython.display import Image, display 3305 | 3306 | display(Image(graph.get_graph().draw_mermaid_png())) 3307 | png 3308 | 3309 | Now let's run the agent. 3310 | 3311 | 3312 | from datetime import datetime 3313 | # Create our initial message dictionary 3314 | inputs = {"messages": [("user", f"What is the weather in Berlin on {datetime.today()}?")]} 3315 | 3316 | # call our graph with streaming to see the steps 3317 | for state in graph.stream(inputs, stream_mode="values"): 3318 | last_message = state["messages"][-1] 3319 | last_message.pretty_print() 3320 | You can now continue with your conversation and for example ask for the weather in another city or let it compare it. 
3321 | 3322 | 3323 | state["messages"].append(("user", "Would it be in Munich warmer?")) 3324 | 3325 | for state in graph.stream(state, stream_mode="values"): 3326 | last_message = state["messages"][-1] 3327 | last_message.pretty_print() 3328 | 3329 | 3330 | 3331 | Customer Support Analysis with Gemini 2.5 Pro and CrewAI 3332 | 3333 | CrewAI is a framework for orchestrating autonomous AI agents that collaborate to achieve complex goals. It lets you define agents by specifying roles, goals, and backstories, and then define tasks for them. 3334 | 3335 | This example demonstrates how to build a multi-agent system for analyzing customer support data to identify issues and propose process improvements using Gemini 2.5 Pro, generating a report intended to be read by a Chief Operating Officer (COO). 3336 | 3337 | The guide will show you how to create a "crew" of AI agents that can do the following tasks: 3338 | 3339 | Fetch and analyze customer support data (simulated in this example). 3340 | Identify recurring problems and process bottlenecks. 3341 | Suggest actionable improvements. 3342 | Compile the findings into a concise report suitable for a COO. 3343 | You need a Gemini API key. If you don't already have one, you can get one in Google AI Studio. 3344 | 3345 | 3346 | pip install "crewai[tools]" 3347 | Set your Gemini API key as an environment variable named GEMINI_API_KEY, then configure CrewAI to use the Gemini 2.5 Pro model. 3348 | 3349 | 3350 | import os 3351 | from crewai import LLM 3352 | 3353 | # Read your API key from the environment variable 3354 | gemini_api_key = os.getenv("GEMINI_API_KEY") 3355 | 3356 | # Use Gemini 2.5 Pro Experimental model 3357 | gemini_llm = LLM( 3358 | model='gemini/gemini-2.5-pro', 3359 | api_key=gemini_api_key, 3360 | temperature=0.0 # Lower temperature for more consistent results. 3361 | ) 3362 | Define components 3363 | CrewAI applications are built using Tools, Agents, Tasks, and the Crew itself. Each of these is explained in the following sections. 3364 | 3365 | Tools 3366 | Tools are capabilities that agents can use to interact with the outside world or perform specific actions. Here, you define a placeholder tool to simulate fetching customer support data. In a real application, you would connect to a database, API or file system. For more information on tools, see the CrewAI tools guide. 3367 | 3368 | 3369 | from crewai.tools import BaseTool 3370 | 3371 | # Placeholder tool for fetching customer support data 3372 | class CustomerSupportDataTool(BaseTool): 3373 | name: str = "Customer Support Data Fetcher" 3374 | description: str = ( 3375 | "Fetches recent customer support interactions, tickets, and feedback. " 3376 | "Returns a summary string.") 3377 | 3378 | def _run(self, argument: str) -> str: 3379 | # In a real scenario, this would query a database or API. 3380 | # For this example, return simulated data. 3381 | print(f"--- Fetching data for query: {argument} ---") 3382 | return ( 3383 | """Recent Support Data Summary: 3384 | - 50 tickets related to 'login issues'. High resolution time (avg 48h). 3385 | - 30 tickets about 'billing discrepancies'. Mostly resolved within 12h. 3386 | - 20 tickets on 'feature requests'. Often closed without resolution. 3387 | - Frequent feedback mentions 'confusing user interface' for password reset. 3388 | - High volume of calls related to 'account verification process'. 3389 | - Sentiment analysis shows growing frustration with 'login issues' resolution time. 
3390 | - Support agent notes indicate difficulty reproducing 'login issues'.""" 3391 | ) 3392 | 3393 | support_data_tool = CustomerSupportDataTool() 3394 | Agents 3395 | Agents are the individual AI workers in your crew. Each agent has a specific role, goal, backstory, assigned llm, and optional tools. For more information on agents, see the CrewAI agents guide. 3396 | 3397 | 3398 | from crewai import Agent 3399 | 3400 | # Agent 1: Data analyst 3401 | data_analyst = Agent( 3402 | role='Customer Support Data Analyst', 3403 | goal='Analyze customer support data to identify trends, recurring issues, and key pain points.', 3404 | backstory=( 3405 | """You are an expert data analyst specializing in customer support operations. 3406 | Your strength lies in identifying patterns and quantifying problems from raw support data.""" 3407 | ), 3408 | verbose=True, 3409 | allow_delegation=False, # This agent focuses on its specific task 3410 | tools=[support_data_tool], # Assign the data fetching tool 3411 | llm=gemini_llm # Use the configured Gemini LLM 3412 | ) 3413 | 3414 | # Agent 2: Process optimizer 3415 | process_optimizer = Agent( 3416 | role='Process Optimization Specialist', 3417 | goal='Identify bottlenecks and inefficiencies in current support processes based on the data analysis. Propose actionable improvements.', 3418 | backstory=( 3419 | """You are a specialist in optimizing business processes, particularly in customer support. 3420 | You excel at pinpointing root causes of delays and inefficiencies and suggesting concrete solutions.""" 3421 | ), 3422 | verbose=True, 3423 | allow_delegation=False, 3424 | # No tools needed, this agent relies on the context provided by data_analyst. 3425 | llm=gemini_llm 3426 | ) 3427 | 3428 | # Agent 3: Report writer 3429 | report_writer = Agent( 3430 | role='Executive Report Writer', 3431 | goal='Compile the analysis and improvement suggestions into a concise, clear, and actionable report for the COO.', 3432 | backstory=( 3433 | """You are a skilled writer adept at creating executive summaries and reports. 3434 | You focus on clarity, conciseness, and highlighting the most critical information and recommendations for senior leadership.""" 3435 | ), 3436 | verbose=True, 3437 | allow_delegation=False, 3438 | llm=gemini_llm 3439 | ) 3440 | Tasks 3441 | Tasks define the specific assignments for the agents. Each task has a description, expected_output, and is assigned to an agent. Tasks are run sequentially by default and include the context of the previous task. For more information on tasks, see the CrewAI tasks guide. 3442 | 3443 | 3444 | from crewai import Task 3445 | 3446 | # Task 1: Analyze data 3447 | analysis_task = Task( 3448 | description=( 3449 | """Fetch and analyze the latest customer support interaction data (tickets, feedback, call logs) 3450 | focusing on the last quarter. Identify the top 3-5 recurring issues, quantify their frequency 3451 | and impact (e.g., resolution time, customer sentiment). Use the Customer Support Data Fetcher tool.""" 3452 | ), 3453 | expected_output=( 3454 | """A summary report detailing the key findings from the customer support data analysis, including: 3455 | - Top 3-5 recurring issues with frequency. 3456 | - Average resolution times for these issues. 3457 | - Key customer pain points mentioned in feedback. 
3458 | - Any notable trends in sentiment or support agent observations.""" 3459 | ), 3460 | agent=data_analyst # Assign task to the data_analyst agent 3461 | ) 3462 | 3463 | # Task 2: Identify bottlenecks and suggest improvements 3464 | optimization_task = Task( 3465 | description=( 3466 | """Based on the data analysis report provided by the Data Analyst, identify the primary bottlenecks 3467 | in the support processes contributing to the identified issues (especially the top recurring ones). 3468 | Propose 2-3 concrete, actionable process improvements to address these bottlenecks. 3469 | Consider potential impact and ease of implementation.""" 3470 | ), 3471 | expected_output=( 3472 | """A concise list identifying the main process bottlenecks (e.g., lack of documentation for agents, 3473 | complex escalation path, UI issues) linked to the key problems. 3474 | A list of 2-3 specific, actionable recommendations for process improvement 3475 | (e.g., update agent knowledge base, simplify password reset UI, implement proactive monitoring).""" 3476 | ), 3477 | agent=process_optimizer # Assign task to the process_optimizer agent 3478 | # This task implicitly uses the output of analysis_task as context 3479 | ) 3480 | 3481 | # Task 3: Compile COO report 3482 | report_task = Task( 3483 | description=( 3484 | """Compile the findings from the Data Analyst and the recommendations from the Process Optimization Specialist 3485 | into a single, concise executive report for the COO. The report should clearly state: 3486 | 1. The most critical customer support issues identified (with brief data points). 3487 | 2. The key process bottlenecks causing these issues. 3488 | 3. The recommended process improvements. 3489 | Ensure the report is easy to understand, focuses on actionable insights, and is formatted professionally.""" 3490 | ), 3491 | expected_output=( 3492 | """A well-structured executive report (max 1 page) summarizing the critical support issues, 3493 | underlying process bottlenecks, and clear, actionable recommendations for the COO. 3494 | Use clear headings and bullet points.""" 3495 | ), 3496 | agent=report_writer # Assign task to the report_writer agent 3497 | ) 3498 | Crew 3499 | The Crew brings the agents and tasks together, defining the workflow process (such as "sequential"). 3500 | 3501 | 3502 | from crewai import Crew, Process 3503 | 3504 | # Define the crew with agents, tasks, and process 3505 | support_analysis_crew = Crew( 3506 | agents=[data_analyst, process_optimizer, report_writer], 3507 | tasks=[analysis_task, optimization_task, report_task], 3508 | process=Process.sequential, # Tasks will run sequentially in the order defined 3509 | verbose=True 3510 | ) 3511 | Run the Crew 3512 | Finally, kick off the crew execution with any necessary inputs. 3513 | 3514 | 3515 | # Start the crew's work 3516 | print("--- Starting Customer Support Analysis Crew ---") 3517 | # The 'inputs' dictionary provides initial context if needed by the first task. 3518 | # In this case, the tool simulates data fetching regardless of the input. 3519 | result = support_analysis_crew.kickoff(inputs={'data_query': 'last quarter support data'}) 3520 | 3521 | print("--- Crew Execution Finished ---") 3522 | print("--- Final Report for COO ---") 3523 | print(result) 3524 | The script will now execute. The Data Analyst will use the tool, the Process Optimizer will analyze the findings, and the Report Writer will compile the final report, which is then printed to the console. 
The verbose=True setting will show the detailed thought process and actions of each agent.




Research Agent with Gemini 2.5 Pro and LlamaIndex

LlamaIndex is a framework for building knowledge agents using LLMs connected to your data. This example shows you how to build a multi-agent workflow for a Research Agent. In LlamaIndex, Workflows are the building blocks of agent or multi-agent systems.

You need a Gemini API key. If you don't already have one, you can get one in Google AI Studio. First, install all required LlamaIndex libraries. LlamaIndex uses the google-genai package under the hood.


pip install llama-index llama-index-utils-workflow llama-index-llms-google-genai llama-index-tools-google

Set up Gemini 2.5 Pro in LlamaIndex
The engine of any LlamaIndex agent is an LLM that handles reasoning and text processing. This example uses Gemini 2.5 Pro. Make sure you set your API key as an environment variable.


from llama_index.llms.google_genai import GoogleGenAI

llm = GoogleGenAI(model="gemini-2.5-pro")

Build tools
Agents use tools to interact with the outside world, like searching the web or storing information. Tools in LlamaIndex can be regular Python functions, or imported from pre-existing ToolSpecs. Gemini comes with a built-in tool for using Google Search, which is used here.


from google.genai import types

google_search_tool = types.Tool(
    google_search=types.GoogleSearch()
)

llm_with_search = GoogleGenAI(
    model="gemini-2.5-pro",
    generation_config=types.GenerateContentConfig(tools=[google_search_tool])
)

Now test the LLM instance with a query that requires search:


response = llm_with_search.complete("What's the weather like today in Biarritz?")
print(response)

The Research Agent will use Python functions as tools. There are a lot of ways you could go about building a system to perform this task. In this example, you will use the following:

search_web uses Gemini with Google Search to search the web for information on the given topic.
record_notes saves research found on the web to the state so that the other tools can use it.
write_report writes the report using the information found by the ResearchAgent.
review_report reviews the report and provides feedback.
The Context class passes the state between agents/tools, and each agent will have access to the current state of the system.


from llama_index.core.workflow import Context

async def search_web(ctx: Context, query: str) -> str:
    """Useful for searching the web about a specific query or topic"""
    response = await llm_with_search.acomplete(f"""Please research given this query or topic,
    and return the result\n{query}""")
    return response

async def record_notes(ctx: Context, notes: str, notes_title: str) -> str:
    """Useful for recording notes on a given topic."""
    current_state = await ctx.store.get("state")
    if "research_notes" not in current_state:
        current_state["research_notes"] = {}
    current_state["research_notes"][notes_title] = notes
    await ctx.store.set("state", current_state)
    return "Notes recorded."
3590 | 3591 | async def write_report(ctx: Context, report_content: str) -> str: 3592 | """Useful for writing a report on a given topic.""" 3593 | current_state = await ctx.store.get("state") 3594 | current_state["report_content"] = report_content 3595 | await ctx.store.set("state", current_state) 3596 | return "Report written." 3597 | 3598 | async def review_report(ctx: Context, review: str) -> str: 3599 | """Useful for reviewing a report and providing feedback.""" 3600 | current_state = await ctx.store.get("state") 3601 | current_state["review"] = review 3602 | await ctx.store.set("state", current_state) 3603 | return "Report reviewed." 3604 | Build a multi-agent assistant 3605 | To build a multi-agent system, you define the agents and their interactions. Your system will have three agents: 3606 | 3607 | A ResearchAgent searches the web for information on the given topic. 3608 | A WriteAgent writes the report using the information found by the ResearchAgent. 3609 | A ReviewAgent reviews the report and provides feedback. 3610 | This example uses the AgentWorkflow class to create a multi-agent system that will execute these agents in order. Each agent takes a system_prompt that tells it what it should do, and suggests how to work with the other agents. 3611 | 3612 | Optionally, you can help your multi-agent system by specifying which other agents it can talk to using can_handoff_to (if not, it will try to figure this out on its own). 3613 | 3614 | 3615 | from llama_index.core.agent.workflow import ( 3616 | AgentInput, 3617 | AgentOutput, 3618 | ToolCall, 3619 | ToolCallResult, 3620 | AgentStream, 3621 | ) 3622 | from llama_index.core.agent.workflow import FunctionAgent, ReActAgent 3623 | 3624 | research_agent = FunctionAgent( 3625 | name="ResearchAgent", 3626 | description="Useful for searching the web for information on a given topic and recording notes on the topic.", 3627 | system_prompt=( 3628 | "You are the ResearchAgent that can search the web for information on a given topic and record notes on the topic. " 3629 | "Once notes are recorded and you are satisfied, you should hand off control to the WriteAgent to write a report on the topic." 3630 | ), 3631 | llm=llm, 3632 | tools=[search_web, record_notes], 3633 | can_handoff_to=["WriteAgent"], 3634 | ) 3635 | 3636 | write_agent = FunctionAgent( 3637 | name="WriteAgent", 3638 | description="Useful for writing a report on a given topic.", 3639 | system_prompt=( 3640 | "You are the WriteAgent that can write a report on a given topic. " 3641 | "Your report should be in a markdown format. The content should be grounded in the research notes. " 3642 | "Once the report is written, you should get feedback at least once from the ReviewAgent." 3643 | ), 3644 | llm=llm, 3645 | tools=[write_report], 3646 | can_handoff_to=["ReviewAgent", "ResearchAgent"], 3647 | ) 3648 | 3649 | review_agent = FunctionAgent( 3650 | name="ReviewAgent", 3651 | description="Useful for reviewing a report and providing feedback.", 3652 | system_prompt=( 3653 | "You are the ReviewAgent that can review a report and provide feedback. " 3654 | "Your feedback should either approve the current report or request changes for the WriteAgent to implement." 3655 | ), 3656 | llm=llm, 3657 | tools=[review_report], 3658 | can_handoff_to=["ResearchAgent","WriteAgent"], 3659 | ) 3660 | The Agents are defined, now you can create the AgentWorkflow and run it. 
from llama_index.core.agent.workflow import AgentWorkflow

agent_workflow = AgentWorkflow(
    agents=[research_agent, write_agent, review_agent],
    root_agent=research_agent.name,
    initial_state={
        "research_notes": {},
        "report_content": "Not written yet.",
        "review": "Review required.",
    },
)

During execution of the workflow, you can stream events, tool calls and updates to the console.


from llama_index.core.agent.workflow import (
    AgentInput,
    AgentOutput,
    ToolCall,
    ToolCallResult,
    AgentStream,
)

research_topic = """Write me a report on the history of the web.
Briefly describe the history of the world wide web, including
the development of the internet and the development of the web,
including 21st century developments"""

handler = agent_workflow.run(
    user_msg=research_topic
)

current_agent = None
current_tool_calls = ""
async for event in handler.stream_events():
    if (
        hasattr(event, "current_agent_name")
        and event.current_agent_name != current_agent
    ):
        current_agent = event.current_agent_name
        print(f"\n{'='*50}")
        print(f"🤖 Agent: {current_agent}")
        print(f"{'='*50}\n")
    elif isinstance(event, AgentOutput):
        if event.response.content:
            print("📤 Output:", event.response.content)
        if event.tool_calls:
            print(
                "🛠️ Planning to use tools:",
                [call.tool_name for call in event.tool_calls],
            )
    elif isinstance(event, ToolCallResult):
        print(f"🔧 Tool Result ({event.tool_name}):")
        print(f"  Arguments: {event.tool_kwargs}")
        print(f"  Output: {event.tool_output}")
    elif isinstance(event, ToolCall):
        print(f"🔨 Calling Tool: {event.tool_name}")
        print(f"  With arguments: {event.tool_kwargs}")

After the workflow is complete, you can print the final output of the report, as well as the final review state from the review agent.


state = await handler.ctx.store.get("state")
print("Report Content:\n", state["report_content"])
print("\n------------\nFinal Review:\n", state["review"])

Go further with custom workflows
The AgentWorkflow is a great way to get started with multi-agent systems. But what if you need more control? You can build a workflow from scratch (a minimal sketch follows the list below). Here are some reasons why you might want to build your own workflow:

More control over the process: You can decide the exact path your agents take. This includes creating loops, making decisions at certain points, or having agents work in parallel on different tasks.
Use complex data: Go beyond simple text. Custom workflows let you use more structured data, like JSON objects or custom classes, for your inputs and outputs.
Work with different media: Build agents that can understand and process not just text, but also images, audio, and video.
Smarter planning: You can design a workflow that first creates a detailed plan before the agents start working. This is useful for complex tasks that require multiple steps.
Enable self-correction: Create agents that can review their own work. If the output isn't good enough, the agent can try again, creating a loop of improvement until the result is perfect.
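As a rough starting point (an assumed sketch built on LlamaIndex's workflow primitives, not code from the original guide; the event and step names are illustrative), a from-scratch workflow subclasses Workflow and connects typed steps with events, which is where the extra control over loops, branching, and state comes from:


from llama_index.core.workflow import (
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)
from llama_index.llms.google_genai import GoogleGenAI

llm = GoogleGenAI(model="gemini-2.5-pro")

# Custom event carrying intermediate data between steps.
class DraftEvent(Event):
    draft: str

class ReportWorkflow(Workflow):
    @step
    async def write(self, ev: StartEvent) -> DraftEvent:
        # First step: draft a report on the requested topic.
        response = await llm.acomplete(f"Write a short report about: {ev.topic}")
        return DraftEvent(draft=str(response))

    @step
    async def review(self, ev: DraftEvent) -> StopEvent:
        # Second step: review the draft. Emitting another custom event here
        # instead of StopEvent is how you would build a revision loop.
        review = await llm.acomplete(f"Briefly review this report:\n{ev.draft}")
        return StopEvent(result={"report": ev.draft, "review": str(review)})

workflow = ReportWorkflow(timeout=120)
result = await workflow.run(topic="history of the web")
print(result["review"])

--------------------------------------------------------------------------------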