├── README.md ├── index.html ├── package.json ├── server ├── bot-gemini.py ├── env.example ├── requirements.txt ├── runner.py └── server.py ├── src ├── app.ts └── styles.css └── tsconfig.json /README.md: -------------------------------------------------------------------------------- 1 | # Gemini Multimodal Live WebRTC Example 2 | 3 | This example shows how to build a very simple voice AI application using the Gemini Multimodal Live API and WebRTC. 4 | 5 | The client is a web app in a single [app.ts](./src/app.ts) file and has just four dependencies: 6 | 1. the Open Source [Pipecat JavaScript SDK](https://github.com/pipecat-ai/pipecat-client-web) 7 | 2. a [Daily WebRTC transport](https://github.com/pipecat-ai/pipecat-client-web-transports) plugin 8 | 3. [ts-node-dev](https://www.npmjs.com/package/ts-node-dev) 9 | 4. [vite](https://vite.dev/) 10 | 11 | The WebRTC connection is just this code, plus event handlers for setting up audio playback and handling any events that you want to wire up to your user interface. 12 | 13 | ```typescript 14 | const rtviClient = new RTVIClient({ 15 | transport, 16 | params: { 17 | baseUrl: "http://localhost:7860/", 18 | }, 19 | enableMic: true, 20 | enableCam: false, 21 | timeout: 30 * 1000, 22 | }); 23 | ``` 24 | 25 | The server is a [Pipecat](https://github.com/pipecat-ai/pipecat) pipeline that uses the Gemini Multimodal Live WebSocket API [implementation](https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/gemini_multimodal_live) in Pipecat core. 26 | 27 | ## Installation and first run 28 | 29 | Use two terminals, one for the web app (client) and one for the server. 30 | 31 | ### Server 32 | 33 | ``` 34 | cd server 35 | cp env.example .env 36 | ``` 37 | 38 | You'll need a Gemini API key and a Daily API key: 39 | * Create a free Gemini API key at [https://aistudio.google.com/](https://aistudio.google.com/) 40 | * Create a free Daily account at [https://dashboard.daily.co/u/signup](https://dashboard.daily.co/u/signup) 41 | 42 | Add both API keys to `.env`. 43 | 44 | Create a Python virtual environment, install the dependencies, and run `python server.py`. 45 | 46 | ``` 47 | python3.12 -m venv venv 48 | source venv/bin/activate 49 | pip install -r requirements.txt 50 | python server.py 51 | ``` 52 | 53 | ### Web app 54 | 55 | ``` 56 | npm i 57 | npm run dev 58 | ``` 59 | 60 | Open a web browser to the **Local** URL that `npm run dev` printed to the console. This defaults to [http://localhost:5173](http://localhost:5173/). 61 | 62 | ### Architecture 63 | 64 | The web app connects to a server running a [Pipecat](https://github.com/pipecat-ai/pipecat) process. Pipecat is an orchestration framework for realtime, multimodal AI. In this use case, Pipecat translates between the WebRTC media streams (and Pipecat client/server events) and the Multimodal Live API. 65 | 66 | You can think of this architecture as a smart proxy. 67 | 68 | 1. The client sends and receives audio and events using the WebRTC protocol. 69 | 2. The Pipecat pipeline relays the audio and events to the Multimodal Live API. 70 | 71 | This two-hop approach has several advantages over connecting directly from the client to the Multimodal Live API WebSocket infrastructure. 72 | 73 | - WebRTC delivers lower latency and better robustness for the median real-world user. Even though we are using a proxy/relay approach, for a large percentage of users, voice-to-voice response times will be measurably faster than using a direct WebSocket connection. See the next section — "Why WebRTC?" — for more details on this.
74 | - You can add functionality using Pipecat's built-in services, or write custom server-side realtime logic in Python. As an example, this Pipecat pipeline implements audio transcription by sending both the user and the model audio to Gemini's standard (non-Live) API. 75 | - The SDK implements the [RTVI open standard](https://docs.pipecat.ai/client/introduction#about-rtvi) for AI client-server events. This rich set of events makes it easy to implement core features of voice-to-voice and multimodal AI applications. Pipecat's React components, for example, leverage this event system. 76 | - Compatible Web, React, React Native, iOS, Android, Python, and C++ SDKs are available, all maintained and developed by a large and active Open Source community. 77 | 78 | There are also two disadvantages to using WebRTC this way. 79 | 80 | - WebRTC is more complicated than WebSockets. You'll need to use an SDK rather than write code that calls low-level WebRTC APIs. (This is generally true for Web applications, though arguing about *how* true it is for programmers with different levels of networking experience is outside the scope of this readme. It's *definitely* true for native mobile applications.) This is the pain point that the Pipecat open source client SDKs aim to alleviate. 81 | - In production, you probably will not want to run your own WebRTC server clusters. WebRTC is closer to telephony, in complexity, than to Web servers. You probably don't run your own SIP/PSTN/SMS infrastructure. So you'll likely pay a WebRTC infrastructure provider per-minute or per-gigabyte for each bot session. The Pipecat SDKs support both WebSockets and WebRTC. You can use WebSockets in development and then transition over to WebRTC when you're ready to deploy to real-world users. 82 | 83 | 84 | ``` 85 | 86 | ┌─────────────────────────────────────────┐ 87 | │ │ 88 | │ Server │ 89 | │ │ 90 | │ │ 91 | │ ┌────────────────────┐ │ 92 | │ │ │ │ 93 | │ │ Pipecat │ │ 94 | │ │ Pipeline │ │ 95 | │ │ │ │ 96 | │ │ │ │ 97 | ┌──────────────────────────┐ │ │ Audio Processing │ │ 98 | │ │ │ │ ▼ │ │ 99 | │ Pipecat Client │ │ ┌─────────────│ Gemini Flash ─┼──┼────► 100 | │ ┌───────────────┐ │ │ │ │ Transcription ◄┼──┼───── 101 | │ │ WebRTC (Daily)│ ────┼────────►│WebRTC (Daily) ▼ │ │ 102 | │ │ Transport │ ◄───┼─────────│ Transport │ Gemini Multimodal─┼──┼────► 103 | │ └───────────────┘ │ │ │ │ Live API ◄┼──┼───── 104 | │ │ │ └─────────────│ ▼ │ │ 105 | └──────────────────────────┘ │ │ Gemini Flash ─┼──┼────► 106 | │ │ Transcription ◄┼──┼───── 107 | │ │ ▼ │ │ 108 | │ │ Conversation │ │ 109 | │ │ Context │ │ 110 | │ │ Management │ │ 111 | │ │ ▼ │ │ 112 | │ │ RTVI Events │ │ 113 | │ │ │ │ 114 | │ └────────────────────┘ │ 115 | │ │ 116 | └─────────────────────────────────────────┘ 117 | ``` 118 | 119 | 120 | One other note about infrastructure: the instructions in this readme show you how to run the Pipecat process on your local machine. For a production application, you would run the bot on a machine in the cloud. 121 | 122 | If you expect your maximum traffic to be ~100 concurrent connections, you can just deploy the `/server` directory from this project. It will scale fine on a single, mid-range cloud VM. For higher traffic volumes, or if you'd rather somebody else handle the infrastructure devops for you, you might consider using a hosted Pipecat platform such as [Daily Bots](https://bots.daily.co/). 123 | 124 | ## Why WebRTC?
125 | 126 | If you're just starting out with voice AI, you might gravitate towards using a WebSocket library for networking. WebSockets are familiar, simple, and widely supported. They work well for server-to-server use cases, for use cases where latency is not a primary concern, and for prototyping and general hacking. 127 | 128 | But WebSockets shouldn't be used in production for client-server, real-time media connections. 129 | 130 | For production apps, you need to use WebRTC. WebRTC was designed from the ground up as *the* protocol for real-time media on the Internet. 131 | 132 | The major problems with WebSockets for real-time media delivery to and from end-user devices are: 133 | 134 | - WebSockets are built on TCP, so audio streams will be subject to head-of-line blocking and will automatically attempt packet resends even if packets are delayed so much that they cannot be used for playout. 135 | 136 | - The Opus audio codec used for WebRTC is tightly coupled to WebRTC's bandwidth estimation and packet pacing (congestion control) logic, making a WebRTC audio stream resilient to a wide range of real-world network behaviors that would cause a WebSocket connection to accumulate latency. 137 | 138 | - The Opus audio codec has very good forward error correction, making the audio stream resilient to relatively high amounts of packet loss. (This only helps you if your network transport can drop late-arriving packets and doesn't do head-of-line blocking, though.) 139 | 140 | - Audio sent and received over WebRTC is automatically time-stamped, so both playout and interruption logic are trivial. Both are harder to get right for all corner cases when using WebSockets. 141 | 142 | - WebRTC includes hooks for detailed performance and media quality statistics. A good WebRTC platform will give you detailed dashboards and analytics for both aggregate and individual session statistics that are specific to audio and video. This level of observability is somewhere between very hard and impossible to build for WebSockets. 143 | 144 | - WebSocket reconnection logic is very hard to implement robustly. You will have to build a ping/ack framework (or fully test and understand the framework that your WebSocket library provides). TCP timeouts and connection events behave differently on different platforms. 145 | 146 | - Finally, good WebRTC implementations today come with very good echo cancellation, noise reduction, and automatic gain control. You will likely need to figure out how to stitch this kind of audio processing into an app that uses WebSockets. 147 | 148 | In addition, long-haul public Internet routes are problematic for latency and real-time media reliability, no matter what the underlying network protocol is. So if your end-users are a significant distance from Google's servers, it's important to try to connect the user to a media router as close to them as possible. Beyond that first "edge" connection, you can then use a more efficient backbone route. A good WebRTC platform will do this for you automatically. 149 | 150 | In this sample application, for example, latency for a typical user will actually be lower routing through Daily's WebRTC network — which has many, many servers close to the edge of the network all over the world — than connecting to a "direct" WebSocket server over a public Internet route. (There are no direct connections on the Internet. Every packet is relayed through multiple routers. Better routing means better response times.)
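As a concrete illustration of the development-to-production path described above (WebSockets while prototyping, WebRTC when you deploy), here is a minimal sketch of how this example's client could isolate the transport choice behind a single function. The `RTVIClient` options mirror [app.ts](./src/app.ts); the WebSocket branch is deliberately left as a stub, since the specific WebSocket transport class is an assumption and not part of this example (see the [pipecat-client-web-transports](https://github.com/pipecat-ai/pipecat-client-web-transports) repo for what's available).

```typescript
import { RTVIClient, Transport } from "@pipecat-ai/client-js";
import { DailyTransport } from "@pipecat-ai/daily-transport";

// Pick the transport in one place. Swapping what this function returns is
// the only client-side change needed to move between a WebSocket transport
// (development) and WebRTC via Daily (production).
function createTransport(useWebRTC: boolean): Transport {
  if (useWebRTC) {
    return new DailyTransport();
  }
  // A WebSocket-based transport from pipecat-client-web-transports would be
  // constructed here; it is omitted because it isn't part of this example.
  throw new Error("WebSocket transport not wired up in this sketch");
}

const rtviClient = new RTVIClient({
  transport: createTransport(true),
  params: {
    baseUrl: "http://localhost:7860/",
  },
  enableMic: true,
  enableCam: false,
  timeout: 30 * 1000,
});
```

Everything else on the client (event handlers, transcription rendering, audio playback) should work the same regardless of which transport the function returns, because the RTVI events are transport-independent.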
151 | 152 | ## More resources 153 | 154 | The Gemini Multimodal Live docs overview is here: 155 | 156 | https://ai.google.dev/api/multimodal-live 157 | 158 | The repo for the Pipecat SDK for Gemini and WebRTC on the web (and React) is here: 159 | 160 | https://github.com/pipecat-ai/pipecat-client-web 161 | 162 | Contributions are welcome! If you want to write a new pluggable network transport for the SDK, check out this repo and README: 163 | 164 | https://github.com/pipecat-ai/pipecat-client-web-transports 165 | 166 | Here's a full-featured multimodal chat application that demonstrates how to use the Gemini Multimodal Live WebSocket API, HTTP single-turn APIs, and WebRTC all in one app. (They all have their place for different use cases.) 167 | 168 | https://github.com/pipecat-ai/gemini-multimodal-live-demo 169 | 170 | Pipecat's SDKs for Web, React, React Native, Android, iOS, Python, and C++ are all architecture-compatible and Open Source. 171 | 172 | The Pipecat Android SDK repo is here: 173 | 174 | https://github.com/pipecat-ai/pipecat-client-android 175 | 176 | The Pipecat iOS SDK repo is here: 177 | 178 | https://github.com/pipecat-ai/pipecat-client-ios 179 | 180 | If you're interested in network protocols designed for sending media, here's a technical overview of RTMP, HLS, and WebRTC: 181 | 182 | https://www.daily.co/blog/video-live-streaming/ 183 | 184 | For a deep dive into WebRTC edge and mesh routing, here's a long post about Daily's global WebRTC infrastructure: 185 | 186 | https://www.daily.co/blog/global-mesh-network/ 187 | 188 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Simple Gemini WebRTC Demo 6 | 7 | 8 | 9 | 10 | 11 |
12 |

13 | Click here to start a WebRTC session using the Gemini 14 | Multimodal Live API with audio routed to the cloud using WebRTC. 15 |

16 |

17 | Open the Console to see transport events. 18 |

19 |
20 | 21 |
22 | 23 |
24 | 25 | 26 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "1116", 3 | "version": "1.0.0", 4 | "main": "server/server.ts", 5 | "scripts": { 6 | "dev": "vite", 7 | "build": "vite build", 8 | "preview": "vite preview" 9 | }, 10 | "keywords": [], 11 | "author": "", 12 | "license": "ISC", 13 | "description": "", 14 | "dependencies": { 15 | "@pipecat-ai/client-js": "^0.3.1", 16 | "@pipecat-ai/daily-transport": "^0.3.2", 17 | "ts-node-dev": "^2.0.0", 18 | "vite": "^6.0.4" 19 | } 20 | } -------------------------------------------------------------------------------- /server/bot-gemini.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | """Gemini Bot Implementation. 8 | 9 | This module implements a chatbot using Google's Gemini Multimodal Live model. 10 | It includes: 11 | - Real-time audio/video interaction through Daily 12 | - Speech-to-speech using the Gemini Multimodal Live API 13 | - Transcription using Gemini's generate_content API 14 | - RTVI client/server events 15 | 16 | """ 17 | 18 | import asyncio 19 | from datetime import date 20 | import os 21 | import sys 22 | 23 | import aiohttp 24 | from dotenv import load_dotenv 25 | from loguru import logger 26 | from runner import configure 27 | 28 | from pipecat.audio.vad.silero import SileroVADAnalyzer 29 | from pipecat.audio.vad.vad_analyzer import VADParams 30 | from pipecat.frames.frames import Frame, EndFrame, TranscriptionFrame 31 | from pipecat.processors.frame_processor import FrameDirection, FrameProcessor 32 | from pipecat.pipeline.pipeline import Pipeline 33 | from pipecat.pipeline.runner import PipelineRunner 34 | from pipecat.pipeline.task import PipelineParams, PipelineTask 35 | from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext 36 | from pipecat.processors.frameworks.rtvi import ( 37 | RTVIBotTranscriptionProcessor, 38 | RTVIMetricsProcessor, 39 | RTVISpeakingProcessor, 40 | RTVIUserTranscriptionProcessor, 41 | ) 42 | from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService 43 | from pipecat.transports.services.daily import DailyParams, DailyTransport 44 | 45 | load_dotenv(override=True) 46 | 47 | logger.remove(0) 48 | logger.add(sys.stderr, level="DEBUG") 49 | 50 | SYSTEM_INSTRUCTION = f""" 51 | "You are Gemini Chatbot, a friendly, helpful robot. 52 | 53 | Your goal is to demonstrate your capabilities in a succinct way. 54 | 55 | Your output will be converted to audio so don't include special characters in your answers. 56 | 57 | Respond to what the user said in a creative and helpful way. Keep your responses brief. One or two sentences at most. 58 | 59 | Today is {date.today().strftime("%A, %B %d, %Y")} 60 | """ 61 | 62 | 63 | class UserTranscriptionFrameFilter(FrameProcessor): 64 | """Filter out UserTranscription frames.""" 65 | 66 | async def process_frame(self, frame: Frame, direction: FrameDirection): 67 | await super().process_frame(frame, direction) 68 | 69 | if isinstance(frame, TranscriptionFrame) and frame.user_id == "user": 70 | return 71 | 72 | await self.push_frame(frame, direction) 73 | 74 | 75 | async def main(): 76 | """Main bot execution function. 
77 | 78 | Sets up and runs the bot pipeline including: 79 | - Daily video transport with specific audio parameters 80 | - Gemini Live multimodal model integration 81 | - Voice activity detection 82 | - Animation processing 83 | - RTVI event handling 84 | """ 85 | async with aiohttp.ClientSession() as session: 86 | (room_url, token) = await configure(session) 87 | 88 | # Set up Daily transport with specific audio/video parameters for Gemini 89 | transport = DailyTransport( 90 | room_url, 91 | token, 92 | "Chatbot", 93 | DailyParams( 94 | audio_in_sample_rate=16000, 95 | audio_out_sample_rate=24000, 96 | audio_out_enabled=True, 97 | camera_out_enabled=True, 98 | camera_out_width=1024, 99 | camera_out_height=576, 100 | vad_enabled=True, 101 | vad_audio_passthrough=True, 102 | vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)), 103 | ), 104 | ) 105 | 106 | # Initialize the Gemini Multimodal Live model 107 | llm = GeminiMultimodalLiveLLMService( 108 | api_key=os.getenv("GOOGLE_API_KEY"), 109 | voice_id="Puck", # Aoede, Charon, Fenrir, Kore, Puck 110 | transcribe_user_audio=True, 111 | transcribe_model_audio=True, 112 | system_instruction=SYSTEM_INSTRUCTION, 113 | tools=[{"google_search": {}}, {"code_execution": {}}], 114 | ) 115 | 116 | messages = [ 117 | { 118 | "role": "user", 119 | "content": 'Start by saying "Hello, I\'m Gemini".', 120 | }, 121 | ] 122 | 123 | # Set up conversation context and management 124 | # The context_aggregator will automatically collect conversation context 125 | context = OpenAILLMContext(messages) 126 | context_aggregator = llm.create_context_aggregator(context) 127 | 128 | # 129 | # RTVI events for Pipecat client UI 130 | # 131 | 132 | # This will send `user-*-speaking` and `bot-*-speaking` messages. 133 | rtvi_speaking = RTVISpeakingProcessor() 134 | 135 | # This will emit UserTranscript events. 136 | rtvi_user_transcription = RTVIUserTranscriptionProcessor() 137 | 138 | # This will emit BotTranscript events. 139 | rtvi_bot_transcription = RTVIBotTranscriptionProcessor() 140 | 141 | # This will send `metrics` messages. 142 | rtvi_metrics = RTVIMetricsProcessor() 143 | 144 | pipeline = Pipeline( 145 | [ 146 | transport.input(), 147 | context_aggregator.user(), 148 | llm, 149 | rtvi_speaking, 150 | rtvi_user_transcription, 151 | UserTranscriptionFrameFilter(), 152 | rtvi_bot_transcription, 153 | rtvi_metrics, 154 | transport.output(), 155 | context_aggregator.assistant(), 156 | ] 157 | ) 158 | 159 | task = PipelineTask( 160 | pipeline, 161 | PipelineParams( 162 | allow_interruptions=True, 163 | enable_metrics=True, 164 | enable_usage_metrics=True, 165 | ), 166 | ) 167 | 168 | @transport.event_handler("on_first_participant_joined") 169 | async def on_first_participant_joined(transport, participant): 170 | await transport.capture_participant_transcription(participant["id"]) 171 | await task.queue_frames([context_aggregator.user().get_context_frame()]) 172 | 173 | @transport.event_handler("on_participant_left") 174 | async def on_participant_left(transport, participant, reason): 175 | print(f"Participant left: {participant}") 176 | await task.queue_frame(EndFrame()) 177 | 178 | runner = PipelineRunner() 179 | 180 | await runner.run(task) 181 | 182 | 183 | if __name__ == "__main__": 184 | asyncio.run(main()) 185 | -------------------------------------------------------------------------------- /server/env.example: -------------------------------------------------------------------------------- 1 | GOOGLE_API_KEY=... 2 | DAILY_API_KEY=... 
3 | 4 | # DAILY_SAMPLE_ROOM_URL=https://yourdomain.daily.co/yourroom # (for joining the bot to the same room repeatedly for local dev) 5 | -------------------------------------------------------------------------------- /server/requirements.txt: -------------------------------------------------------------------------------- 1 | python-dotenv 2 | fastapi[all] 3 | uvicorn 4 | pipecat-ai[daily,silero,google,openai] 5 | -------------------------------------------------------------------------------- /server/runner.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | import argparse 8 | import os 9 | 10 | import aiohttp 11 | 12 | from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper 13 | 14 | 15 | async def configure(aiohttp_session: aiohttp.ClientSession): 16 | """Configure the Daily room and Daily REST helper.""" 17 | parser = argparse.ArgumentParser(description="Daily AI SDK Bot Sample") 18 | parser.add_argument( 19 | "-u", "--url", type=str, required=False, help="URL of the Daily room to join" 20 | ) 21 | parser.add_argument( 22 | "-k", 23 | "--apikey", 24 | type=str, 25 | required=False, 26 | help="Daily API Key (needed to create an owner token for the room)", 27 | ) 28 | 29 | args, unknown = parser.parse_known_args() 30 | 31 | url = args.url or os.getenv("DAILY_SAMPLE_ROOM_URL") 32 | key = args.apikey or os.getenv("DAILY_API_KEY") 33 | 34 | if not url: 35 | raise Exception( 36 | "No Daily room specified. use the -u/--url option from the command line, or set DAILY_SAMPLE_ROOM_URL in your environment to specify a Daily room URL." 37 | ) 38 | 39 | if not key: 40 | raise Exception( 41 | "No Daily API key specified. use the -k/--apikey option from the command line, or set DAILY_API_KEY in your environment to specify a Daily API key, available from https://dashboard.daily.co/developers." 42 | ) 43 | 44 | daily_rest_helper = DailyRESTHelper( 45 | daily_api_key=key, 46 | daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"), 47 | aiohttp_session=aiohttp_session, 48 | ) 49 | 50 | # Create a meeting token for the given room with an expiration 1 hour in 51 | # the future. 52 | expiry_time: float = 60 * 60 53 | 54 | token = await daily_rest_helper.get_token(url, expiry_time) 55 | 56 | return (url, token) 57 | -------------------------------------------------------------------------------- /server/server.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | """RTVI Bot Server Implementation. 8 | 9 | This FastAPI server manages RTVI bot instances and provides endpoints for both 10 | direct browser access and RTVI client connections. 
It handles: 11 | - Creating Daily rooms 12 | - Managing bot processes 13 | - Providing connection credentials 14 | - Monitoring bot status 15 | 16 | Requirements: 17 | - Daily API key (set in .env file) 18 | - Python 3.10+ 19 | - FastAPI 20 | - Running bot implementation 21 | """ 22 | 23 | import argparse 24 | import os 25 | import subprocess 26 | from contextlib import asynccontextmanager 27 | from typing import Any, Dict 28 | 29 | import aiohttp 30 | from dotenv import load_dotenv 31 | from fastapi import FastAPI, HTTPException, Request 32 | from fastapi.middleware.cors import CORSMiddleware 33 | from fastapi.responses import JSONResponse, RedirectResponse 34 | 35 | from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomParams 36 | 37 | # Load environment variables from .env file 38 | load_dotenv(override=True) 39 | 40 | # Maximum number of bot instances allowed per room 41 | MAX_BOTS_PER_ROOM = 1 42 | 43 | # Dictionary to track bot processes: {pid: (process, room_url)} 44 | bot_procs = {} 45 | 46 | # Store Daily API helpers 47 | daily_helpers = {} 48 | 49 | 50 | def cleanup(): 51 | """Cleanup function to terminate all bot processes. 52 | 53 | Called during server shutdown. 54 | """ 55 | for entry in bot_procs.values(): 56 | proc = entry[0] 57 | proc.terminate() 58 | proc.wait() 59 | 60 | 61 | def get_bot_file(): 62 | return "bot-gemini" 63 | 64 | 65 | @asynccontextmanager 66 | async def lifespan(app: FastAPI): 67 | """FastAPI lifespan manager that handles startup and shutdown tasks. 68 | 69 | - Creates aiohttp session 70 | - Initializes Daily API helper 71 | - Cleans up resources on shutdown 72 | """ 73 | aiohttp_session = aiohttp.ClientSession() 74 | daily_helpers["rest"] = DailyRESTHelper( 75 | daily_api_key=os.getenv("DAILY_API_KEY", ""), 76 | daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"), 77 | aiohttp_session=aiohttp_session, 78 | ) 79 | yield 80 | await aiohttp_session.close() 81 | cleanup() 82 | 83 | 84 | # Initialize FastAPI app with lifespan manager 85 | app = FastAPI(lifespan=lifespan) 86 | 87 | # Configure CORS to allow requests from any origin 88 | app.add_middleware( 89 | CORSMiddleware, 90 | allow_origins=["*"], 91 | allow_credentials=True, 92 | allow_methods=["*"], 93 | allow_headers=["*"], 94 | ) 95 | 96 | 97 | async def create_room_and_token() -> tuple[str, str]: 98 | """Helper function to create a Daily room and generate an access token. 99 | 100 | Returns: 101 | tuple[str, str]: A tuple containing (room_url, token) 102 | 103 | Raises: 104 | HTTPException: If room creation or token generation fails 105 | """ 106 | room = await daily_helpers["rest"].create_room(DailyRoomParams()) 107 | if not room.url: 108 | raise HTTPException(status_code=500, detail="Failed to create room") 109 | 110 | token = await daily_helpers["rest"].get_token(room.url) 111 | if not token: 112 | raise HTTPException(status_code=500, detail=f"Failed to get token for room: {room.url}") 113 | 114 | return room.url, token 115 | 116 | 117 | @app.get("/") 118 | async def start_agent(request: Request): 119 | """Endpoint for direct browser access to the bot. 120 | 121 | Creates a room, starts a bot instance, and redirects to the Daily room URL. 
122 | 123 | Returns: 124 | RedirectResponse: Redirects to the Daily room URL 125 | 126 | Raises: 127 | HTTPException: If room creation, token generation, or bot startup fails 128 | """ 129 | print("Creating room") 130 | room_url, token = await create_room_and_token() 131 | print(f"Room URL: {room_url}") 132 | 133 | # Check if there is already an existing process running in this room 134 | num_bots_in_room = sum( 135 | 1 for proc in bot_procs.values() if proc[1] == room_url and proc[0].poll() is None 136 | ) 137 | if num_bots_in_room >= MAX_BOTS_PER_ROOM: 138 | raise HTTPException(status_code=500, detail=f"Max bot limit reached for room: {room_url}") 139 | 140 | # Spawn a new bot process 141 | try: 142 | bot_file = get_bot_file() 143 | proc = subprocess.Popen( 144 | [f"python3 -m {bot_file} -u {room_url} -t {token}"], 145 | shell=True, 146 | bufsize=1, 147 | cwd=os.path.dirname(os.path.abspath(__file__)), 148 | ) 149 | bot_procs[proc.pid] = (proc, room_url) 150 | except Exception as e: 151 | raise HTTPException(status_code=500, detail=f"Failed to start subprocess: {e}") 152 | 153 | return RedirectResponse(room_url) 154 | 155 | 156 | @app.post("/connect") 157 | async def rtvi_connect(request: Request) -> Dict[Any, Any]: 158 | """RTVI connect endpoint that creates a room and returns connection credentials. 159 | 160 | This endpoint is called by RTVI clients to establish a connection. 161 | 162 | Returns: 163 | Dict[Any, Any]: Authentication bundle containing room_url and token 164 | 165 | Raises: 166 | HTTPException: If room creation, token generation, or bot startup fails 167 | """ 168 | print("Creating room for Pipecat connection") 169 | room_url, token = await create_room_and_token() 170 | print(f"Room URL: {room_url}") 171 | 172 | # Start the bot process 173 | try: 174 | bot_file = get_bot_file() 175 | proc = subprocess.Popen( 176 | [f"python3 -m {bot_file} -u {room_url} -t {token}"], 177 | shell=True, 178 | bufsize=1, 179 | cwd=os.path.dirname(os.path.abspath(__file__)), 180 | ) 181 | bot_procs[proc.pid] = (proc, room_url) 182 | except Exception as e: 183 | raise HTTPException(status_code=500, detail=f"Failed to start subprocess: {e}") 184 | 185 | # Return the authentication bundle in format expected by DailyTransport 186 | return {"room_url": room_url, "token": token} 187 | 188 | 189 | @app.get("/status/{pid}") 190 | def get_status(pid: int): 191 | """Get the status of a specific bot process. 
192 | 193 | Args: 194 | pid (int): Process ID of the bot 195 | 196 | Returns: 197 | JSONResponse: Status information for the bot 198 | 199 | Raises: 200 | HTTPException: If the specified bot process is not found 201 | """ 202 | # Look up the subprocess 203 | proc = bot_procs.get(pid) 204 | 205 | # If the subprocess doesn't exist, return an error 206 | if not proc: 207 | raise HTTPException(status_code=404, detail=f"Bot with process id: {pid} not found") 208 | 209 | # Check the status of the subprocess 210 | status = "running" if proc[0].poll() is None else "finished" 211 | return JSONResponse({"bot_id": pid, "status": status}) 212 | 213 | 214 | if __name__ == "__main__": 215 | import uvicorn 216 | 217 | # Parse command line arguments for server configuration 218 | default_host = os.getenv("HOST", "0.0.0.0") 219 | default_port = int(os.getenv("FAST_API_PORT", "7860")) 220 | 221 | parser = argparse.ArgumentParser(description="Daily Storyteller FastAPI server") 222 | parser.add_argument("--host", type=str, default=default_host, help="Host address") 223 | parser.add_argument("--port", type=int, default=default_port, help="Port number") 224 | parser.add_argument("--reload", action="store_true", help="Reload code on change") 225 | 226 | config = parser.parse_args() 227 | 228 | # Start the FastAPI server 229 | uvicorn.run( 230 | "server:app", 231 | host=config.host, 232 | port=config.port, 233 | reload=config.reload, 234 | ) 235 | -------------------------------------------------------------------------------- /src/app.ts: -------------------------------------------------------------------------------- 1 | import { RTVIClient, RTVIEvent, RTVIMessage, Participant, BotLLMTextData, Transport } from "@pipecat-ai/client-js"; 2 | import { DailyTransport } from "@pipecat-ai/daily-transport"; 3 | 4 | // 5 | // 6 | // 7 | 8 | let joinDiv: HTMLElement; 9 | 10 | document.addEventListener('DOMContentLoaded', () => { 11 | joinDiv = document.getElementById('join-div'); 12 | document.getElementById('start-webrtc-transport-session').addEventListener('click', () => { 13 | startBot('daily'); 14 | }); 15 | }); 16 | 17 | // 18 | // 19 | // 20 | 21 | async function startBot(profileChoice: string) { 22 | let transport: Transport; 23 | 24 | joinDiv.textContent = 'Joining...'; 25 | 26 | 27 | console.log('-- starting Gemini WebRTC connection --'); 28 | transport = new DailyTransport(); 29 | 30 | const rtviClient = new RTVIClient({ 31 | transport, 32 | params: { 33 | baseUrl: "http://localhost:7860/", 34 | }, 35 | enableMic: true, 36 | enableCam: false, 37 | timeout: 30 * 1000, 38 | }); 39 | 40 | setupEventHandlers(rtviClient); 41 | 42 | try { 43 | await rtviClient.initDevices(); 44 | await rtviClient.connect(); 45 | } catch (e) { 46 | console.log('Error connecting', e); 47 | } 48 | } 49 | 50 | // 51 | // 52 | // 53 | 54 | let audioDiv: HTMLDivElement; 55 | let chatTextDiv: HTMLDivElement; 56 | 57 | let currentUserSpeechDiv: HTMLDivElement; 58 | let currentBotSpeechDiv: HTMLDivElement; 59 | let currentSpeaker = ''; // 'user' or 'bot' 60 | 61 | export async function setupEventHandlers(rtviClient: RTVIClient) { 62 | audioDiv = document.getElementById('audio') as HTMLDivElement; 63 | chatTextDiv = document.getElementById('chat-text') as HTMLDivElement; 64 | 65 | rtviClient.on(RTVIEvent.TransportStateChanged, (state: string) => { 66 | console.log(`-- transport state change: ${state} --`); 67 | joinDiv.textContent = `Transport state: ${state}`; 68 | }); 69 | 70 | rtviClient.on(RTVIEvent.Connected, () => { 71 | console.log("-- 
user connected --"); 72 | }); 73 | 74 | rtviClient.on(RTVIEvent.Disconnected, () => { 75 | console.log("-- user disconnected --"); 76 | }); 77 | 78 | rtviClient.on(RTVIEvent.BotConnected, () => { 79 | console.log("-- bot connected --"); 80 | }); 81 | 82 | rtviClient.on(RTVIEvent.BotDisconnected, () => { 83 | console.log("--bot disconnected --"); 84 | }); 85 | 86 | rtviClient.on(RTVIEvent.BotReady, () => { 87 | console.log("-- bot ready to chat! --"); 88 | }); 89 | 90 | rtviClient.on(RTVIEvent.TrackStarted, (track: MediaStreamTrack, participant: Participant) => { 91 | console.log(" --> track started", participant, track); 92 | if (participant.local) { 93 | return; 94 | } 95 | let audio = document.createElement("audio"); 96 | audio.srcObject = new MediaStream([track]); 97 | audio.autoplay = true; 98 | audioDiv.appendChild(audio); 99 | }); 100 | 101 | rtviClient.on(RTVIEvent.UserStartedSpeaking, startUserSpeechBubble); 102 | 103 | rtviClient.on(RTVIEvent.UserStoppedSpeaking, finishUserSpeechBubble); 104 | 105 | rtviClient.on(RTVIEvent.BotStartedSpeaking, startBotSpeechBubble); 106 | 107 | rtviClient.on(RTVIEvent.BotStoppedSpeaking, finishBotSpeechBubble); 108 | 109 | rtviClient.on(RTVIEvent.UserTranscript, (transcript: TranscriptData) => { 110 | if (transcript.final) { 111 | handleUserFinalTranscription(transcript.text); 112 | } else { 113 | handleUserInterimTranscription(transcript.text); 114 | } 115 | }); 116 | 117 | rtviClient.on(RTVIEvent.BotTranscript, handleBotLLMText); 118 | 119 | rtviClient.on(RTVIEvent.Error, (message: RTVIMessage) => { 120 | console.log("[EVENT] RTVI Error!", message); 121 | }); 122 | 123 | rtviClient.on(RTVIEvent.MessageError, (message: RTVIMessage) => { 124 | console.log("[EVENT] RTVI ErrorMessage error!", message); 125 | }); 126 | 127 | rtviClient.on(RTVIEvent.Metrics, (data) => { 128 | // let's only print out ttfb for now 129 | if (! data.ttfb) { 130 | return; 131 | } 132 | data.ttfb.map((metric) => { 133 | console.log(`[METRICS] ${metric.processor} ttfb: ${metric.value}`); 134 | }); 135 | }); 136 | } 137 | 138 | 139 | async function startUserSpeechBubble() { 140 | console.log('-- user started speaking -- '); 141 | if (currentSpeaker === 'user') { 142 | if (currentUserSpeechDiv) { 143 | return; 144 | } 145 | // Should never get here, but, you know. 146 | } 147 | currentSpeaker = 'user'; 148 | currentUserSpeechDiv = document.createElement('div'); 149 | currentUserSpeechDiv.className = 'user-message'; 150 | let span = document.createElement('span'); 151 | currentUserSpeechDiv.appendChild(span); 152 | chatTextDiv.appendChild(currentUserSpeechDiv); 153 | } 154 | 155 | async function finishUserSpeechBubble() { 156 | console.log('-- user stopped speaking -- '); 157 | // noop for now. Could do UI update here. 158 | } 159 | 160 | async function startBotSpeechBubble() { 161 | currentSpeaker = 'bot'; 162 | currentBotSpeechDiv = document.createElement('div'); 163 | currentBotSpeechDiv.className = 'assistant-message'; 164 | chatTextDiv.appendChild(currentBotSpeechDiv); 165 | } 166 | 167 | async function finishBotSpeechBubble() { 168 | console.log('-- bot stopped speaking -- '); 169 | } 170 | 171 | async function handleUserInterimTranscription(text: string) { 172 | // No interim transcriptions (yet) from Gemini Multimodal Live. Leave this 173 | // code here for a future update. 
174 | console.log('interim transcription:', text); 175 | if (currentSpeaker !== 'user') { 176 | return; 177 | } 178 | let span = currentUserSpeechDiv.querySelector('span:last-of-type'); 179 | span.classList.add('interim'); 180 | span.textContent = text + " "; 181 | scroll(); 182 | } 183 | 184 | async function handleUserFinalTranscription(text: string) { 185 | console.log('final transcription:', text); 186 | let span = currentUserSpeechDiv.querySelector('span:last-of-type'); 187 | span.classList.remove('interim'); 188 | span.textContent = text + " "; 189 | let newSpan = document.createElement('span'); 190 | currentUserSpeechDiv.appendChild(newSpan); 191 | scroll(); 192 | } 193 | 194 | async function handleBotLLMText(data: BotLLMTextData) { 195 | console.log('bot llm text:', data.text); 196 | if (!currentBotSpeechDiv) { 197 | return; 198 | } 199 | currentBotSpeechDiv.textContent += data.text; 200 | scroll(); 201 | } 202 | 203 | function scroll() { 204 | window.scrollTo({ 205 | top: document.body.scrollHeight, 206 | behavior: 'smooth' 207 | }); 208 | } 209 | -------------------------------------------------------------------------------- /src/styles.css: -------------------------------------------------------------------------------- 1 | body { 2 | max-width: 800px; 3 | margin: 20px 20px; 4 | ; 5 | font-family: system-ui, -apple-system, sans-serif; 6 | } 7 | 8 | #join-div a { 9 | color: #2563eb; 10 | cursor: pointer; 11 | } 12 | 13 | .user-message { 14 | background-color: #f0f0f0; 15 | margin: 24px 12px; 16 | padding: 16px; 17 | border-radius: 8px; 18 | line-height: 1.5; 19 | } 20 | 21 | .user-message .interim { 22 | color: #707070; 23 | } 24 | 25 | .user-message::before { 26 | content: "User\A"; 27 | font-size: 14px; 28 | color: #666; 29 | font-weight: 500; 30 | display: block; 31 | padding-bottom: 0.4em; 32 | } 33 | 34 | .assistant-message { 35 | background-color: #ffffff; 36 | border: 1px solid #e0e0e0; 37 | margin: 24px 12px; 38 | padding: 16px; 39 | border-radius: 8px; 40 | line-height: 1.5; 41 | } 42 | 43 | .assistant-message::before { 44 | content: "Assistant"; 45 | font-size: 14px; 46 | color: #666; 47 | font-weight: 500; 48 | display: block; 49 | padding-bottom: 0.4em; 50 | } -------------------------------------------------------------------------------- /tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "ES2020", 4 | "useDefineForClassFields": true, 5 | "module": "ESNext", 6 | "lib": [ 7 | "ES2020", 8 | "DOM", 9 | "DOM.Iterable" 10 | ], 11 | "skipLibCheck": true, 12 | /* Bundler mode */ 13 | "moduleResolution": "bundler", 14 | "allowImportingTsExtensions": true, 15 | "resolveJsonModule": true, 16 | "isolatedModules": true, 17 | "noEmit": true, 18 | /* Type Checking */ 19 | "strict": false, 20 | "noUnusedLocals": false, 21 | "noUnusedParameters": false, 22 | "noFallthroughCasesInSwitch": true 23 | } 24 | } --------------------------------------------------------------------------------