├── README.md ├── index.html ├── package.json ├── server ├── bot-gemini.py ├── env.example ├── requirements.txt ├── runner.py └── server.py ├── src ├── app.ts └── styles.css └── tsconfig.json /README.md: -------------------------------------------------------------------------------- 1 | # Gemini Multimodal Live WebRTC Example 2 | 3 | This example shows how to build a very simple voice AI application using the Gemini Multimodal Live API and WebRTC. 4 | 5 | The client is a web app in a single [app.ts](./src/app.ts) file and has just four dependencies: 6 | 1. the Open Source [Pipecat JavaScript SDK](https://github.com/pipecat-ai/pipecat-client-web) 7 | 2. a [Daily WebRTC transport](https://github.com/pipecat-ai/pipecat-client-web-transports) plugin 8 | 3. [ts-node-dev](https://www.npmjs.com/package/ts-node-dev) 9 | 4. [vite](https://vite.dev/) 10 | 11 | The WebRTC connection is just this code, plus event handlers for setting up audio playback and handling any events that you want to wire up to your user interface. 12 | 13 | ```typescript 14 | const rtviClient = new RTVIClient({ 15 | transport, 16 | params: { 17 | baseUrl: "http://localhost:7860/", 18 | }, 19 | enableMic: true, 20 | enableCam: false, 21 | timeout: 30 * 1000, 22 | }); 23 | ``` 24 | 25 | The server is a [Pipecat](https://github.com/pipecat-ai/pipecat) pipeline that uses the Gemini Multimodal Live WebSocket API [implementation](https://github.com/pipecat-ai/pipecat/tree/main/src/pipecat/services/gemini_multimodal_live) in Pipecat core. 26 | 27 | ## Installation and first run 28 | 29 | Use two terminals, one for the web app (client) and one for the server. 30 | 31 | ### Server 32 | 33 | ``` 34 | cd server 35 | cp env.example .env 36 | ``` 37 | 38 | You'll need a Gemini API key and a Daily API key: 39 | * Create a free Gemini API key at [https://aistudio.google.com/](https://aistudio.google.com/) 40 | * Create a free Daily account at [https://dashboard.daily.co/u/signup](https://dashboard.daily.co/u/signup) 41 | 42 | Add both API keys to `.env`. 43 | 44 | Create a Python virtual environment, install the dependencies, and run `python server.py`. 45 | 46 | ``` 47 | python3.12 -m venv venv 48 | source venv/bin/activate 49 | pip install -r requirements.txt 50 | python server.py 51 | ``` 52 | 53 | ### Web app 54 | 55 | ``` 56 | npm i 57 | npm run dev 58 | ``` 59 | 60 | Open a web browser to the **Local** URL that `npm run dev` printed to the console. This defaults to [http://localhost:5173](http://localhost:5173/). 61 | 62 | ### Architecture 63 | 64 | The web app connects to a server running a [Pipecat](https://github.com/pipecat-ai/pipecat) process. Pipecat is an orchestration framework for realtime, multimodal AI. In this use case, Pipecat translates between the WebRTC media streams (and Pipecat client/server events) and the Multimodal Live API. 65 | 66 | You can think of this architecture as a smart proxy. 67 | 68 | 1. The client sends and receives audio and events using the WebRTC protocol. 69 | 2. The Pipecat pipeline relays the audio and events to the Multimodal Live API. 70 | 71 | This two-hop approach has several advantages over connecting directly from the client to the Multimodal Live API WebSocket infrastructure. 72 | 73 | - WebRTC delivers lower latency and better robustness for the median real-world user. Even though we are using a proxy/relay approach, for a large percentage of users, voice-to-voice response times will be measurably faster than using a direct WebSocket connection. See the next section — "Why WebRTC?" — for more details on this.
74 | - You can add functionality using Pipecat's built-in services, or write custom server-side realtime logic in Python. As an example, this Pipecat pipeline implements audio transcription by sending both the user and the model audio to Gemini's standard (non-Live) API. 75 | - The SDK implements the [RTVI open standard](https://docs.pipecat.ai/client/introduction#about-rtvi) for AI client-server events. This rich set of events makes it easy to implement core features of voice-to-voice and multimodal AI applications. Pipecat's React components, for example, leverage this event system. 76 | - Compatible Web, React, React Native, iOS, Android, Python, and C++ SDKs are available, all maintained and developed by a large and active Open Source community. 77 | 78 | There are also two disadvantages to using WebRTC this way. 79 | 80 | - WebRTC is more complicated than WebSockets. You'll need to use an SDK rather than write code that calls low-level WebRTC APIs. (This is generally true for Web applications, though arguing about *how* true it is for programmers with different levels of networking experience is outside the scope of this readme. It's *definitely* true for native mobile applications.) This is the pain point that the Pipecat open source client SDKs aim to alleviate. 81 | - In production, you probably will not want to run your own WebRTC server clusters. WebRTC is closer to telephony, in complexity, than to Web servers. You probably don't run your own SIP/PSTN/SMS infrastructure. So you'll likely pay a WebRTC infrastructure provider per-minute or per-gigabyte for each bot session. The Pipecat SDKs support both WebSockets and WebRTC. You can use WebSockets in development and then transition over to WebRTC when you're ready to deploy to real-world users. 82 | 83 | 84 | ``` 85 | 86 | ┌─────────────────────────────────────────┐ 87 | │ │ 88 | │ Server │ 89 | │ │ 90 | │ │ 91 | │ ┌────────────────────┐ │ 92 | │ │ │ │ 93 | │ │ Pipecat │ │ 94 | │ │ Pipeline │ │ 95 | │ │ │ │ 96 | │ │ │ │ 97 | ┌──────────────────────────┐ │ │ Audio Processing │ │ 98 | │ │ │ │ ▼ │ │ 99 | │ Pipecat Client │ │ ┌─────────────│ Gemini Flash ─┼──┼────► 100 | │ ┌───────────────┐ │ │ │ │ Transcription ◄┼──┼───── 101 | │ │ WebRTC (Daily)│ ────┼────────►│WebRTC (Daily) ▼ │ │ 102 | │ │ Transport │ ◄───┼─────────│ Transport │ Gemini Multimodal─┼──┼────► 103 | │ └───────────────┘ │ │ │ │ Live API ◄┼──┼───── 104 | │ │ │ └─────────────│ ▼ │ │ 105 | └──────────────────────────┘ │ │ Gemini Flash ─┼──┼────► 106 | │ │ Transcription ◄┼──┼───── 107 | │ │ ▼ │ │ 108 | │ │ Conversation │ │ 109 | │ │ Context │ │ 110 | │ │ Management │ │ 111 | │ │ ▼ │ │ 112 | │ │ RTVI Events │ │ 113 | │ │ │ │ 114 | │ └────────────────────┘ │ 115 | │ │ 116 | └─────────────────────────────────────────┘ 117 | ``` 118 | 119 | 120 | One other note about infrastructure: the instructions in this readme show you how to run the Pipecat process on your local machine. For a production application, you would run the bot on a machine in the cloud. 121 | 122 | If you expect your maximum traffic to be ~100 concurrent connections, you can just deploy the `/server` directory from this project. It will scale fine on a single, mid-range cloud VM. For higher traffic volumes, or if you'd rather somebody else handle the infrastructure devops for you, you might consider using a hosted Pipecat platform such as [Daily Bots](https://bots.daily.co/). 123 | 124 | ## Why WebRTC?
125 | 126 | If you're just starting out with voice AI, you might gravitate towards using a WebSocket library for networking. WebSockets are familiar, simple, and widely supported. They work well for server-to-server use cases, for use cases where latency is not a primary concern, and for prototyping and general hacking. 127 | 128 | But WebSockets shouldn't be used in production for client-server, real-time media connections. 129 | 130 | For production apps, you need to use WebRTC. WebRTC was designed from the ground up as *the* protocol for real-time media on the Internet. 131 | 132 | The major problems with WebSockets for real-time media delivery to and from end-user devices are: 133 | 134 | - WebSockets are built on TCP, so audio streams will be subject to head-of-line blocking and will automatically attempt packet resends even if packets are delayed so much that they cannot be used for playout. 135 | 136 | - The Opus audio codec used for WebRTC is tightly coupled to WebRTC's bandwidth estimation and packet pacing (congestion control) logic, making a WebRTC audio stream resilient to a wide range of real-world network behaviors that would cause a WebSocket connection to accumulate latency. 137 | 138 | - The Opus audio codec has very good forward error correction, making the audio stream resilient to relatively high amounts of packet loss. (This only helps you if your network transport can drop late-arriving packets and doesn't do head-of-line blocking, though.) 139 | 140 | - Audio sent and received over WebRTC is automatically time-stamped, so both playout and interruption logic are trivial. Both are harder to get right for all corner cases when using WebSockets. 141 | 142 | - WebRTC includes hooks for detailed performance and media quality statistics. A good WebRTC platform will give you detailed dashboards and analytics for both aggregate and individual session statistics that are specific to audio and video. This level of observability is somewhere between very hard and impossible to build for WebSockets. 143 | 144 | - WebSocket reconnection logic is very hard to implement robustly. You will have to build a ping/ack framework (or fully test and understand the framework that your WebSocket library provides). TCP timeouts and connection events behave differently on different platforms. 145 | 146 | - Finally, good WebRTC implementations today come with very good echo cancellation, noise reduction, and automatic gain control. You will likely need to figure out how to stitch this kind of audio processing into an app that uses WebSockets. 147 | 148 | In addition, long-haul public Internet routes are problematic for latency and real-time media reliability, no matter what the underlying network protocol is. So if your end-users are a significant distance from Google's servers, it's important to try to connect the user to a media router as close to them as possible. Beyond that first "edge" connection, you can then use a more efficient backbone route. A good WebRTC platform will do this for you automatically. 149 | 150 | In this sample application, for example, latency for a typical user will actually be lower routing through Daily's WebRTC network — which has many, many servers close to the edge of the network all over the world — than connecting to a "direct" WebSocket server over a public Internet route. (There are no direct connections on the Internet. Every packet is relayed through multiple routers. Better routing means better response times.)
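As a concrete illustration of the development-to-production path described above (WebSockets while prototyping, WebRTC when you deploy), here is a minimal sketch of how this example's client could isolate the transport choice behind a single function. The `RTVIClient` options mirror [app.ts](./src/app.ts); the WebSocket branch is deliberately left as a stub, since the specific WebSocket transport class is an assumption and not part of this example (see the [pipecat-client-web-transports](https://github.com/pipecat-ai/pipecat-client-web-transports) repo for what's available).

```typescript
import { RTVIClient, Transport } from "@pipecat-ai/client-js";
import { DailyTransport } from "@pipecat-ai/daily-transport";

// Pick the transport in one place. Swapping what this function returns is
// the only client-side change needed to move between a WebSocket transport
// (development) and WebRTC via Daily (production).
function createTransport(useWebRTC: boolean): Transport {
  if (useWebRTC) {
    return new DailyTransport();
  }
  // A WebSocket-based transport from pipecat-client-web-transports would be
  // constructed here; it is omitted because it isn't part of this example.
  throw new Error("WebSocket transport not wired up in this sketch");
}

const rtviClient = new RTVIClient({
  transport: createTransport(true),
  params: {
    baseUrl: "http://localhost:7860/",
  },
  enableMic: true,
  enableCam: false,
  timeout: 30 * 1000,
});
```

Everything else on the client (event handlers, transcription rendering, audio playback) should work the same regardless of which transport the function returns, because the RTVI events are transport-independent.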
151 | 152 | ## More resources 153 | 154 | The Gemini Multimodal Live docs overview is here: 155 | 156 | https://ai.google.dev/api/multimodal-live 157 | 158 | The repo for the Pipecat SDK for Gemini and WebRTC on the web (and React) is here: 159 | 160 | https://github.com/pipecat-ai/pipecat-client-web 161 | 162 | Contributions are welcome! If you want to write a new pluggable network transport for the SDK, check out this repo and README: 163 | 164 | https://github.com/pipecat-ai/pipecat-client-web-transports 165 | 166 | Here's a full-featured multimodal chat application that demonstrates how to use the Gemini Multimodal Live WebSocket API, HTTP single-turn APIs, and WebRTC all in one app. (They all have their place for different use cases.) 167 | 168 | https://github.com/pipecat-ai/gemini-multimodal-live-demo 169 | 170 | Pipecat's SDKs for Web, React, React Native, Android, iOS, Python, and C++ are all architecture-compatible and Open Source. 171 | 172 | The Pipecat Android SDK repo is here: 173 | 174 | https://github.com/pipecat-ai/pipecat-client-android 175 | 176 | The Pipecat iOS SDK repo is here: 177 | 178 | https://github.com/pipecat-ai/pipecat-client-ios 179 | 180 | If you're interested in network protocols designed for sending media, here's a technical overview of RTMP, HLS, and WebRTC: 181 | 182 | https://www.daily.co/blog/video-live-streaming/ 183 | 184 | For a deep dive into WebRTC edge and mesh routing, here's a long post about Daily's global WebRTC infrastructure: 185 | 186 | https://www.daily.co/blog/global-mesh-network/ 187 | 188 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Simple Gemini WebRTC Demo 6 | 7 | 8 | 9 | 10 | 11 |
12 |

13 | Click here to start a WebRTC session using the Gemini 14 | Multimodal Live API with audio routed to the cloud using WebRTC. 15 |

16 |

17 | Open the Console to see transport events. 18 |

19 |
20 | 21 |
22 | 23 |
24 | 25 | 26 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "1116", 3 | "version": "1.0.0", 4 | "main": "server/server.ts", 5 | "scripts": { 6 | "dev": "vite", 7 | "build": "vite build", 8 | "preview": "vite preview" 9 | }, 10 | "keywords": [], 11 | "author": "", 12 | "license": "ISC", 13 | "description": "", 14 | "dependencies": { 15 | "@pipecat-ai/client-js": "^0.3.1", 16 | "@pipecat-ai/daily-transport": "^0.3.2", 17 | "ts-node-dev": "^2.0.0", 18 | "vite": "^6.0.4" 19 | } 20 | } -------------------------------------------------------------------------------- /server/bot-gemini.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | """Gemini Bot Implementation. 8 | 9 | This module implements a chatbot using Google's Gemini Multimodal Live model. 10 | It includes: 11 | - Real-time audio/video interaction through Daily 12 | - Speech-to-speech using the Gemini Multimodal Live API 13 | - Transcription using Gemini's generate_content API 14 | - RTVI client/server events 15 | 16 | """ 17 | 18 | import asyncio 19 | from datetime import date 20 | import os 21 | import sys 22 | 23 | import aiohttp 24 | from dotenv import load_dotenv 25 | from loguru import logger 26 | from runner import configure 27 | 28 | from pipecat.audio.vad.silero import SileroVADAnalyzer 29 | from pipecat.audio.vad.vad_analyzer import VADParams 30 | from pipecat.frames.frames import Frame, EndFrame, TranscriptionFrame 31 | from pipecat.processors.frame_processor import FrameDirection, FrameProcessor 32 | from pipecat.pipeline.pipeline import Pipeline 33 | from pipecat.pipeline.runner import PipelineRunner 34 | from pipecat.pipeline.task import PipelineParams, PipelineTask 35 | from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext 36 | from pipecat.processors.frameworks.rtvi import ( 37 | RTVIBotTranscriptionProcessor, 38 | RTVIMetricsProcessor, 39 | RTVISpeakingProcessor, 40 | RTVIUserTranscriptionProcessor, 41 | ) 42 | from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveLLMService 43 | from pipecat.transports.services.daily import DailyParams, DailyTransport 44 | 45 | load_dotenv(override=True) 46 | 47 | logger.remove(0) 48 | logger.add(sys.stderr, level="DEBUG") 49 | 50 | SYSTEM_INSTRUCTION = f""" 51 | "You are Gemini Chatbot, a friendly, helpful robot. 52 | 53 | Your goal is to demonstrate your capabilities in a succinct way. 54 | 55 | Your output will be converted to audio so don't include special characters in your answers. 56 | 57 | Respond to what the user said in a creative and helpful way. Keep your responses brief. One or two sentences at most. 58 | 59 | Today is {date.today().strftime("%A, %B %d, %Y")} 60 | """ 61 | 62 | 63 | class UserTranscriptionFrameFilter(FrameProcessor): 64 | """Filter out UserTranscription frames.""" 65 | 66 | async def process_frame(self, frame: Frame, direction: FrameDirection): 67 | await super().process_frame(frame, direction) 68 | 69 | if isinstance(frame, TranscriptionFrame) and frame.user_id == "user": 70 | return 71 | 72 | await self.push_frame(frame, direction) 73 | 74 | 75 | async def main(): 76 | """Main bot execution function. 
77 | 78 | Sets up and runs the bot pipeline including: 79 | - Daily video transport with specific audio parameters 80 | - Gemini Live multimodal model integration 81 | - Voice activity detection 82 | - Animation processing 83 | - RTVI event handling 84 | """ 85 | async with aiohttp.ClientSession() as session: 86 | (room_url, token) = await configure(session) 87 | 88 | # Set up Daily transport with specific audio/video parameters for Gemini 89 | transport = DailyTransport( 90 | room_url, 91 | token, 92 | "Chatbot", 93 | DailyParams( 94 | audio_in_sample_rate=16000, 95 | audio_out_sample_rate=24000, 96 | audio_out_enabled=True, 97 | camera_out_enabled=True, 98 | camera_out_width=1024, 99 | camera_out_height=576, 100 | vad_enabled=True, 101 | vad_audio_passthrough=True, 102 | vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)), 103 | ), 104 | ) 105 | 106 | # Initialize the Gemini Multimodal Live model 107 | llm = GeminiMultimodalLiveLLMService( 108 | api_key=os.getenv("GOOGLE_API_KEY"), 109 | voice_id="Puck", # Aoede, Charon, Fenrir, Kore, Puck 110 | transcribe_user_audio=True, 111 | transcribe_model_audio=True, 112 | system_instruction=SYSTEM_INSTRUCTION, 113 | tools=[{"google_search": {}}, {"code_execution": {}}], 114 | ) 115 | 116 | messages = [ 117 | { 118 | "role": "user", 119 | "content": 'Start by saying "Hello, I\'m Gemini".', 120 | }, 121 | ] 122 | 123 | # Set up conversation context and management 124 | # The context_aggregator will automatically collect conversation context 125 | context = OpenAILLMContext(messages) 126 | context_aggregator = llm.create_context_aggregator(context) 127 | 128 | # 129 | # RTVI events for Pipecat client UI 130 | # 131 | 132 | # This will send `user-*-speaking` and `bot-*-speaking` messages. 133 | rtvi_speaking = RTVISpeakingProcessor() 134 | 135 | # This will emit UserTranscript events. 136 | rtvi_user_transcription = RTVIUserTranscriptionProcessor() 137 | 138 | # This will emit BotTranscript events. 139 | rtvi_bot_transcription = RTVIBotTranscriptionProcessor() 140 | 141 | # This will send `metrics` messages. 142 | rtvi_metrics = RTVIMetricsProcessor() 143 | 144 | pipeline = Pipeline( 145 | [ 146 | transport.input(), 147 | context_aggregator.user(), 148 | llm, 149 | rtvi_speaking, 150 | rtvi_user_transcription, 151 | UserTranscriptionFrameFilter(), 152 | rtvi_bot_transcription, 153 | rtvi_metrics, 154 | transport.output(), 155 | context_aggregator.assistant(), 156 | ] 157 | ) 158 | 159 | task = PipelineTask( 160 | pipeline, 161 | PipelineParams( 162 | allow_interruptions=True, 163 | enable_metrics=True, 164 | enable_usage_metrics=True, 165 | ), 166 | ) 167 | 168 | @transport.event_handler("on_first_participant_joined") 169 | async def on_first_participant_joined(transport, participant): 170 | await transport.capture_participant_transcription(participant["id"]) 171 | await task.queue_frames([context_aggregator.user().get_context_frame()]) 172 | 173 | @transport.event_handler("on_participant_left") 174 | async def on_participant_left(transport, participant, reason): 175 | print(f"Participant left: {participant}") 176 | await task.queue_frame(EndFrame()) 177 | 178 | runner = PipelineRunner() 179 | 180 | await runner.run(task) 181 | 182 | 183 | if __name__ == "__main__": 184 | asyncio.run(main()) 185 | -------------------------------------------------------------------------------- /server/env.example: -------------------------------------------------------------------------------- 1 | GOOGLE_API_KEY=... 2 | DAILY_API_KEY=... 
3 | 4 | # DAILY_SAMPLE_ROOM_URL=https://yourdomain.daily.co/yourroom # (for joining the bot to the same room repeatedly for local dev) 5 | -------------------------------------------------------------------------------- /server/requirements.txt: -------------------------------------------------------------------------------- 1 | python-dotenv 2 | fastapi[all] 3 | uvicorn 4 | pipecat-ai[daily,silero,google,openai] 5 | -------------------------------------------------------------------------------- /server/runner.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | import argparse 8 | import os 9 | 10 | import aiohttp 11 | 12 | from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper 13 | 14 | 15 | async def configure(aiohttp_session: aiohttp.ClientSession): 16 | """Configure the Daily room and Daily REST helper.""" 17 | parser = argparse.ArgumentParser(description="Daily AI SDK Bot Sample") 18 | parser.add_argument( 19 | "-u", "--url", type=str, required=False, help="URL of the Daily room to join" 20 | ) 21 | parser.add_argument( 22 | "-k", 23 | "--apikey", 24 | type=str, 25 | required=False, 26 | help="Daily API Key (needed to create an owner token for the room)", 27 | ) 28 | 29 | args, unknown = parser.parse_known_args() 30 | 31 | url = args.url or os.getenv("DAILY_SAMPLE_ROOM_URL") 32 | key = args.apikey or os.getenv("DAILY_API_KEY") 33 | 34 | if not url: 35 | raise Exception( 36 | "No Daily room specified. use the -u/--url option from the command line, or set DAILY_SAMPLE_ROOM_URL in your environment to specify a Daily room URL." 37 | ) 38 | 39 | if not key: 40 | raise Exception( 41 | "No Daily API key specified. use the -k/--apikey option from the command line, or set DAILY_API_KEY in your environment to specify a Daily API key, available from https://dashboard.daily.co/developers." 42 | ) 43 | 44 | daily_rest_helper = DailyRESTHelper( 45 | daily_api_key=key, 46 | daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"), 47 | aiohttp_session=aiohttp_session, 48 | ) 49 | 50 | # Create a meeting token for the given room with an expiration 1 hour in 51 | # the future. 52 | expiry_time: float = 60 * 60 53 | 54 | token = await daily_rest_helper.get_token(url, expiry_time) 55 | 56 | return (url, token) 57 | -------------------------------------------------------------------------------- /server/server.py: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (c) 2024, Daily 3 | # 4 | # SPDX-License-Identifier: BSD 2-Clause License 5 | # 6 | 7 | """RTVI Bot Server Implementation. 8 | 9 | This FastAPI server manages RTVI bot instances and provides endpoints for both 10 | direct browser access and RTVI client connections. 
It handles: 11 | - Creating Daily rooms 12 | - Managing bot processes 13 | - Providing connection credentials 14 | - Monitoring bot status 15 | 16 | Requirements: 17 | - Daily API key (set in .env file) 18 | - Python 3.10+ 19 | - FastAPI 20 | - Running bot implementation 21 | """ 22 | 23 | import argparse 24 | import os 25 | import subprocess 26 | from contextlib import asynccontextmanager 27 | from typing import Any, Dict 28 | 29 | import aiohttp 30 | from dotenv import load_dotenv 31 | from fastapi import FastAPI, HTTPException, Request 32 | from fastapi.middleware.cors import CORSMiddleware 33 | from fastapi.responses import JSONResponse, RedirectResponse 34 | 35 | from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomParams 36 | 37 | # Load environment variables from .env file 38 | load_dotenv(override=True) 39 | 40 | # Maximum number of bot instances allowed per room 41 | MAX_BOTS_PER_ROOM = 1 42 | 43 | # Dictionary to track bot processes: {pid: (process, room_url)} 44 | bot_procs = {} 45 | 46 | # Store Daily API helpers 47 | daily_helpers = {} 48 | 49 | 50 | def cleanup(): 51 | """Cleanup function to terminate all bot processes. 52 | 53 | Called during server shutdown. 54 | """ 55 | for entry in bot_procs.values(): 56 | proc = entry[0] 57 | proc.terminate() 58 | proc.wait() 59 | 60 | 61 | def get_bot_file(): 62 | return "bot-gemini" 63 | 64 | 65 | @asynccontextmanager 66 | async def lifespan(app: FastAPI): 67 | """FastAPI lifespan manager that handles startup and shutdown tasks. 68 | 69 | - Creates aiohttp session 70 | - Initializes Daily API helper 71 | - Cleans up resources on shutdown 72 | """ 73 | aiohttp_session = aiohttp.ClientSession() 74 | daily_helpers["rest"] = DailyRESTHelper( 75 | daily_api_key=os.getenv("DAILY_API_KEY", ""), 76 | daily_api_url=os.getenv("DAILY_API_URL", "https://api.daily.co/v1"), 77 | aiohttp_session=aiohttp_session, 78 | ) 79 | yield 80 | await aiohttp_session.close() 81 | cleanup() 82 | 83 | 84 | # Initialize FastAPI app with lifespan manager 85 | app = FastAPI(lifespan=lifespan) 86 | 87 | # Configure CORS to allow requests from any origin 88 | app.add_middleware( 89 | CORSMiddleware, 90 | allow_origins=["*"], 91 | allow_credentials=True, 92 | allow_methods=["*"], 93 | allow_headers=["*"], 94 | ) 95 | 96 | 97 | async def create_room_and_token() -> tuple[str, str]: 98 | """Helper function to create a Daily room and generate an access token. 99 | 100 | Returns: 101 | tuple[str, str]: A tuple containing (room_url, token) 102 | 103 | Raises: 104 | HTTPException: If room creation or token generation fails 105 | """ 106 | room = await daily_helpers["rest"].create_room(DailyRoomParams()) 107 | if not room.url: 108 | raise HTTPException(status_code=500, detail="Failed to create room") 109 | 110 | token = await daily_helpers["rest"].get_token(room.url) 111 | if not token: 112 | raise HTTPException(status_code=500, detail=f"Failed to get token for room: {room.url}") 113 | 114 | return room.url, token 115 | 116 | 117 | @app.get("/") 118 | async def start_agent(request: Request): 119 | """Endpoint for direct browser access to the bot. 120 | 121 | Creates a room, starts a bot instance, and redirects to the Daily room URL. 
122 | 123 | Returns: 124 | RedirectResponse: Redirects to the Daily room URL 125 | 126 | Raises: 127 | HTTPException: If room creation, token generation, or bot startup fails 128 | """ 129 | print("Creating room") 130 | room_url, token = await create_room_and_token() 131 | print(f"Room URL: {room_url}") 132 | 133 | # Check if there is already an existing process running in this room 134 | num_bots_in_room = sum( 135 | 1 for proc in bot_procs.values() if proc[1] == room_url and proc[0].poll() is None 136 | ) 137 | if num_bots_in_room >= MAX_BOTS_PER_ROOM: 138 | raise HTTPException(status_code=500, detail=f"Max bot limit reached for room: {room_url}") 139 | 140 | # Spawn a new bot process 141 | try: 142 | bot_file = get_bot_file() 143 | proc = subprocess.Popen( 144 | [f"python3 -m {bot_file} -u {room_url} -t {token}"], 145 | shell=True, 146 | bufsize=1, 147 | cwd=os.path.dirname(os.path.abspath(__file__)), 148 | ) 149 | bot_procs[proc.pid] = (proc, room_url) 150 | except Exception as e: 151 | raise HTTPException(status_code=500, detail=f"Failed to start subprocess: {e}") 152 | 153 | return RedirectResponse(room_url) 154 | 155 | 156 | @app.post("/connect") 157 | async def rtvi_connect(request: Request) -> Dict[Any, Any]: 158 | """RTVI connect endpoint that creates a room and returns connection credentials. 159 | 160 | This endpoint is called by RTVI clients to establish a connection. 161 | 162 | Returns: 163 | Dict[Any, Any]: Authentication bundle containing room_url and token 164 | 165 | Raises: 166 | HTTPException: If room creation, token generation, or bot startup fails 167 | """ 168 | print("Creating room for Pipecat connection") 169 | room_url, token = await create_room_and_token() 170 | print(f"Room URL: {room_url}") 171 | 172 | # Start the bot process 173 | try: 174 | bot_file = get_bot_file() 175 | proc = subprocess.Popen( 176 | [f"python3 -m {bot_file} -u {room_url} -t {token}"], 177 | shell=True, 178 | bufsize=1, 179 | cwd=os.path.dirname(os.path.abspath(__file__)), 180 | ) 181 | bot_procs[proc.pid] = (proc, room_url) 182 | except Exception as e: 183 | raise HTTPException(status_code=500, detail=f"Failed to start subprocess: {e}") 184 | 185 | # Return the authentication bundle in format expected by DailyTransport 186 | return {"room_url": room_url, "token": token} 187 | 188 | 189 | @app.get("/status/{pid}") 190 | def get_status(pid: int): 191 | """Get the status of a specific bot process. 
192 | 193 | Args: 194 | pid (int): Process ID of the bot 195 | 196 | Returns: 197 | JSONResponse: Status information for the bot 198 | 199 | Raises: 200 | HTTPException: If the specified bot process is not found 201 | """ 202 | # Look up the subprocess 203 | proc = bot_procs.get(pid) 204 | 205 | # If the subprocess doesn't exist, return an error 206 | if not proc: 207 | raise HTTPException(status_code=404, detail=f"Bot with process id: {pid} not found") 208 | 209 | # Check the status of the subprocess 210 | status = "running" if proc[0].poll() is None else "finished" 211 | return JSONResponse({"bot_id": pid, "status": status}) 212 | 213 | 214 | if __name__ == "__main__": 215 | import uvicorn 216 | 217 | # Parse command line arguments for server configuration 218 | default_host = os.getenv("HOST", "0.0.0.0") 219 | default_port = int(os.getenv("FAST_API_PORT", "7860")) 220 | 221 | parser = argparse.ArgumentParser(description="Daily Storyteller FastAPI server") 222 | parser.add_argument("--host", type=str, default=default_host, help="Host address") 223 | parser.add_argument("--port", type=int, default=default_port, help="Port number") 224 | parser.add_argument("--reload", action="store_true", help="Reload code on change") 225 | 226 | config = parser.parse_args() 227 | 228 | # Start the FastAPI server 229 | uvicorn.run( 230 | "server:app", 231 | host=config.host, 232 | port=config.port, 233 | reload=config.reload, 234 | ) 235 | -------------------------------------------------------------------------------- /src/app.ts: -------------------------------------------------------------------------------- 1 | import { RTVIClient, RTVIEvent, RTVIMessage, Participant, BotLLMTextData, Transport } from "@pipecat-ai/client-js"; 2 | import { DailyTransport } from "@pipecat-ai/daily-transport"; 3 | 4 | // 5 | // 6 | // 7 | 8 | let joinDiv: HTMLElement; 9 | 10 | document.addEventListener('DOMContentLoaded', () => { 11 | joinDiv = document.getElementById('join-div'); 12 | document.getElementById('start-webrtc-transport-session').addEventListener('click', () => { 13 | startBot('daily'); 14 | }); 15 | }); 16 | 17 | // 18 | // 19 | // 20 | 21 | async function startBot(profileChoice: string) { 22 | let transport: Transport; 23 | 24 | joinDiv.textContent = 'Joining...'; 25 | 26 | 27 | console.log('-- starting Gemini WebRTC connection --'); 28 | transport = new DailyTransport(); 29 | 30 | const rtviClient = new RTVIClient({ 31 | transport, 32 | params: { 33 | baseUrl: "http://localhost:7860/", 34 | }, 35 | enableMic: true, 36 | enableCam: false, 37 | timeout: 30 * 1000, 38 | }); 39 | 40 | setupEventHandlers(rtviClient); 41 | 42 | try { 43 | await rtviClient.initDevices(); 44 | await rtviClient.connect(); 45 | } catch (e) { 46 | console.log('Error connecting', e); 47 | } 48 | } 49 | 50 | // 51 | // 52 | // 53 | 54 | let audioDiv: HTMLDivElement; 55 | let chatTextDiv: HTMLDivElement; 56 | 57 | let currentUserSpeechDiv: HTMLDivElement; 58 | let currentBotSpeechDiv: HTMLDivElement; 59 | let currentSpeaker = ''; // 'user' or 'bot' 60 | 61 | export async function setupEventHandlers(rtviClient: RTVIClient) { 62 | audioDiv = document.getElementById('audio') as HTMLDivElement; 63 | chatTextDiv = document.getElementById('chat-text') as HTMLDivElement; 64 | 65 | rtviClient.on(RTVIEvent.TransportStateChanged, (state: string) => { 66 | console.log(`-- transport state change: ${state} --`); 67 | joinDiv.textContent = `Transport state: ${state}`; 68 | }); 69 | 70 | rtviClient.on(RTVIEvent.Connected, () => { 71 | console.log("-- 
user connected --"); 72 | }); 73 | 74 | rtviClient.on(RTVIEvent.Disconnected, () => { 75 | console.log("-- user disconnected --"); 76 | }); 77 | 78 | rtviClient.on(RTVIEvent.BotConnected, () => { 79 | console.log("-- bot connected --"); 80 | }); 81 | 82 | rtviClient.on(RTVIEvent.BotDisconnected, () => { 83 | console.log("--bot disconnected --"); 84 | }); 85 | 86 | rtviClient.on(RTVIEvent.BotReady, () => { 87 | console.log("-- bot ready to chat! --"); 88 | }); 89 | 90 | rtviClient.on(RTVIEvent.TrackStarted, (track: MediaStreamTrack, participant: Participant) => { 91 | console.log(" --> track started", participant, track); 92 | if (participant.local) { 93 | return; 94 | } 95 | let audio = document.createElement("audio"); 96 | audio.srcObject = new MediaStream([track]); 97 | audio.autoplay = true; 98 | audioDiv.appendChild(audio); 99 | }); 100 | 101 | rtviClient.on(RTVIEvent.UserStartedSpeaking, startUserSpeechBubble); 102 | 103 | rtviClient.on(RTVIEvent.UserStoppedSpeaking, finishUserSpeechBubble); 104 | 105 | rtviClient.on(RTVIEvent.BotStartedSpeaking, startBotSpeechBubble); 106 | 107 | rtviClient.on(RTVIEvent.BotStoppedSpeaking, finishBotSpeechBubble); 108 | 109 | rtviClient.on(RTVIEvent.UserTranscript, (transcript: TranscriptData) => { 110 | if (transcript.final) { 111 | handleUserFinalTranscription(transcript.text); 112 | } else { 113 | handleUserInterimTranscription(transcript.text); 114 | } 115 | }); 116 | 117 | rtviClient.on(RTVIEvent.BotTranscript, handleBotLLMText); 118 | 119 | rtviClient.on(RTVIEvent.Error, (message: RTVIMessage) => { 120 | console.log("[EVENT] RTVI Error!", message); 121 | }); 122 | 123 | rtviClient.on(RTVIEvent.MessageError, (message: RTVIMessage) => { 124 | console.log("[EVENT] RTVI ErrorMessage error!", message); 125 | }); 126 | 127 | rtviClient.on(RTVIEvent.Metrics, (data) => { 128 | // let's only print out ttfb for now 129 | if (! data.ttfb) { 130 | return; 131 | } 132 | data.ttfb.map((metric) => { 133 | console.log(`[METRICS] ${metric.processor} ttfb: ${metric.value}`); 134 | }); 135 | }); 136 | } 137 | 138 | 139 | async function startUserSpeechBubble() { 140 | console.log('-- user started speaking -- '); 141 | if (currentSpeaker === 'user') { 142 | if (currentUserSpeechDiv) { 143 | return; 144 | } 145 | // Should never get here, but, you know. 146 | } 147 | currentSpeaker = 'user'; 148 | currentUserSpeechDiv = document.createElement('div'); 149 | currentUserSpeechDiv.className = 'user-message'; 150 | let span = document.createElement('span'); 151 | currentUserSpeechDiv.appendChild(span); 152 | chatTextDiv.appendChild(currentUserSpeechDiv); 153 | } 154 | 155 | async function finishUserSpeechBubble() { 156 | console.log('-- user stopped speaking -- '); 157 | // noop for now. Could do UI update here. 158 | } 159 | 160 | async function startBotSpeechBubble() { 161 | currentSpeaker = 'bot'; 162 | currentBotSpeechDiv = document.createElement('div'); 163 | currentBotSpeechDiv.className = 'assistant-message'; 164 | chatTextDiv.appendChild(currentBotSpeechDiv); 165 | } 166 | 167 | async function finishBotSpeechBubble() { 168 | console.log('-- bot stopped speaking -- '); 169 | } 170 | 171 | async function handleUserInterimTranscription(text: string) { 172 | // No interim transcriptions (yet) from Gemini Multimodal Live. Leave this 173 | // code here for a future update. 
174 | console.log('interim transcription:', text); 175 | if (currentSpeaker !== 'user') { 176 | return; 177 | } 178 | let span = currentUserSpeechDiv.querySelector('span:last-of-type'); 179 | span.classList.add('interim'); 180 | span.textContent = text + " "; 181 | scroll(); 182 | } 183 | 184 | async function handleUserFinalTranscription(text: string) { 185 | console.log('final transcription:', text); 186 | let span = currentUserSpeechDiv.querySelector('span:last-of-type'); 187 | span.classList.remove('interim'); 188 | span.textContent = text + " "; 189 | let newSpan = document.createElement('span'); 190 | currentUserSpeechDiv.appendChild(newSpan); 191 | scroll(); 192 | } 193 | 194 | async function handleBotLLMText(data: BotLLMTextData) { 195 | console.log('bot llm text:', data.text); 196 | if (!currentBotSpeechDiv) { 197 | return; 198 | } 199 | currentBotSpeechDiv.textContent += data.text; 200 | scroll(); 201 | } 202 | 203 | function scroll() { 204 | window.scrollTo({ 205 | top: document.body.scrollHeight, 206 | behavior: 'smooth' 207 | }); 208 | } 209 | -------------------------------------------------------------------------------- /src/styles.css: -------------------------------------------------------------------------------- 1 | body { 2 | max-width: 800px; 3 | margin: 20px 20px; 4 | ; 5 | font-family: system-ui, -apple-system, sans-serif; 6 | } 7 | 8 | #join-div a { 9 | color: #2563eb; 10 | cursor: pointer; 11 | } 12 | 13 | .user-message { 14 | background-color: #f0f0f0; 15 | margin: 24px 12px; 16 | padding: 16px; 17 | border-radius: 8px; 18 | line-height: 1.5; 19 | } 20 | 21 | .user-message .interim { 22 | color: #707070; 23 | } 24 | 25 | .user-message::before { 26 | content: "User\A"; 27 | font-size: 14px; 28 | color: #666; 29 | font-weight: 500; 30 | display: block; 31 | padding-bottom: 0.4em; 32 | } 33 | 34 | .assistant-message { 35 | background-color: #ffffff; 36 | border: 1px solid #e0e0e0; 37 | margin: 24px 12px; 38 | padding: 16px; 39 | border-radius: 8px; 40 | line-height: 1.5; 41 | } 42 | 43 | .assistant-message::before { 44 | content: "Assistant"; 45 | font-size: 14px; 46 | color: #666; 47 | font-weight: 500; 48 | display: block; 49 | padding-bottom: 0.4em; 50 | } -------------------------------------------------------------------------------- /tsconfig.json: -------------------------------------------------------------------------------- 1 | { 2 | "compilerOptions": { 3 | "target": "ES2020", 4 | "useDefineForClassFields": true, 5 | "module": "ESNext", 6 | "lib": [ 7 | "ES2020", 8 | "DOM", 9 | "DOM.Iterable" 10 | ], 11 | "skipLibCheck": true, 12 | /* Bundler mode */ 13 | "moduleResolution": "bundler", 14 | "allowImportingTsExtensions": true, 15 | "resolveJsonModule": true, 16 | "isolatedModules": true, 17 | "noEmit": true, 18 | /* Type Checking */ 19 | "strict": false, 20 | "noUnusedLocals": false, 21 | "noUnusedParameters": false, 22 | "noFallthroughCasesInSwitch": true 23 | } 24 | } --------------------------------------------------------------------------------