├── .cursorrules
├── .env.sample
├── .gitignore
├── .pre-commit-config.yaml
├── AI_DOCS
│   ├── Agency-Swarm Docs.md
│   ├── event_docs.md
│   ├── js_implementation.md
│   └── realtime_api_docs.md
├── README.md
├── personalization.json
├── pyproject.toml
├── src
│   └── voice_assistant
│       ├── __init__.py
│       ├── agencies
│       │   ├── ResearchAgency
│       │   │   ├── AnalystAgent
│       │   │   │   ├── AnalystAgent.py
│       │   │   │   └── instructions.md
│       │   │   ├── BrowsingAgent
│       │   │   │   ├── BrowsingAgent.py
│       │   │   │   ├── instructions.md
│       │   │   │   ├── requirements.txt
│       │   │   │   └── tools
│       │   │   │       ├── ClickElement.py
│       │   │   │       ├── ExportFile.py
│       │   │   │       ├── GoBack.py
│       │   │   │       ├── ReadURL.py
│       │   │   │       ├── Scroll.py
│       │   │   │       ├── SelectDropdown.py
│       │   │   │       ├── SendKeys.py
│       │   │   │       ├── SolveCaptcha.py
│       │   │   │       ├── WebPageSummarizer.py
│       │   │   │       ├── __init__.py
│       │   │   │       └── util
│       │   │   │           ├── __init__.py
│       │   │   │           ├── get_b64_screenshot.py
│       │   │   │           ├── highlights.py
│       │   │   │           └── selenium.py
│       │   │   ├── agency.py
│       │   │   └── agency_manifesto.md
│       │   └── __init__.py
│       ├── audio.py
│       ├── config.py
│       ├── icon.png
│       ├── main.py
│       ├── microphone.py
│       ├── models.py
│       ├── tests
│       │   └── test_realtime_connection.py
│       ├── tools
│       │   ├── CreateFile.py
│       │   ├── DeleteFile.py
│       │   ├── DraftGmail.py
│       │   ├── FetchDailyMeetingSchedule.py
│       │   ├── GetCurrentDateTime.py
│       │   ├── GetGmailSummary.py
│       │   ├── GetResponse.py
│       │   ├── GetScreenDescription.py
│       │   ├── OpenBrowser.py
│       │   ├── SendMessage.py
│       │   ├── SendMessageAsync.py
│       │   ├── UpdateFile.py
│       │   └── __init__.py
│       ├── utils
│       │   ├── __init__.py
│       │   ├── decorators.py
│       │   ├── google_services_utils.py
│       │   ├── llm_utils.py
│       │   └── log_utils.py
│       ├── visual_interface.py
│       └── websocket_handler.py
└── uv.lock

/.cursorrules:
--------------------------------------------------------------------------------
1 | # AI Developer for Voice Assistant Project Instructions
2 | 
3 | You are an expert AI developer; your mission is to develop tools and agents that enhance the capabilities of other agents.
4 | These tools and agents are pivotal for enabling agents to communicate, collaborate, and efficiently achieve their collective objectives.
5 | Below are detailed instructions to guide you through the process of creating tools and agents, ensuring they are both functional and align with the framework's standards.
6 | 
7 | ## Understanding Your Role
8 | 
9 | Your primary role is to architect tools and agents that fulfill specific needs within the voice assistant project. This involves:
10 | 
11 | 1. **Tool Development:** Develop each tool following Agency Swarm's specifications, ensuring it is robust and ready for production environments. It must not use any placeholders and must be located in the correct agent's tools folder.
12 | 2. **Identifying Packages:** Determine the best possible packages or APIs that can be used to create a tool based on the user's requirements. Utilize web search if you are uncertain about which API or package to use.
13 | 3. **Instructions for the Agent**: If the agent is underperforming, you will need to adjust its instructions based on the user's feedback. Find the instructions.md file for the agent and adjust it.
14 | 
15 | ## Voice Assistant Project Introduction
16 | 
17 | This document provides comprehensive instructions for developing tools and agents within the Voice Assistant project. The project is structured to include both standalone tools and Agency Swarm agencies, each with its distinct development approach and location within the project structure.
18 | 
19 | ## High-level Folder Structure of Voice Assistant Project
20 | 
21 | The Voice Assistant project is organized as follows:
22 | 
23 | ```
24 | src/voice_assistant/
25 | ├── agencies/
26 | │   ├── agency_name/
27 | │   │   ├── agent_name/
28 | │   │   │   ├── __init__.py
29 | │   │   │   ├── agent_name.py
30 | │   │   │   ├── instructions.md
31 | │   │   │   └── tools/
32 | │   │   │       └── ...
33 | │   │   ├── another_agent/
34 | │   │   │   ├── __init__.py
35 | │   │   │   ├── another_agent.py
36 | │   │   │   ├── instructions.md
37 | │   │   │   └── tools/
38 | │   │   │       └── ...
39 | │   │   ├── agency.py
40 | │   │   └── agency_manifesto.md
41 | │   └── ...
42 | ├── tools/
43 | │   ├── ToolName.py
44 | │   └── ...
45 | ```
46 | 
47 | ## Standalone Tools vs. Agency Swarm Agencies
48 | 
49 | It's crucial to understand the distinction between standalone tools and Agency Swarm agencies within this project:
50 | 
51 | 1. **Standalone Tools (/tools directory):**
52 | 
53 |    - Located in the `/tools` directory
54 |    - Must be adapted from Agency-Swarm standards
55 |    - Developed as individual, reusable components
56 |    - Follow specific guidelines for standalone tool development
57 | 
58 | 2. **Agency Swarm Agencies (/agencies directory):**
59 |    - Located in the `/agencies` directory
60 |    - Follow normal Agency Swarm development practices
61 |    - Organized into agencies and agents with their respective tools
62 | 
63 | Now, let's delve into the specific instructions for Agency Swarm development, which primarily apply to the `/agencies` directory.
64 | 
65 | --- Start of Agency Swarm Framework Instructions ---
66 | 
67 | ## Agency Swarm Framework Overview
68 | 
69 | Agency Swarm started as a desire and effort of Arsenii Shatokhin (aka VRSEN) to fully automate his AI Agency with AI. By building this framework, we aim to simplify the agent creation process and enable anyone to create a collaborative swarm of agents (Agencies), each with distinct roles and capabilities.
70 | 
71 | ### Key Features
72 | 
73 | - **Customizable Agent Roles**: Define roles like CEO, virtual assistant, developer, etc., and customize their functionalities with [Assistants API](https://platform.openai.com/docs/assistants/overview).
74 | - **Full Control Over Prompts**: Avoid conflicts and restrictions of pre-defined prompts, allowing full customization.
75 | - **Tool Creation**: Tools within Agency Swarm are created using Pydantic, which provides a convenient interface and automatic type validation.
76 | - **Efficient Communication**: Agents communicate through a specially designed "send message" tool based on their own descriptions.
77 | - **State Management**: Agency Swarm efficiently manages the state of your assistants on OpenAI, maintaining it in a special `settings.json` file.
78 | - **Deployable in Production**: Agency Swarm is designed to be reliable and easily deployable in production environments.
79 | 
80 | ### Folder Structure
81 | 
82 | In Agency Swarm, the folder structure is organized as follows:
83 | 
84 | 1. Each agency and agent has its own dedicated folder.
85 | 2. Within each agent folder:
86 | 
87 |    - A 'tools' folder contains all tools for that agent.
88 |    - An 'instructions.md' file provides agent-specific instructions.
89 |    - An '__init__.py' file contains the import of the agent.
90 | 
91 | 3. Tool Import Process:
92 | 
93 |    - Create a file in the 'tools' folder with the same name as the tool class.
94 |    - The tool needs to be added to the tools list in the agent class. Do not overwrite existing tools when adding a new tool.
95 |    - All new requirements must be added to the requirements.txt file.
96 | 
97 | 4. Agency Configuration:
98 |    - The 'agency.py' file is the main file where all new agents are imported.
99 |    - When creating a new agency folder, use descriptive names, for example: marketing_agency, development_agency, etc.
100 | 
101 | Follow this folder structure when creating or modifying files within the Agency Swarm framework:
102 | 
103 | ```
104 | agency_name/
105 | ├── agent_name/
106 | │   ├── __init__.py
107 | │   ├── agent_name.py
108 | │   ├── instructions.md
109 | │   └── tools/
110 | │       ├── tool_name1.py
111 | │       ├── tool_name2.py
112 | │       ├── tool_name3.py
113 | │       ├── ...
114 | ├── another_agent/
115 | │   ├── __init__.py
116 | │   ├── another_agent.py
117 | │   ├── instructions.md
118 | │   └── tools/
119 | │       ├── tool_name1.py
120 | │       ├── tool_name2.py
121 | │       ├── tool_name3.py
122 | │       ├── ...
123 | ├── agency.py
124 | ├── agency_manifesto.md
125 | ├── requirements.txt
126 | └── ...
127 | ```
128 | 
129 | ## Instructions
130 | 
131 | ### 1. Create tools
132 | 
133 | Tools are the specific actions that agents can perform. They are defined in the `tools` folder.
134 | 
135 | When creating a tool, you are defining a new class that extends `BaseTool` from `agency_swarm.tools`. This process involves several key steps, outlined below.
136 | 
137 | #### 1.1. Import Necessary Modules
138 | 
139 | Start by importing `BaseTool` from `agency_swarm.tools` and `Field` from `pydantic`. These imports will serve as the foundation for your custom tool class. Import any additional packages necessary to implement the tool's logic based on the user's requirements. Import `load_dotenv` from `dotenv` to load the environment variables.
140 | 
141 | #### 1.2. Define Your Tool Class
142 | 
143 | Create a new class that inherits from `BaseTool`. This class will encapsulate the functionality of your tool. The `BaseTool` class inherits from Pydantic's `BaseModel` class.
144 | 
145 | #### 1.3. Specify Tool Fields
146 | 
147 | Define the fields your tool will use, utilizing Pydantic's `Field` for clear descriptions and validation. These fields represent the inputs your tool will work with, including only variables that vary with each use. Define any constant variables globally.
148 | 
149 | #### 1.4. Implement the `run` Method
150 | 
151 | The `run` method is where your tool's logic is executed. Use the fields defined earlier to perform the tool's intended task. It must contain fully functional, correct Python code. It can utilize the various Python packages imported in step 1.1.
152 | 
153 | ### Best Practices
154 | 
155 | - **Identify Necessary Packages**: Determine the best packages or APIs to use for creating the tool based on the requirements.
156 | - **Documentation**: Ensure each class and method is well-documented. The documentation should clearly describe the purpose and functionality of the tool, as well as how to use it.
157 | - **Code Quality**: Write clean, readable, and efficient code. Adhere to the PEP 8 style guide for Python code.
158 | - **Web Research**: Utilize web browsing to identify the most relevant packages, APIs, or documentation necessary for implementing your tool's logic.
159 | - **Use Python Packages**: Prefer to use the various API wrapper packages and SDKs available on pip, rather than calling these APIs directly using requests.
160 | - **Expect API Keys to be defined as env variables**: If a tool requires an API key or an access token, it must be accessed from the environment using the os package within the `run` method's logic.
161 | - **Use global variables for constants**: If a tool requires a constant global variable that does not change from use to use (for example, ad_account_id, pull_request_id, etc.), define it as a constant global variable above the tool class, instead of inside a Pydantic `Field`.
162 | - **Add a test case at the bottom of the file**: Add a test case for each tool in an `if __name__ == "__main__":` block.
163 | 
164 | ### Example of a Tool
165 | 
166 | ```python
167 | from agency_swarm.tools import BaseTool
168 | from pydantic import Field
169 | import os
170 | from dotenv import load_dotenv
171 | 
172 | load_dotenv()  # always load the environment variables
173 | 
174 | account_id = "MY_ACCOUNT_ID"
175 | api_key = os.getenv("MY_API_KEY")  # or access_token = os.getenv("MY_ACCESS_TOKEN")
176 | 
177 | class MyCustomTool(BaseTool):
178 |     """
179 |     A brief description of what the custom tool does.
180 |     The docstring should clearly explain the tool's purpose and functionality.
181 |     It will be used by the agent to determine when to use this tool.
182 |     """
183 |     # Define the fields with descriptions using Pydantic Field
184 |     example_field: str = Field(
185 |         ..., description="Description of the example field, explaining its purpose and usage for the Agent."
186 |     )
187 | 
188 |     def run(self):
189 |         """
190 |         The implementation of the run method, where the tool's main functionality is executed.
191 |         This method should utilize the fields defined above to perform the task.
192 |         """
193 |         # Your custom tool logic goes here
194 |         # Example:
195 |         # do_something(self.example_field, api_key, account_id)
196 | 
197 |         # Return the result of the tool's operation as a string
198 |         return "Result of MyCustomTool operation"
199 | 
200 | if __name__ == "__main__":
201 |     tool = MyCustomTool(example_field="example value")
202 |     print(tool.run())
203 | ```
204 | 
205 | Remember, each tool code snippet you create must be fully ready to use. It must not contain any placeholders or hypothetical examples.
206 | 
207 | ### 2. Create agents
208 | 
209 | Agents are the core of the framework. Each agent has its own unique role and functionality and is designed to perform specific tasks. Each file for the agent must be named the same as the agent's name.
210 | 
211 | #### Agent Class
212 | 
213 | To create an agent, import `Agent` from `agency_swarm` and create a class that inherits from `Agent`. Inside the class you can adjust the following parameters:
214 | 
215 | ```python
216 | from agency_swarm import Agent
217 | 
218 | class CEO(Agent):
219 |     def __init__(self):
220 |         super().__init__(
221 |             name="CEO",
222 |             description="Responsible for client communication, task planning and management.",
223 |             instructions="./instructions.md",  # instructions for the agent
224 |             tools=[MyCustomTool],
225 |             temperature=0.5,
226 |             max_prompt_tokens=25000,
227 |         )
228 | ```
229 | 
230 | - Name: The agent's name, reflecting its role.
231 | - Description: A brief summary of the agent's responsibilities.
232 | - Instructions: Path to a markdown file containing detailed instructions for the agent.
233 | - Tools: A list of tools (extending BaseTool) that the agent can use. (Tools must not be initialized, so the agent can pass the parameters itself.)
234 | - Other Parameters: Additional settings like temperature, max_prompt_tokens, etc.
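For reference, the agent folder's `__init__.py` mentioned in the folder structure above usually just re-exports the agent class so that agency.py can import it cleanly. A minimal sketch (hypothetical — it assumes the agent folder contains a `CEO.py` file defining a `CEO` class):

```python
# __init__.py — re-export the agent class for convenient imports
# (assumes this agent folder contains CEO.py defining the CEO class)
from .CEO import CEO

__all__ = ["CEO"]
```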
235 | 
236 | Make sure to create a separate folder for each agent, as described in the folder structure above. After creating the agent, you need to import it into the agency.py file.
237 | 
238 | #### instructions.md file
239 | 
240 | Each agent also needs to have an `instructions.md` file, which is the system prompt for the agent. Inside those instructions, you need to define the following:
241 | 
242 | - **Agent Role**: A description of the role of the agent.
243 | - **Goals**: A list of goals that the agent should achieve, aligned with the agency's mission.
244 | - **Process Workflow**: A step-by-step guide on how the agent should perform its tasks. Each step must be aligned with the other agents in the agency, and with the tools available to this agent.
245 | 
246 | Use the following template for the instructions.md file:
247 | 
248 | ```md
249 | # Agent Role
250 | 
251 | A description of the role of the agent.
252 | 
253 | # Goals
254 | 
255 | A list of goals that the agent should achieve, aligned with the agency's mission.
256 | 
257 | # Process Workflow
258 | 
259 | 1. Step 1
260 | 2. Step 2
261 | 3. Step 3
262 | ```
263 | 
264 | Write the agent's instructions in markdown format. They should include a description of the role and a specific step-by-step process that this agent needs to perform in order to execute its tasks. The process must also be aligned with all the other agents in the agency. Agents should be able to collaborate with each other to achieve the common goal of the agency.
265 | 
266 | #### Code Interpreter and FileSearch Options
267 | 
268 | To utilize the Code Interpreter tool (the Jupyter Notebook Execution environment, without Internet access) and the FileSearch tool (a Retrieval-Augmented Generation (RAG) capability provided by OpenAI):
269 | 
270 | 1. Import the tools:
271 | 
272 | ```python
273 | from agency_swarm.tools import CodeInterpreter, FileSearch
274 | 
275 | ```
276 | 
277 | 2. Add the tools to the agent's tools list:
278 | 
279 | ```python
280 | agent = Agent(
281 |     name="MyAgent",
282 |     tools=[CodeInterpreter, FileSearch],
283 |     # ... other agent parameters
284 | )
285 | 
286 | ```
287 | 
288 | ### 3. Create Agencies
289 | 
290 | Agencies are collections of agents that work together to achieve a common goal. They are defined in the `agency.py` file.
291 | 
292 | #### Agency Class
293 | 
294 | To create an agency, import `Agency` from `agency_swarm` and create an instance of it with your agency chart.
In the constructor, you can adjust the following parameters:
295 | 
296 | ```python
297 | from agency_swarm import Agency
298 | from CEO import CEO
299 | from .developers.developer import Developer
300 | from .virtual_assistants.virtual_assistant import VirtualAssistant
301 | 
302 | ceo = CEO()
303 | dev = Developer()
304 | va = VirtualAssistant()
305 | agency = Agency([
306 |     ceo,  # CEO will be the entry point for communication with the user
307 |     [ceo, dev],  # CEO can initiate communication with Developer
308 |     [ceo, va],  # CEO can initiate communication with Virtual Assistant
309 |     [dev, va]  # Developer can initiate communication with Virtual Assistant
310 | ],
311 | shared_instructions='agency_manifesto.md',  # shared instructions for all agents
312 | temperature=0.5,  # default temperature for all agents
313 | max_prompt_tokens=25000  # default max tokens in conversation history
314 | )
315 | 
316 | if __name__ == "__main__":
317 |     agency.run_demo()  # starts the agency in terminal
318 | ```
319 | 
320 | #### Communication Flows
321 | 
322 | In Agency Swarm, communication flows are directional, meaning they are established from left to right in the agency_chart definition. For instance, in the example above, the CEO can initiate a chat with the developer (dev), and the developer can respond in this chat. However, the developer cannot initiate a chat with the CEO. The developer can initiate a chat with the virtual assistant (va) and assign new tasks.
323 | 
324 | To allow agents to communicate with each other, simply add them in the second-level list inside the agency chart like this: `[ceo, dev], [ceo, va], [dev, va]`. The agent on the left will be able to communicate with the agent on the right.
325 | 
326 | #### Agency Manifesto
327 | 
328 | The agency manifesto is a file that contains shared instructions for all agents in the agency. It is a markdown file that is located in the agency folder. Please write the manifesto file when creating a new agency. Include the following:
329 | 
330 | - **Agency Description**: A brief description of the agency.
331 | - **Mission Statement**: A concise statement that encapsulates the purpose and guiding principles of the agency.
332 | - **Operating Environment**: A description of the operating environment of the agency.
333 | 
334 | ## Notes
335 | 
336 | IMPORTANT: NEVER output code snippets or file contents in the chat. Always create or modify the actual files in the file system. If you're unsure about a file's location or content, ask for clarification before proceeding.
337 | 
338 | When creating or modifying files:
339 | 
340 | 1. Use the appropriate file creation or modification syntax (e.g., ```python:path/to/file.py for Python files).
341 | 2. Write the full content of the file, not just snippets or placeholders.
342 | 3. Ensure all necessary imports and dependencies are included.
343 | 4. Follow the specified file creation order rigorously: 1. tools, 2. agents, 3. agency, 4. requirements.txt.
344 | 
345 | If you find yourself about to output code in the chat, STOP and reconsider your approach. Always prioritize actual file creation and modification over chat explanations.
346 | 
347 | --- End of Agency Swarm Instructions ---
348 | 
349 | ## Standalone Tools in /tools Directory
350 | 
351 | To reiterate the distinction, the `/tools` directory contains standalone tools that are adapted from Agency-Swarm standards but are not directly part of any specific agent or agency. When developing these tools:
352 | 
353 | 1. Place all standalone tools in the `src/voice_assistant/tools/` directory.
354 | 2. Each tool should be in its own file, named after the tool class (e.g., `GetCurrentDateTime.py` for the `GetCurrentDateTime` class).
355 | 3. Tools must inherit from `BaseTool` from `agency_swarm.tools`.
356 | 4. Use async syntax for the `run` method.
357 | 5. For synchronous operations within async tools, use `asyncio.to_thread`.
358 | 6. Always use environment variables for API keys and sensitive information.
359 | 7. Add a test case at the bottom of each tool file.
360 | 
361 | These standalone tools can be used across different agencies or independently, providing flexibility and reusability within the Voice Assistant project.
362 | 
363 | Remember, when developing within the Voice Assistant project, always consider whether you're working on a standalone tool (/tools) or an Agency Swarm agency (/agencies) and follow the appropriate guidelines for each.
364 | 
--------------------------------------------------------------------------------
/.env.sample:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=
2 | PERSONALIZATION_FILE=./personalization.json
3 | SCRATCH_PAD_DIR=./scratchpad
4 | EMAIL_SENDER=sender@example.com
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | output/
2 | input/
3 | 
4 | # Based on https://raw.githubusercontent.com/github/gitignore/main/Node.gitignore
5 | 
6 | # Logs
7 | 
8 | logs
9 | *.log
10 | npm-debug.log*
11 | yarn-debug.log*
12 | yarn-error.log*
13 | lerna-debug.log*
14 | .pnpm-debug.log*
15 | 
16 | # Caches
17 | 
18 | .cache
19 | 
20 | # Diagnostic reports (https://nodejs.org/api/report.html)
21 | 
22 | report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json
23 | 
24 | # Runtime data
25 | 
26 | pids
27 | *.pid
28 | *.seed
29 | *.pid.lock
30 | 
31 | # Directory for instrumented libs generated by jscoverage/JSCover
32 | 
33 | lib-cov
34 | 
35 | # Coverage directory used by tools like istanbul
36 | 
37 | coverage
38 | *.lcov
39 | 
40 | # nyc test coverage
41 | 
42 | .nyc_output
43 | 
44 | # Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
45 | 
46 | .grunt
47 | 
48 | # Bower dependency directory (https://bower.io/)
49 | 
50 | bower_components
51 | 
52 | # node-waf configuration
53 | 
54 | .lock-wscript
55 | 
56 | # Compiled binary addons (https://nodejs.org/api/addons.html)
57 | 
58 | build/Release
59 | 
60 | # Dependency directories
61 | 
62 | node_modules/
63 | jspm_packages/
64 | 
65 | # Snowpack dependency directory (https://snowpack.dev/)
66 | 
67 | web_modules/
68 | 
69 | # TypeScript cache
70 | 
71 | *.tsbuildinfo
72 | 
73 | # Optional npm cache directory
74 | 
75 | .npm
76 | 
77 | # Optional eslint cache
78 | 
79 | .eslintcache
80 | 
81 | # Optional stylelint cache
82 | 
83 | .stylelintcache
84 | 
85 | # Microbundle cache
86 | 
87 | .rpt2_cache/
88 | .rts2_cache_cjs/
89 | .rts2_cache_es/
90 | .rts2_cache_umd/
91 | 
92 | # Optional REPL history
93 | 
94 | .node_repl_history
95 | 
96 | # Output of 'npm pack'
97 | 
98 | *.tgz
99 | 
100 | # Yarn Integrity file
101 | 
102 | .yarn-integrity
103 | 
104 | # dotenv environment variable files
105 | 
106 | .env
107 | .env.development.local
108 | .env.test.local
109 | .env.production.local
110 | .env.local
111 | 
112 | # parcel-bundler cache (https://parceljs.org/)
113 | 
114 | .parcel-cache
115 | 
116 | # Next.js build output
117 | 
118 | .next
119 | out
120 | 
121 | # Nuxt.js build / generate output
122 | 
123 | .nuxt
124 | dist
125 | 
126 | # Gatsby files
127 | 
128 | # Comment in the public line in if your project uses Gatsby and not Next.js
129 | 
130 | # https://nextjs.org/blog/next-9-1#public-directory-support
131 | 
132 | # public
133 | 
134 | # vuepress build output
135 | 
136 | .vuepress/dist
137 | 
138 | # vuepress v2.x temp and cache directory
139 | 
140 | .temp
141 | 
142 | # Docusaurus cache and generated files
143 | 
144 | .docusaurus
145 | 
146 | # Serverless directories
147 | 
148 | .serverless/
149 | 
150 | # FuseBox cache
151 | 
152 | .fusebox/
153 | 
154 | # DynamoDB Local files
155 | 
156 | .dynamodb/
157 | 
158 | # TernJS port file
159 | 
160 | .tern-port
161 | 
162 | # Stores VSCode versions used for testing VSCode extensions
163 | 
164 | .vscode-test
165 | 
166 | # yarn v2
167 | 
168 | .yarn/cache
169 | .yarn/unplugged
170 | .yarn/build-state.yml
171 | .yarn/install-state.gz
172 | .pnp.*
173 | 
174 | # IntelliJ based IDEs
175 | .idea
176 | 
177 | # Finder (MacOS) folder config
178 | .DS_Store
179 | .aider*
180 | 
181 | __pycache__/
182 | 
183 | .venv/
184 | 
185 | apps/marimo-prompt-library/prompt_executions
186 | 
187 | .env
188 | 
189 | scratchpad/
190 | 
191 | runtime_time_table.jsonl
192 | 
193 | settings.json
194 | token.json
195 | credentials.json
196 | 
197 | screenshot.jpg
--------------------------------------------------------------------------------
/.pre-commit-config.yaml:
--------------------------------------------------------------------------------
1 | repos:
2 |   - repo: https://github.com/pre-commit/pre-commit-hooks
3 |     rev: v5.0.0
4 |     hooks:
5 |       - id: trailing-whitespace
6 |       - id: end-of-file-fixer
7 |       - id: check-yaml
8 |       - id: check-toml
9 |       - id: debug-statements
10 |         language_version: python3
11 | 
12 |   - repo: https://github.com/astral-sh/ruff-pre-commit
13 |     rev: v0.6.9
14 |     hooks:
15 |       - id: ruff
16 |         args: [--fix, --select=I]
17 |       - id: ruff-format
--------------------------------------------------------------------------------
/AI_DOCS/Agency-Swarm Docs.md:
--------------------------------------------------------------------------------
1 | # AI Agent Creator Instructions for Agency Swarm Framework
2 | 
3 | You are an expert AI developer; your mission is to develop tools and agents that enhance the capabilities of other agents. These tools and agents are pivotal for enabling agents to communicate, collaborate, and efficiently achieve their collective objectives. Below are detailed instructions to guide you through the process of creating tools and agents, ensuring they are both functional and align with the framework's standards.
4 | 
5 | ## Understanding Your Role
6 | 
7 | Your primary role is to architect tools and agents that fulfill specific needs within the agency. This involves:
8 | 
9 | 1. **Tool Development:** Develop each tool following Agency Swarm's specifications, ensuring it is robust and ready for production environments. It must not use any placeholders and must be located in the correct agent's tools folder.
10 | 2. **Identifying Packages:** Determine the best possible packages or APIs that can be used to create a tool based on the user's requirements. Utilize web search if you are uncertain about which API or package to use.
11 | 3. **Instructions for the Agent**: If the agent is underperforming, you will need to adjust its instructions based on the user's feedback. Find the instructions.md file for the agent and adjust it.
12 | 
13 | ## Agency Swarm Framework Overview
14 | 
15 | Agency Swarm started as a desire and effort of Arsenii Shatokhin (aka VRSEN) to fully automate his AI Agency with AI. By building this framework, we aim to simplify the agent creation process and enable anyone to create a collaborative swarm of agents (Agencies), each with distinct roles and capabilities.
16 | 
17 | ### Key Features
18 | 
19 | - **Customizable Agent Roles**: Define roles like CEO, virtual assistant, developer, etc., and customize their functionalities with [Assistants API](https://platform.openai.com/docs/assistants/overview).
20 | - **Full Control Over Prompts**: Avoid conflicts and restrictions of pre-defined prompts, allowing full customization.
21 | - **Tool Creation**: Tools within Agency Swarm are created using Pydantic, which provides a convenient interface and automatic type validation.
22 | - **Efficient Communication**: Agents communicate through a specially designed "send message" tool based on their own descriptions.
23 | - **State Management**: Agency Swarm efficiently manages the state of your assistants on OpenAI, maintaining it in a special `settings.json` file.
24 | - **Deployable in Production**: Agency Swarm is designed to be reliable and easily deployable in production environments.
25 | 
26 | ### Folder Structure
27 | 
28 | In Agency Swarm, the folder structure is organized as follows:
29 | 
30 | 1. Each agency and agent has its own dedicated folder.
31 | 2. Within each agent folder:
32 | 
33 |    - A 'tools' folder contains all tools for that agent.
34 |    - An 'instructions.md' file provides agent-specific instructions.
35 |    - An '__init__.py' file contains the import of the agent.
36 | 
37 | 3. Tool Import Process:
38 | 
39 |    - Create a file in the 'tools' folder with the same name as the tool class.
40 |    - The tool needs to be added to the tools list in the agent class. Do not overwrite existing tools when adding a new tool.
41 |    - All new requirements must be added to the requirements.txt file.
42 | 
43 | 4. Agency Configuration:
44 |    - The 'agency.py' file is the main file where all new agents are imported.
45 |    - When creating a new agency folder, use descriptive names, for example: marketing_agency, development_agency, etc.
46 | 
47 | Follow this folder structure when creating or modifying files within the Agency Swarm framework:
48 | 
49 | ```
50 | agency_name/
51 | ├── agent_name/
52 | │   ├── __init__.py
53 | │   ├── agent_name.py
54 | │   ├── instructions.md
55 | │   └── tools/
56 | │       ├── tool_name1.py
57 | │       ├── tool_name2.py
58 | │       ├── tool_name3.py
59 | │       ├── ...
60 | ├── another_agent/
61 | │   ├── __init__.py
62 | │   ├── another_agent.py
63 | │   ├── instructions.md
64 | │   └── tools/
65 | │       ├── tool_name1.py
66 | │       ├── tool_name2.py
67 | │       ├── tool_name3.py
68 | │       ├── ...
69 | ├── agency.py
70 | ├── agency_manifesto.md
71 | ├── requirements.txt
72 | └── ...
73 | ```
74 | 
75 | ## Instructions
76 | 
77 | ### 1. Create tools
78 | 
79 | Tools are the specific actions that agents can perform. They are defined in the `tools` folder.
80 | 
81 | When creating a tool, you are defining a new class that extends `BaseTool` from `agency_swarm.tools`. This process involves several key steps, outlined below.
82 | 
83 | #### 1. Import Necessary Modules
84 | 
85 | Start by importing `BaseTool` from `agency_swarm.tools` and `Field` from `pydantic`. These imports will serve as the foundation for your custom tool class. Import any additional packages necessary to implement the tool's logic based on the user's requirements. Import `load_dotenv` from `dotenv` to load the environment variables.
86 | 
87 | #### 2. Define Your Tool Class
88 | 
89 | Create a new class that inherits from `BaseTool`. This class will encapsulate the functionality of your tool. The `BaseTool` class inherits from Pydantic's `BaseModel` class.
90 | 
91 | #### 3. Specify Tool Fields
92 | 
93 | Define the fields your tool will use, utilizing Pydantic's `Field` for clear descriptions and validation. These fields represent the inputs your tool will work with, including only variables that vary with each use. Define any constant variables globally.
94 | 
95 | #### 4. Implement the `run` Method
96 | 
97 | The `run` method is where your tool's logic is executed. Use the fields defined earlier to perform the tool's intended task. It must contain fully functional, correct Python code. It can utilize the various Python packages previously imported in step 1.
98 | 
99 | ### Best Practices
100 | 
101 | - **Identify Necessary Packages**: Determine the best packages or APIs to use for creating the tool based on the requirements.
102 | - **Documentation**: Ensure each class and method is well-documented. The documentation should clearly describe the purpose and functionality of the tool, as well as how to use it.
103 | - **Code Quality**: Write clean, readable, and efficient code. Adhere to the PEP 8 style guide for Python code.
104 | - **Web Research**: Utilize web browsing to identify the most relevant packages, APIs, or documentation necessary for implementing your tool's logic.
105 | - **Use Python Packages**: Prefer to use the various API wrapper packages and SDKs available on pip, rather than calling these APIs directly using requests.
106 | - **Expect API Keys to be defined as env variables**: If a tool requires an API key or an access token, it must be accessed from the environment using the os package within the `run` method's logic.
107 | - **Use global variables for constants**: If a tool requires a constant global variable that does not change from use to use (for example, ad_account_id, pull_request_id, etc.), define it as a constant global variable above the tool class, instead of inside a Pydantic `Field`.
108 | - **Add a test case at the bottom of the file**: Add a test case for each tool in an `if __name__ == "__main__":` block.
109 | 
110 | ### Example of a Tool
111 | 
112 | ```python
113 | from agency_swarm.tools import BaseTool
114 | from pydantic import Field
115 | import os
116 | from dotenv import load_dotenv
117 | 
118 | load_dotenv()  # always load the environment variables
119 | 
120 | account_id = "MY_ACCOUNT_ID"
121 | api_key = os.getenv("MY_API_KEY")  # or access_token = os.getenv("MY_ACCESS_TOKEN")
122 | 
123 | class MyCustomTool(BaseTool):
124 |     """
125 |     A brief description of what the custom tool does.
126 |     The docstring should clearly explain the tool's purpose and functionality.
127 |     It will be used by the agent to determine when to use this tool.
128 |     """
129 |     # Define the fields with descriptions using Pydantic Field
130 |     example_field: str = Field(
131 |         ..., description="Description of the example field, explaining its purpose and usage for the Agent."
132 |     )
133 | 
134 |     def run(self):
135 |         """
136 |         The implementation of the run method, where the tool's main functionality is executed.
137 |         This method should utilize the fields defined above to perform the task.
138 |         """
139 |         # Your custom tool logic goes here
140 |         # Example:
141 |         # do_something(self.example_field, api_key, account_id)
142 | 
143 |         # Return the result of the tool's operation as a string
144 |         return "Result of MyCustomTool operation"
145 | 
146 | if __name__ == "__main__":
147 |     tool = MyCustomTool(example_field="example value")
148 |     print(tool.run())
149 | ```
150 | 
151 | Remember, each tool code snippet you create must be fully ready to use. It must not contain any placeholders or hypothetical examples.
152 | 
153 | ## 2. Create agents
154 | 
155 | Agents are the core of the framework. Each agent has its own unique role and functionality and is designed to perform specific tasks. Each file for the agent must be named the same as the agent's name.
156 | 
157 | ### Agent Class
158 | 
159 | To create an agent, import `Agent` from `agency_swarm` and create a class that inherits from `Agent`. Inside the class you can adjust the following parameters:
160 | 
161 | ```python
162 | from agency_swarm import Agent
163 | 
164 | class CEO(Agent):
165 |     def __init__(self):
166 |         super().__init__(
167 |             name="CEO",
168 |             description="Responsible for client communication, task planning and management.",
169 |             instructions="./instructions.md",  # instructions for the agent
170 |             tools=[MyCustomTool],
171 |             temperature=0.5,
172 |             max_prompt_tokens=25000,
173 |         )
174 | ```
175 | 
176 | - Name: The agent's name, reflecting its role.
177 | - Description: A brief summary of the agent's responsibilities.
178 | - Instructions: Path to a markdown file containing detailed instructions for the agent.
179 | - Tools: A list of tools (extending BaseTool) that the agent can use. (Tools must not be initialized, so the agent can pass the parameters itself.)
180 | - Other Parameters: Additional settings like temperature, max_prompt_tokens, etc.
181 | 
182 | Make sure to create a separate folder for each agent, as described in the folder structure above. After creating the agent, you need to import it into the agency.py file.
183 | 
184 | #### instructions.md file
185 | 
186 | Each agent also needs to have an `instructions.md` file, which is the system prompt for the agent. Inside those instructions, you need to define the following:
187 | 
188 | - **Agent Role**: A description of the role of the agent.
189 | - **Goals**: A list of goals that the agent should achieve, aligned with the agency's mission.
190 | - **Process Workflow**: A step-by-step guide on how the agent should perform its tasks. Each step must be aligned with the other agents in the agency, and with the tools available to this agent.
191 | 
192 | Use the following template for the instructions.md file:
193 | 
194 | ```md
195 | # Agent Role
196 | 
197 | A description of the role of the agent.
198 | 
199 | # Goals
200 | 
201 | A list of goals that the agent should achieve, aligned with the agency's mission.
202 | 
203 | # Process Workflow
204 | 
205 | 1. Step 1
206 | 2. Step 2
207 | 3. Step 3
208 | ```
209 | 
210 | Write the agent's instructions in markdown format. They should include a description of the role and a specific step-by-step process that this agent needs to perform in order to execute its tasks. The process must also be aligned with all the other agents in the agency. Agents should be able to collaborate with each other to achieve the common goal of the agency.
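As an illustration only — the agent name, goals, and steps below are hypothetical, not part of the framework — a filled-in instructions.md might read:

```md
# Agent Role

You are a Research Assistant responsible for gathering information on topics assigned by the CEO agent.

# Goals

1. Find accurate, up-to-date information on each assigned topic.
2. Deliver concise summaries that other agents can act on.

# Process Workflow

1. Receive a research request from the CEO agent.
2. Use your tools to gather and verify relevant information.
3. Summarize the findings and report back to the CEO agent.
```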
211 | 
212 | #### Code Interpreter and FileSearch Options
213 | 
214 | To utilize the Code Interpreter tool (the Jupyter Notebook Execution environment, without Internet access) and the FileSearch tool (a Retrieval-Augmented Generation (RAG) capability provided by OpenAI):
215 | 
216 | 1. Import the tools:
217 | 
218 | ```python
219 | from agency_swarm.tools import CodeInterpreter, FileSearch
220 | 
221 | ```
222 | 
223 | 2. Add the tools to the agent's tools list:
224 | 
225 | ```python
226 | agent = Agent(
227 |     name="MyAgent",
228 |     tools=[CodeInterpreter, FileSearch],
229 |     # ... other agent parameters
230 | )
231 | 
232 | ```
233 | 
234 | ## 3. Create Agencies
235 | 
236 | Agencies are collections of agents that work together to achieve a common goal. They are defined in the `agency.py` file.
237 | 
238 | ### Agency Class
239 | 
240 | To create an agency, import `Agency` from `agency_swarm` and create an instance of it with your agency chart. In the constructor, you can adjust the following parameters:
241 | 
242 | ```python
243 | from agency_swarm import Agency
244 | from CEO import CEO
245 | from Developer import Developer
246 | from VirtualAssistant import VirtualAssistant
247 | 
248 | ceo = CEO()
249 | dev = Developer()
250 | va = VirtualAssistant()
251 | agency = Agency([
252 |     ceo,  # CEO will be the entry point for communication with the user
253 |     [ceo, dev],  # CEO can initiate communication with Developer
254 |     [ceo, va],  # CEO can initiate communication with Virtual Assistant
255 |     [dev, va]  # Developer can initiate communication with Virtual Assistant
256 | ],
257 | shared_instructions='agency_manifesto.md',  # shared instructions for all agents
258 | temperature=0.5,  # default temperature for all agents
259 | max_prompt_tokens=25000  # default max tokens in conversation history
260 | )
261 | 
262 | if __name__ == "__main__":
263 |     agency.run_demo()  # starts the agency in terminal
264 | ```
265 | 
266 | #### Communication Flows
267 | 
268 | In Agency Swarm, communication flows are directional, meaning they are established from left to right in the agency_chart definition. For instance, in the example above, the CEO can initiate a chat with the developer (dev), and the developer can respond in this chat. However, the developer cannot initiate a chat with the CEO. The developer can initiate a chat with the virtual assistant (va) and assign new tasks.
269 | 
270 | To allow agents to communicate with each other, simply add them in the second-level list inside the agency chart like this: `[ceo, dev], [ceo, va], [dev, va]`. The agent on the left will be able to communicate with the agent on the right.
271 | 
272 | #### Agency Manifesto
273 | 
274 | The agency manifesto is a file that contains shared instructions for all agents in the agency. It is a markdown file that is located in the agency folder. Please write the manifesto file when creating a new agency. Include the following:
275 | 
276 | - **Agency Description**: A brief description of the agency.
277 | - **Mission Statement**: A concise statement that encapsulates the purpose and guiding principles of the agency.
278 | - **Operating Environment**: A description of the operating environment of the agency.
279 | 
280 | # Notes
281 | 
282 | IMPORTANT: NEVER output code snippets or file contents in the chat. Always create or modify the actual files in the file system. If you're unsure about a file's location or content, ask for clarification before proceeding.
283 | 
284 | When creating or modifying files:
285 | 
286 | 1. Use the appropriate file creation or modification syntax (e.g., ```python:path/to/file.py for Python files).
287 | 2. Write the full content of the file, not just snippets or placeholders.
288 | 3. Ensure all necessary imports and dependencies are included.
289 | 4. Follow the specified file creation order rigorously: 1. tools, 2. agents, 3. agency, 4. requirements.txt.
290 | 
291 | If you find yourself about to output code in the chat, STOP and reconsider your approach. Always prioritize actual file creation and modification over chat explanations.
--------------------------------------------------------------------------------
/AI_DOCS/event_docs.md:
--------------------------------------------------------------------------------
1 | # Realtime API Events
2 | 
3 | 
4 | 
5 | - Session Configuration
6 |   • session.update
7 |     - Configures the connection-wide behavior of the conversation session
8 |     - Typically sent immediately after connecting
9 |     - Can be sent at any point to reconfigure behavior after the current response is complete
10 | 
11 | - Input Audio
12 |   • input_audio_buffer.append
13 |     - Appends audio data to the shared user input buffer
14 |     - Audio not processed until end of speech detected or manual response.create sent
15 |   • input_audio_buffer.clear
16 |     - Clears the current audio input buffer
17 |     - Does not impact responses already in progress
18 |   • input_audio_buffer.commit
19 |     - Commits current state of user input buffer to subscribed conversations
20 |     - Includes it as information for the next response
21 | 
22 | - Item Management (for establishing history or including non-audio item information)
23 |   • conversation.item.create
24 |     - Inserts a new item into the conversation
25 |     - Can be positioned according to previous_item_id
26 |     - Provides new input, tool responses, or historical information
27 |   • conversation.item.delete
28 |     - Removes an item from an existing conversation
29 |   • conversation.item.truncate
30 |     - Manually shortens text and/or audio content in a message
31 |     - Useful for situations with faster-than-realtime model generation
32 | 
33 | - Response Management
34 |   • response.create
35 |     - Initiates model processing of unprocessed conversation input
36 |     - Signifies the end of the caller's logical turn
37 |     - Must be called for text input, tool responses, none mode, etc.
38 |   • response.cancel
39 |     - Cancels an in-progress response
40 | 
41 | - Responses: commands sent by the /realtime endpoint to the caller
42 |   • session.created
43 |     - Sent upon successful connection establishment
44 |     - Provides a connection-specific ID for debugging or logging
45 |   • session.updated
46 |     - Sent in response to a session.update event
47 |     - Reflects changes made to the session configuration
48 | 
49 | - Caller Item Acknowledgement
50 |   • conversation.item.created
51 |     - Acknowledges insertion of a new conversation item
52 |   • conversation.item.deleted
53 |     - Acknowledges removal of an existing conversation item
54 |   • conversation.item.truncated
55 |     - Acknowledges truncation of an existing conversation item
56 | 
57 | - Response Flow
58 |   • response.created
59 |     - Notifies start of a new response for a conversation
60 |     - Snapshots input state and begins generation of new items
61 |   • response.done
62 |     - Notifies completion of response generation
63 |   • rate_limits.updated
64 |     - Sent after response.done
65 |     - Provides current rate limit information
66 | 
67 | - Item Flow in a Response
68 |   • response.output_item.added
69 |     - Notifies creation of a new, server-generated conversation item
70 |   • response.output_item.done
71 |     - Notifies completion of a new conversation item's addition
72 | 
73 | - Content Flow within Response Items
74 |   • response.content_part.added
75 |     - Notifies creation of a new content part within a conversation item
76 |   • response.content_part.done
77 |     - Signals completion of a newly created content part
78 |   • response.audio.delta
79 |     - Provides incremental update to binary audio data
80 |   • response.audio.done
81 |     - Signals completion of audio content part updates
82 |   • response.audio_transcript.delta
83 |     - Provides incremental update to audio transcription
84 |   • response.audio_transcript.done
85 |     - Signals completion of audio transcription updates
86 |   • response.text.delta
87 |     - Provides incremental update to text content
88 |   • response.text.done
89 |     - Signals completion of text content updates
90 |   • response.function_call_arguments.delta
91 |     - Provides incremental update to function call arguments
92 |   • response.function_call_arguments.done
93 |     - Signals completion of function call arguments
94 | 
95 | - User Input Audio
96 |   • input_audio_buffer.speech_started
97 |     - Notifies detection of speech start in input audio buffer
98 |   • input_audio_buffer.speech_stopped
99 |     - Notifies detection of speech end in input audio buffer
100 |   • conversation.item.input_audio_transcription.completed
101 |     - Notifies availability of input audio transcription
102 |   • conversation.item.input_audio_transcription.failed
103 |     - Notifies failure of input audio transcription
104 |   • input_audio_buffer.committed
105 |     - Acknowledges submission of user audio input buffer
106 |   • input_audio_buffer.cleared
107 |     - Acknowledges clearing of pending user audio input buffer
108 | 
109 | - Other
110 |   • error
111 |     - Indicates processing error in the session
112 |     - Includes detailed error message
113 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Realtime API Async Python Assistant
2 | 
3 | This project demonstrates the use of OpenAI's Realtime API to create an AI assistant capable of handling voice input, performing various tasks, and providing audio responses. It showcases the integration of tools, structured output responses, and real-time interaction.
4 | 
5 | ## Features
6 | 
7 | ### Core Functionality
8 | - Real-time voice interaction with an AI assistant
9 | - Asynchronous audio input and output handling
10 | - Custom tool execution based on user requests
11 | 
12 | ### Task Delegation & Communication
13 | - **Synchronous Communication**: Direct, immediate interaction with agents for quick tasks
14 | - **Asynchronous Task Delegation**: Long-running task delegation to agencies/agents
15 |   - Send messages to agency CEOs without waiting for responses
16 |   - Send messages to subordinate agents on behalf of CEOs
17 | - **Task Status Monitoring**: Check completion status and retrieve responses
18 | - Multiple specialized AI agent teams working collaboratively
19 | 
20 | ### Integration Services
21 | - Google Calendar integration for meeting schedule management
22 | - Gmail integration for email handling and drafting
23 | - Browser interaction for web-related tasks
24 | - File system operations (create, update, delete)
25 | 
26 | ## Available Tools
27 | 
28 | ### Agency Communication Tools
29 | - **SendMessage**: Synchronous communication with agencies/agents for quick tasks
30 |   - Direct interaction with immediate response
31 |   - Suitable for simple, fast-completing tasks
32 | 
33 | - **SendMessageAsync**: Asynchronous task delegation
34 |   - Initiates long-running tasks without waiting
35 |   - Returns immediately to allow other operations
36 | 
37 | - **GetResponse**: Task status and response retrieval
38 |   - Checks completion status of async tasks
39 |   - Retrieves agent responses when tasks complete
40 | 
41 | ### Google Workspace Integration
42 | - **FetchDailyMeetingSchedule**: Fetches and formats the user's daily meeting schedule from Google Calendar
43 | - **GetGmailSummary**: Provides a concise summary of unread Gmail messages from the past 48 hours
44 | - **DraftGmail**: Composes email drafts, either as a reply to an email from GetGmailSummary or as a new message
45 | 
46 | ### System Tools
47 | - **GetScreenDescription**: Captures and analyzes the current screen content for the assistant
48 | - **FileOps**:
49 |   - **CreateFile**: Generates new files with user-specified content
50 |   - **UpdateFile**: Modifies existing files with new content
51 |   - **DeleteFile**: Removes specified files from the system
52 | - **OpenBrowser**: Launches a web browser with a given URL
53 | - **GetCurrentDateTime**: Retrieves and reports the current date and time
54 | 
55 | ## Setup
56 | 
57 | ### MacOS Installation
58 | 
59 | 1. Install [Python 3.12](https://www.python.org/downloads/macos/)
60 | 2. Install [uv](https://docs.astral.sh/uv/), a modern Python package manager
61 | 3. Clone this repository to your local machine
62 | 4. Create a local environment file `.env` based on `.env.sample`
63 | 5. Customize `personalization.json` and `config.py` to your preferences
64 | 6. Install the required audio library: `brew install portaudio`
65 | 7. Install project dependencies: `uv sync`
66 | 8. Launch the assistant: `uv run main`
67 | 
68 | ### Google Cloud API Configuration
69 | 
70 | To enable Google Cloud API integration, follow these steps:
71 | 
72 | 1. Create OAuth 2.0 Client IDs in the Google Cloud Console
73 | 2. Place the `credentials.json` file in the project's root directory
74 | 3. Configure `http://localhost:8080/` as an Authorized Redirect URI in your Google Cloud project settings
75 | 4. Set the OAuth consent screen to "Internal" user type
76 | 5. Enable the following APIs and scopes in your Google Cloud project:
77 |    - Gmail API
78 |      - `https://www.googleapis.com/auth/gmail.readonly`
79 |      - `https://www.googleapis.com/auth/gmail.compose`
80 |      - `https://www.googleapis.com/auth/gmail.modify`
81 |    - Google Calendar API
82 |      - `https://www.googleapis.com/auth/calendar.readonly`
83 | 
84 | ## Configuration
85 | 
86 | The project relies on environment variables and a `personalization.json` file for configuration. Ensure you have set up:
87 | 
88 | - `OPENAI_API_KEY`: Your personal OpenAI API key
89 | - `PERSONALIZATION_FILE`: Path to your customized personalization JSON file
90 | - `SCRATCH_PAD_DIR`: Directory for temporary file storage
91 | 
92 | ## Usage
93 | 
94 | After launching the assistant, interact using voice commands. Example interactions:
95 | 
96 | 1. "What do I have on my schedule for today? Tell me only the most important meetings."
97 | 2. "Do I have any important emails?"
98 | 3. "Open ChatGPT in my browser."
99 | 4. "Create a new file named user_data.txt with some example content."
100 | 5. "Update the user_data.txt file by adding more information."
101 | 6. "Delete the user_data.txt file."
102 | 7. "Ask the research team to write a detailed market analysis report."
103 | 8. "Check if the research team has completed the market analysis report."
104 | 
105 | ## Code Structure
106 | 
107 | ### Core Components
108 | 
109 | - `main.py`: Application entry point
110 | - `agencies/`: Agency-Swarm teams of specialized agents
111 | - `tools/`: Standalone tools for various functions
112 | - `config.py`: Configuration settings and environment variable management
113 | - `visual_interface.py`: Visual interface for audio energy visualization
114 | - `websocket_handler.py`: WebSocket event and message processing
115 | 
116 | ### Key Features
117 | 
118 | 1. **Asynchronous WebSocket Communication**:
119 |    Utilizes `websockets` for asynchronous connection with the OpenAI Realtime API
120 | 
121 | 2. **Audio Input/Output Handling**:
122 |    Manages real-time audio capture and playback with PCM16 format support and VAD (Voice Activity Detection)
123 | 
124 | 3. **Function Execution**:
125 |    Standalone tools in `tools/` are invoked by the AI assistant based on user requests
126 | 
127 | 4. **Structured Output Processing**:
128 |    OpenAI's Structured Outputs are used to generate precise, structured responses
129 | 
130 | 5. **Visual Interface**:
131 |    PyGame-based interface provides real-time visualization of audio volume
132 | 
133 | ## Extending Functionality
134 | 
135 | ### Adding Standalone Tools
136 | 
137 | Standalone tools are independent functions not associated with specific agents or agencies.
138 | 
139 | To add a new standalone tool:
140 | 1. Create a new file in the `tools/` directory
141 | 2. Implement the `run` method using async syntax, utilizing `asyncio.to_thread` for blocking operations
142 | 3. Install any necessary dependencies: `uv add <package-name>`
143 | 
144 | ### Adding New Agencies
145 | 
146 | Agencies are Agency-Swarm style teams of specialized agents working together on complex tasks.
147 | 
148 | To add a new agency:
149 | 1. Drag-and-drop your agency folder into the `agencies/` directory
150 | 2. Set `async_mode="threading"` in the agency configuration to enable async messaging (SendMessageAsync and GetResponse)
151 | 3. Install any required dependencies: `uv add <package-name>`
152 | 
153 | ## Additional Resources
154 | 
155 | - [OpenAI Realtime API Documentation](https://platform.openai.com/docs/guides/realtime)
156 | - [OpenAI Structured Outputs Guide](https://platform.openai.com/docs/guides/structured-outputs)
157 | - [WebSockets Library for Python](https://websockets.readthedocs.io/)
--------------------------------------------------------------------------------
/personalization.json:
--------------------------------------------------------------------------------
1 | {
2 |   "browser": "chrome",
3 |   "ai_assistant_name": "Sky",
4 |   "user_name": "VRSEN",
5 |   "assistant_instructions": "You are {ai_assistant_name}, a concise and efficient **voice assistant** for {user_name}.\nKey points:\n1. Provide brief, rapid responses.\n2. Immediately utilize available functions when appropriate, except for destructive actions.\n3. Immediately relay subordinate agent responses. Sometimes it may make sense to wait for the subordinate agent to respond before continuing.\n4. If you find yourself providing a long response, STOP and ask if the user still wants you to continue."
6 | }
7 | 
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | [project]
2 | name = "voice-assistant"
3 | version = "0.1.0"
4 | description = "Agency Swarm Voice Interface"
5 | readme = "README.md"
6 | requires-python = ">=3.12"
7 | dependencies = [
8 |     "agency-swarm==0.4.4",
9 |     "aiohttp>=3.10.10",
10 |     "google-api-python-client>=2.149.0",
11 |     "google-auth-httplib2>=0.2.0",
12 |     "google-auth-oauthlib>=1.2.1",
13 |     "numpy",
14 |     "openai",
15 |     "pillow>=10.4.0",
16 |     "pyaudio",
17 |     "pygame>=2.6.1",
18 |     "python-dotenv>=1.0.1",
19 |     "selenium-stealth>=1.0.6",
20 |     "selenium>=4.25.0",
21 |     "webdriver-manager>=4.0.2",
22 |     "websockets",
23 | ]
24 | 
25 | [build-system]
26 | requires = ["hatchling"]
27 | build-backend = "hatchling.build"
28 | 
29 | [project.scripts]
30 | main = "voice_assistant.main:main"
--------------------------------------------------------------------------------
/src/voice_assistant/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/VRSEN/agency-voice-interface/2d9d39ce02d9cb9628e8de79b3543fe05885ad42/src/voice_assistant/__init__.py
--------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/AnalystAgent/AnalystAgent.py:
--------------------------------------------------------------------------------
1 | from agency_swarm import Agent
2 | from agency_swarm.tools import CodeInterpreter, FileSearch
3 | 
4 | 
5 | class AnalystAgent(Agent):
6 |     def __init__(self):
7 |         super().__init__(
8 |             name="AnalystAgent",
9 |             description="Analyzes data, generates insights, and performs complex calculations using code interpreter and file search capabilities.",
10 |             instructions="./instructions.md",
11 |             tools=[CodeInterpreter, FileSearch],
12 |             temperature=0.0,
13 |             max_prompt_tokens=25000,
14 |         )
15 | 
--------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/AnalystAgent/instructions.md:
--------------------------------------------------------------------------------
1 | # Agent Role
2 | 
3 | As an Analyst Agent, your role is to analyze data, generate insights, and perform complex calculations to support the research process. You have access to both code execution and file search capabilities to enhance your analysis.
4 | 
5 | # Goals
6 | 
7 | 1. Analyze data and generate meaningful insights
8 | 2. Perform complex calculations and data manipulations
9 | 3. Search through relevant files and documentation for context
10 | 4. Support other agents with data-driven decision making
11 | 5. Create visualizations and reports when needed
12 | 
13 | # Process Workflow
14 | 
15 | 1. When receiving a task, first assess if you need to:
16 |    - Search existing files for context (using FileSearch)
17 |    - Execute code for analysis (using CodeInterpreter)
18 |    - Both of the above
19 | 
20 | 2. If searching files:
21 |    - Use FileSearch to locate relevant documentation or data
22 |    - Extract and summarize key information
23 |    - Consider how this information affects the analysis
24 | 
25 | 3. If performing analysis:
26 |    - Use CodeInterpreter to write and execute analytical code
27 |    - Ensure code is well-documented and efficient
28 |    - Generate visualizations when they would aid understanding
29 |    - Validate results before sharing
30 | 
31 | 4. When collaborating with other agents:
32 |    - Provide clear explanations of your findings
33 |    - Include relevant code snippets or file references
34 |    - Make specific recommendations based on your analysis
35 | 
36 | 5. Always:
37 |    - Document your methodology
38 |    - Explain your reasoning
39 |    - Highlight any assumptions or limitations
40 |    - Suggest next steps or areas for further investigation
41 | 
--------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/BrowsingAgent.py:
--------------------------------------------------------------------------------
1 | import base64
2 | import re
3 | 
4 | from agency_swarm.agents import Agent
5 | from typing_extensions import override
6 | 
7 | 
8 | class BrowsingAgent(Agent):
9 |     SCREENSHOT_FILE_NAME = "screenshot.jpg"
10 | 
11 |     def __init__(self, selenium_config=None, **kwargs):
12 |         from .tools.util.selenium import set_selenium_config
13 | 
14 |         super().__init__(
15 |             name="BrowsingAgent",
16 |             description="This agent is designed to navigate and search the web effectively.",
17 |             instructions="./instructions.md",
18 |             files_folder="./files",
19 |             schemas_folder="./schemas",
20 |             tools=[],
21 |             tools_folder="./tools",
22 |             temperature=0,
23 |             max_prompt_tokens=16000,
24 |             model="gpt-4o",
25 |             validation_attempts=25,
26 |             **kwargs,
27 |         )
28 |         if selenium_config is not None:
29 |             set_selenium_config(selenium_config)
30 | 
31 |         self.prev_message = ""
32 | 
33 |     @override
34 |     def response_validator(self, message):
35 |         from selenium.webdriver.common.by import By
36 |         from selenium.webdriver.support.select import Select
37 | 
38 |         from .tools.util import (
39 |             highlight_elements_with_labels,
40 |             remove_highlight_and_labels,
41 |         )
42 |         from .tools.util.selenium import get_web_driver, set_web_driver
43 | 
44 |         # Filter out everything in square brackets
45 |         filtered_message = re.sub(r"\[.*?\]", "", message).strip()
46 | 
47 |         if filtered_message and self.prev_message == filtered_message:
48 |             raise ValueError(
49 |                 "Do not repeat yourself. If you are stuck, try a different approach or search Google directly for the page you are looking for."
50 | ) 51 | 52 | self.prev_message = filtered_message 53 | 54 | if "[send screenshot]" in message.lower(): 55 | wd = get_web_driver() 56 | remove_highlight_and_labels(wd) 57 | self.take_screenshot() 58 | response_text = "Here is the screenshot of the current web page:" 59 | 60 | elif "[highlight clickable elements]" in message.lower(): 61 | wd = get_web_driver() 62 | highlight_elements_with_labels( 63 | wd, 64 | 'a, button, div[onclick], div[role="button"], div[tabindex], ' 65 | 'span[onclick], span[role="button"], span[tabindex]', 66 | ) 67 | self._shared_state.set( 68 | "elements_highlighted", 69 | 'a, button, div[onclick], div[role="button"], div[tabindex], ' 70 | 'span[onclick], span[role="button"], span[tabindex]', 71 | ) 72 | 73 | self.take_screenshot() 74 | 75 | all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element") 76 | 77 | all_element_texts = [element.text for element in all_elements] 78 | 79 | element_texts_json = {} 80 | for i, element_text in enumerate(all_element_texts): 81 | element_texts_json[str(i + 1)] = self.remove_unicode(element_text) 82 | 83 | element_texts_json = {k: v for k, v in element_texts_json.items() if v} 84 | 85 | element_texts_formatted = ", ".join( 86 | [f"{k}: {v}" for k, v in element_texts_json.items()] 87 | ) 88 | 89 | response_text = ( 90 | "Here is the screenshot of the current web page with highlighted clickable elements. \n\n" 91 | "Texts of the elements are: " + element_texts_formatted + ".\n\n" 92 | "Elements without text are not shown, but are available on screenshot. \n" 93 | "Please make sure to analyze the screenshot to find the clickable element you need to click on." 94 | ) 95 | 96 | elif "[highlight text fields]" in message.lower(): 97 | wd = get_web_driver() 98 | highlight_elements_with_labels(wd, "input, textarea") 99 | self._shared_state.set("elements_highlighted", "input, textarea") 100 | 101 | self.take_screenshot() 102 | 103 | all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element") 104 | 105 | all_element_texts = [element.text for element in all_elements] 106 | 107 | element_texts_json = {} 108 | for i, element_text in enumerate(all_element_texts): 109 | element_texts_json[str(i + 1)] = self.remove_unicode(element_text) 110 | 111 | element_texts_formatted = ", ".join( 112 | [f"{k}: {v}" for k, v in element_texts_json.items()] 113 | ) 114 | 115 | response_text = ( 116 | "Here is the screenshot of the current web page with highlighted text fields: \n" 117 | "Texts of the elements are: " + element_texts_formatted + ".\n" 118 | "Please make sure to analyze the screenshot to find the text field you need to fill." 
119 |             )
120 | 
121 |         elif "[highlight dropdowns]" in message.lower():
122 |             wd = get_web_driver()
123 |             highlight_elements_with_labels(wd, "select")
124 |             self._shared_state.set("elements_highlighted", "select")
125 | 
126 |             self.take_screenshot()
127 | 
128 |             all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element")
129 | 
130 |             all_selector_values = {}
131 | 
132 |             # Number each dropdown sequentially so the keys match the on-screen labels
133 |             for i, element in enumerate(all_elements):
134 |                 select = Select(element)
135 |                 options = select.options
136 |                 selector_values = {}
137 |                 for j, option in enumerate(options):
138 |                     selector_values[str(j)] = option.text
139 |                     if j > 10:
140 |                         break
141 |                 all_selector_values[str(i + 1)] = selector_values
142 | 
143 |             all_selector_values = {k: v for k, v in all_selector_values.items() if v}
144 |             all_selector_values_formatted = ", ".join(
145 |                 [f"{k}: {v}" for k, v in all_selector_values.items()]
146 |             )
147 | 
148 |             response_text = (
149 |                 "Here is the screenshot with highlighted dropdowns. \n"
150 |                 "Selector values are: " + all_selector_values_formatted + ".\n"
151 |                 "Please make sure to analyze the screenshot to find the dropdown you need to select."
152 |             )
153 | 
154 |         else:
155 |             return message
156 | 
157 |         set_web_driver(wd)
158 |         content = self.create_response_content(response_text)
159 |         raise ValueError(content)
160 | 
161 |     def take_screenshot(self):
162 |         from .tools.util import get_b64_screenshot
163 |         from .tools.util.selenium import get_web_driver
164 | 
165 |         wd = get_web_driver()
166 |         screenshot = get_b64_screenshot(wd)
167 |         screenshot_data = base64.b64decode(screenshot)
168 |         with open(self.SCREENSHOT_FILE_NAME, "wb") as screenshot_file:
169 |             screenshot_file.write(screenshot_data)
170 | 
171 |     def create_response_content(self, response_text):
172 |         with open(self.SCREENSHOT_FILE_NAME, "rb") as file:
173 |             file_id = self.client.files.create(
174 |                 file=file,
175 |                 purpose="vision",
176 |             ).id
177 | 
178 |         content = [
179 |             {"type": "text", "text": response_text},
180 |             {"type": "image_file", "image_file": {"file_id": file_id}},
181 |         ]
182 |         return content
183 | 
184 |     # Strip all non-ASCII characters from the element text
185 |     def remove_unicode(self, data):
186 |         return re.sub(r"[^\x00-\x7F]+", "", data)
187 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/instructions.md:
--------------------------------------------------------------------------------
1 | # Browsing Agent Instructions
2 | 
3 | As an advanced browsing agent, you are equipped with specialized tools to navigate and search the web effectively. Your primary objective is to fulfill the user's requests by efficiently utilizing these tools.
4 | 
5 | ### Primary Instructions:
6 | 
7 | 1. **Avoid Guessing URLs**: Never attempt to guess the direct URL. Always perform a Google search if applicable, or return to your previous search results.
8 | 2. **Navigating to New Pages**: Always use the `ClickElement` tool to open links when navigating to a new web page from the current source. Do not guess the direct URL.
9 | 3. **Single Page Interaction**: You can only open and interact with one web page at a time. The previous web page will be closed when you open a new one. To navigate back, use the `GoBack` tool.
10 | 4. **Requesting Screenshots**: Before using tools that interact with the web page, ask the user to send you the appropriate screenshot using one of the commands below.
11 | 
12 | ### Commands to Request Screenshots:
13 | 
14 | - **'[send screenshot]'**: Sends the current browsing window as an image. Use this command if the user asks what is on the page.
15 | - **'[highlight clickable elements]'**: Highlights all clickable elements on the current web page. This must be done before using the `ClickElement` tool.
16 | - **'[highlight text fields]'**: Highlights all text fields on the current web page. This must be done before using the `SendKeys` tool.
17 | - **'[highlight dropdowns]'**: Highlights all dropdowns on the current web page. This must be done before using the `SelectDropdown` tool.
18 | 
19 | ### Important Reminders:
20 | 
21 | - Only open and interact with one web page at a time. Do not attempt to read or click on multiple links simultaneously. Complete your interactions with the current web page before proceeding to a different source.
22 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/requirements.txt:
--------------------------------------------------------------------------------
1 | selenium
2 | webdriver-manager
3 | selenium_stealth
4 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/ClickElement.py:
--------------------------------------------------------------------------------
1 | import time
2 | 
3 | from agency_swarm.tools import BaseTool
4 | from pydantic import Field
5 | from selenium.webdriver.common.by import By
6 | 
7 | from .util import get_web_driver, set_web_driver
8 | from .util.highlights import remove_highlight_and_labels
9 | 
10 | 
11 | class ClickElement(BaseTool):
12 |     """
13 |     This tool clicks on an element on the current web page based on its number.
14 | 
15 |     Before using this tool make sure to highlight clickable elements on the page by outputting '[highlight clickable elements]' message.
16 |     """
17 | 
18 |     element_number: int = Field(
19 |         ...,
20 |         description="The number of the element to click on. The element numbers are displayed on the page after highlighting elements.",
21 |     )
22 | 
23 |     def run(self):
24 |         wd = get_web_driver()
25 | 
26 |         if "button" not in self._shared_state.get("elements_highlighted", ""):
27 |             raise ValueError(
28 |                 "Please highlight clickable elements on the page first by outputting '[highlight clickable elements]' message. You must output just the message without calling the tool first, so the user can respond with the screenshot."
29 |             )
30 | 
31 |         all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element")
32 | 
33 |         # Look up the element by its on-screen label; labels start at 1, list indices at 0
34 |         try:
35 |             element_text = all_elements[self.element_number - 1].text
36 |             element_text = element_text.strip() if element_text else ""
37 |             # Try a native click first; fall back to a JavaScript click if it is intercepted
38 |             try:
39 |                 all_elements[self.element_number - 1].click()
40 |             except Exception as e:
41 |                 if "element click intercepted" in str(e).lower():
42 |                     wd.execute_script(
43 |                         "arguments[0].click();", all_elements[self.element_number - 1]
44 |                     )
45 |                 else:
46 |                     raise e
47 | 
48 |             time.sleep(3)
49 | 
50 |             result = f"Clicked on element {self.element_number}. Text on clicked element: '{element_text}'. Current URL is {wd.current_url}. To further analyze the page, output '[send screenshot]' command."
51 |         except IndexError:
52 |             result = "Element number is invalid. Please try again with a valid element number."
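        # Any other Selenium error (e.g. a stale element) is returned to the agent
        # as plain text below, so it can adjust its next action instead of crashing.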
53 |         except Exception as e:
54 |             result = str(e)
55 | 
56 |         wd = remove_highlight_and_labels(wd)
57 | 
58 |         wd.execute_script("document.body.style.zoom='1.5'")
59 | 
60 |         set_web_driver(wd)
61 | 
62 |         self._shared_state.set("elements_highlighted", "")
63 | 
64 |         return result
65 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/ExportFile.py:
--------------------------------------------------------------------------------
1 | import base64
2 | import os
3 | 
4 | from agency_swarm.tools import BaseTool
5 | 
6 | from .util import get_web_driver
7 | 
8 | 
9 | class ExportFile(BaseTool):
10 |     """This tool converts the current full web page into a file and returns its file_id. You can then send this file id back to the user for further processing."""
11 | 
12 |     def run(self):
13 |         wd = get_web_driver()
14 |         from agency_swarm import get_openai_client
15 | 
16 |         client = get_openai_client()
17 | 
18 |         # Define the parameters for the PDF
19 |         params = {
20 |             "landscape": False,
21 |             "displayHeaderFooter": False,
22 |             "printBackground": True,
23 |             "preferCSSPageSize": True,
24 |         }
25 | 
26 |         # Execute the command to print to PDF
27 |         result = wd.execute_cdp_cmd("Page.printToPDF", params)
28 |         pdf = result["data"]
29 | 
30 |         pdf_bytes = base64.b64decode(pdf)
31 | 
32 |         # Save the PDF to a file
33 |         with open("exported_file.pdf", "wb") as f:
34 |             f.write(pdf_bytes)
35 | 
36 |         with open("exported_file.pdf", "rb") as pdf_file:
37 |             file_id = client.files.create(
38 |                 file=pdf_file, purpose="assistants"
39 |             ).id
40 | 
41 |         self._shared_state.set("file_id", file_id)
42 | 
43 |         return (
44 |             "Success. File exported with id: `"
45 |             + file_id
46 |             + "` You can now send this file id back to the user."
47 |         )
48 | 
49 | 
50 | if __name__ == "__main__":
51 |     wd = get_web_driver()
52 |     wd.get("https://www.google.com")
53 |     tool = ExportFile()
54 |     tool.run()
55 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/GoBack.py:
--------------------------------------------------------------------------------
1 | import time
2 | 
3 | from agency_swarm.tools import BaseTool
4 | 
5 | from .util.selenium import get_web_driver, set_web_driver
6 | 
7 | 
8 | class GoBack(BaseTool):
9 |     """
10 |     This tool allows you to go back 1 page in the browser history. Use it in case of a mistake or if a page shows you unexpected content.
11 |     """
12 | 
13 |     def run(self):
14 |         wd = get_web_driver()
15 | 
16 |         wd.back()
17 | 
18 |         time.sleep(3)
19 | 
20 |         set_web_driver(wd)
21 | 
22 |         return "Success. Went back 1 page. Current URL is: " + wd.current_url
23 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/ReadURL.py:
--------------------------------------------------------------------------------
1 | import time
2 | 
3 | from agency_swarm.tools import BaseTool
4 | from pydantic import Field
5 | 
6 | from .util.selenium import get_web_driver, set_web_driver
7 | 
8 | 
9 | class ReadURL(BaseTool):
10 |     """
11 |     This tool reads a single URL and opens it in your current browser window. For each new source, either navigate directly to a URL that you believe contains the answer to the user's question or perform a Google search (e.g., 'https://google.com/search?q=search') if necessary.
12 | 
13 |     If you are unsure of the direct URL, do not guess.
Instead, use the ClickElement tool to click on links that might contain the desired information on the current web page. 14 | 15 | Note: This tool only supports opening one URL at a time. The previous URL will be closed when you open a new one. 16 | """ 17 | 18 | chain_of_thought: str = Field( 19 | ..., 20 | description="Think step-by-step about where you need to navigate next to find the necessary information.", 21 | exclude=True, 22 | ) 23 | url: str = Field( 24 | ..., 25 | description="URL of the webpage.", 26 | examples=["https://google.com/search?q=search"], 27 | ) 28 | 29 | class ToolConfig: 30 | one_call_at_a_time: bool = True 31 | 32 | def run(self): 33 | wd = get_web_driver() 34 | 35 | wd.get(self.url) 36 | 37 | time.sleep(2) 38 | 39 | set_web_driver(wd) 40 | 41 | self._shared_state.set("elements_highlighted", "") 42 | 43 | return ( 44 | "Current URL is: " 45 | + wd.current_url 46 | + "\n" 47 | + "Please output '[send screenshot]' next to analyze the current web page or '[highlight clickable elements]' for further navigation." 48 | ) 49 | 50 | 51 | if __name__ == "__main__": 52 | tool = ReadURL(url="https://google.com") 53 | print(tool.run()) 54 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/Scroll.py: -------------------------------------------------------------------------------- 1 | from typing import Literal 2 | 3 | from agency_swarm.tools import BaseTool 4 | from pydantic import Field 5 | 6 | from .util.selenium import get_web_driver, set_web_driver 7 | 8 | 9 | class Scroll(BaseTool): 10 | """ 11 | This tool allows you to scroll the current web page up or down by 1 screen height. 12 | """ 13 | 14 | direction: Literal["up", "down"] = Field(..., description="Direction to scroll.") 15 | 16 | def run(self): 17 | wd = get_web_driver() 18 | 19 | height = wd.get_window_size()["height"] 20 | 21 | # Get the zoom level 22 | zoom_level = wd.execute_script("return document.body.style.zoom || '1';") 23 | zoom_level = ( 24 | float(zoom_level.strip("%")) / 100 25 | if "%" in zoom_level 26 | else float(zoom_level) 27 | ) 28 | 29 | # Adjust height by zoom level 30 | adjusted_height = height / zoom_level 31 | 32 | current_scroll_position = wd.execute_script("return window.pageYOffset;") 33 | total_scroll_height = wd.execute_script("return document.body.scrollHeight;") 34 | 35 | result = "" 36 | 37 | if self.direction == "up": 38 | if current_scroll_position == 0: 39 | # Reached the top of the page 40 | result = "Reached the top of the page. Cannot scroll up any further.\n" 41 | else: 42 | wd.execute_script(f"window.scrollBy(0, -{adjusted_height});") 43 | result = "Scrolled up by 1 screen height. Make sure to output '[send screenshot]' command to analyze the page after scrolling." 44 | 45 | elif self.direction == "down": 46 | if current_scroll_position + adjusted_height >= total_scroll_height: 47 | # Reached the bottom of the page 48 | result = ( 49 | "Reached the bottom of the page. Cannot scroll down any further.\n" 50 | ) 51 | else: 52 | wd.execute_script(f"window.scrollBy(0, {adjusted_height});") 53 | result = "Scrolled down by 1 screen height. Make sure to output '[send screenshot]' command to analyze the page after scrolling." 
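                # The scroll distance uses the zoom-adjusted height because window.scrollBy
                # works in CSS pixels: the page zoom applied elsewhere (set_web_driver sets
                # 1.2, ClickElement sets 1.5) rescales what one "screen" of content means.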
54 | 
55 |         set_web_driver(wd)
56 | 
57 |         return result
58 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/SelectDropdown.py:
--------------------------------------------------------------------------------
1 | from typing import Dict
2 | 
3 | from agency_swarm.tools import BaseTool
4 | from pydantic import Field, model_validator
5 | from selenium.webdriver.common.by import By
6 | from selenium.webdriver.support.select import Select
7 | 
8 | from .util import get_web_driver, set_web_driver
9 | from .util.highlights import remove_highlight_and_labels
10 | 
11 | 
12 | class SelectDropdown(BaseTool):
13 |     """
14 |     This tool selects an option in a dropdown on the current web page based on its sequence number and the index of the option to select.
15 | 
16 |     Before using this tool make sure to highlight dropdown elements on the page by outputting '[highlight dropdowns]' message.
17 |     """
18 | 
19 |     key_value_pairs: Dict[str, str] = Field(
20 |         ...,
21 |         description="A dictionary where the key is the sequence number of the dropdown element and the value is the index of the option to select.",
22 |         examples=[{"1": "0", "2": "1"}, {"3": "2"}],
23 |     )
24 | 
25 |     @model_validator(mode="before")
26 |     @classmethod
27 |     def check_key_value_pairs(cls, data):
28 |         if not data.get("key_value_pairs"):
29 |             raise ValueError(
30 |                 "key_value_pairs is required. Example format: "
31 |                 "key_value_pairs={'1': '0', '2': '1'}"
32 |             )
33 |         return data
34 | 
35 |     def run(self):
36 |         wd = get_web_driver()
37 | 
38 |         if "select" not in self._shared_state.get("elements_highlighted", ""):
39 |             raise ValueError(
40 |                 "Please highlight dropdown elements on the page first by outputting '[highlight dropdowns]' message. You must output just the message without calling the tool first, so the user can respond with the screenshot."
41 |             )
42 | 
43 |         all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element")
44 | 
45 |         try:
46 |             for key, value in self.key_value_pairs.items():
47 |                 key = int(key)
48 |                 element = all_elements[key - 1]
49 | 
50 |                 select = Select(element)
51 | 
52 |                 # Select the option at the requested index for this dropdown
53 |                 select.select_by_index(int(value))
54 |                 result = "Success. Option is selected in the dropdown. To further analyze the page, output '[send screenshot]' command."
55 |         except Exception as e:
56 |             result = str(e)
57 | 
58 |         remove_highlight_and_labels(wd)
59 | 
60 |         set_web_driver(wd)
61 | 
62 |         return result
63 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/SendKeys.py:
--------------------------------------------------------------------------------
1 | import time
2 | from typing import Dict
3 | 
4 | from agency_swarm.tools import BaseTool
5 | from pydantic import Field, model_validator
6 | from selenium.webdriver import Keys
7 | from selenium.webdriver.common.by import By
8 | 
9 | from .util import get_web_driver, set_web_driver
10 | from .util.highlights import remove_highlight_and_labels
11 | 
12 | 
13 | class SendKeys(BaseTool):
14 |     """
15 |     This tool sends keys into input fields on the current webpage based on the description of that element and what needs to be typed. It then clicks "Enter" on the last element to submit the form. You do not need to tell it to press "Enter"; it will do that automatically.
16 | 17 | Before using this tool make sure to highlight the input elements on the page by outputting '[highlight text fields]' message. 18 | """ 19 | 20 | elements_and_texts: Dict[int, str] = Field( 21 | ..., 22 | description="A dictionary where the key is the element number and the value is the text to be typed.", 23 | examples=[ 24 | {52: "johndoe@gmail.com", 53: "password123"}, 25 | {3: "John Doe", 4: "123 Main St"}, 26 | ], 27 | ) 28 | 29 | @model_validator(mode="before") 30 | @classmethod 31 | def check_elements_and_texts(cls, data): 32 | if not data.get("elements_and_texts"): 33 | raise ValueError( 34 | "elements_and_texts is required. Example format: " 35 | "elements_and_texts={1: 'John Doe', 2: '123 Main St'}" 36 | ) 37 | return data 38 | 39 | def run(self): 40 | wd = get_web_driver() 41 | if "input" not in self._shared_state.get("elements_highlighted", ""): 42 | raise ValueError( 43 | "Please highlight input elements on the page first by outputting '[highlight text fields]' message. You must output just the message without calling the tool first, so the user can respond with the screenshot." 44 | ) 45 | 46 | all_elements = wd.find_elements(By.CSS_SELECTOR, ".highlighted-element") 47 | 48 | i = 0 49 | try: 50 | for key, value in self.elements_and_texts.items(): 51 | key = int(key) 52 | element = all_elements[key - 1] 53 | 54 | try: 55 | element.click() 56 | element.send_keys(Keys.CONTROL + "a") # Select all text in input 57 | element.send_keys(Keys.DELETE) 58 | element.clear() 59 | except Exception as e: 60 | pass 61 | element.send_keys(value) 62 | # send enter key to the last element 63 | if i == len(self.elements_and_texts) - 1: 64 | element.send_keys(Keys.RETURN) 65 | time.sleep(3) 66 | i += 1 67 | result = f"Sent input to element and pressed Enter. Current URL is {wd.current_url} To further analyze the page, output '[send screenshot]' command." 68 | except Exception as e: 69 | result = str(e) 70 | 71 | remove_highlight_and_labels(wd) 72 | 73 | set_web_driver(wd) 74 | 75 | return result 76 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/SolveCaptcha.py: -------------------------------------------------------------------------------- 1 | import base64 2 | import time 3 | 4 | from agency_swarm.tools import BaseTool 5 | from agency_swarm.util import get_openai_client 6 | from selenium.webdriver.common.by import By 7 | from selenium.webdriver.support.expected_conditions import ( 8 | frame_to_be_available_and_switch_to_it, 9 | presence_of_element_located, 10 | ) 11 | from selenium.webdriver.support.wait import WebDriverWait 12 | 13 | from .util import get_b64_screenshot, remove_highlight_and_labels 14 | from .util.selenium import get_web_driver 15 | 16 | 17 | class SolveCaptcha(BaseTool): 18 | """ 19 | This tool asks a human to solve captcha on the current webpage. Make sure that captcha is visible before running it. 
20 | """ 21 | 22 | def run(self): 23 | wd = get_web_driver() 24 | 25 | try: 26 | WebDriverWait(wd, 10).until( 27 | frame_to_be_available_and_switch_to_it( 28 | (By.XPATH, "//iframe[@title='reCAPTCHA']") 29 | ) 30 | ) 31 | 32 | element = WebDriverWait(wd, 3).until( 33 | presence_of_element_located((By.ID, "recaptcha-anchor")) 34 | ) 35 | except Exception as e: 36 | return "Could not find captcha checkbox" 37 | 38 | try: 39 | # Scroll the element into view 40 | wd.execute_script("arguments[0].scrollIntoView(true);", element) 41 | time.sleep(1) # Give some time for the scrolling to complete 42 | 43 | # Click the element using JavaScript 44 | wd.execute_script("arguments[0].click();", element) 45 | except Exception as e: 46 | return f"Could not click captcha checkbox: {str(e)}" 47 | 48 | try: 49 | # Now check if the reCAPTCHA is checked 50 | WebDriverWait(wd, 3).until( 51 | lambda d: d.find_element( 52 | By.CLASS_NAME, "recaptcha-checkbox" 53 | ).get_attribute("aria-checked") 54 | == "true" 55 | ) 56 | 57 | return "Success" 58 | except Exception as e: 59 | pass 60 | 61 | wd.switch_to.default_content() 62 | 63 | client = get_openai_client() 64 | 65 | WebDriverWait(wd, 10).until( 66 | frame_to_be_available_and_switch_to_it( 67 | ( 68 | By.XPATH, 69 | "//iframe[@title='recaptcha challenge expires in two minutes']", 70 | ) 71 | ) 72 | ) 73 | 74 | time.sleep(2) 75 | 76 | attempts = 0 77 | while attempts < 5: 78 | tiles = wd.find_elements(By.CLASS_NAME, "rc-imageselect-tile") 79 | 80 | # filter out tiles with rc-imageselect-dynamic-selected class 81 | tiles = [ 82 | tile 83 | for tile in tiles 84 | if not tile.get_attribute("class").endswith( 85 | "rc-imageselect-dynamic-selected" 86 | ) 87 | ] 88 | 89 | image_content = [] 90 | i = 0 91 | for tile in tiles: 92 | i += 1 93 | screenshot = get_b64_screenshot(wd, tile) 94 | 95 | image_content.append( 96 | { 97 | "type": "text", 98 | "text": f"Image {i}:", 99 | } 100 | ) 101 | image_content.append( 102 | { 103 | "type": "image_url", 104 | "image_url": { 105 | "url": f"data:image/jpeg;base64,{screenshot}", 106 | "detail": "high", 107 | }, 108 | }, 109 | ) 110 | # highlight all titles with rc-imageselect-tile class but not with rc-imageselect-dynamic-selected 111 | # wd = highlight_elements_with_labels(wd, 'td.rc-imageselect-tile:not(.rc-imageselect-dynamic-selected)') 112 | 113 | # screenshot = get_b64_screenshot(wd, wd.find_element(By.ID, "rc-imageselect")) 114 | 115 | task_text = ( 116 | wd.find_element(By.CLASS_NAME, "rc-imageselect-instructions") 117 | .text.strip() 118 | .replace("\n", " ") 119 | ) 120 | 121 | continuous_task = "once there are none left" in task_text.lower() 122 | 123 | task_text = task_text.replace("Click verify", "Output 0") 124 | task_text = task_text.replace("click skip", "Output 0") 125 | task_text = task_text.replace("once", "if") 126 | task_text = task_text.replace("none left", "none") 127 | task_text = task_text.replace("all", "only") 128 | task_text = task_text.replace("squares", "images") 129 | 130 | additional_info = "" 131 | if len(tiles) > 9: 132 | additional_info = ( 133 | "Keep in mind that all images are a part of a bigger image " 134 | "from left to right, and top to bottom. The grid is 4x4. " 135 | ) 136 | 137 | messages = [ 138 | { 139 | "role": "system", 140 | "content": f"""You are an advanced AI designed to support users with visual impairments. 141 | User will provide you with {i} images numbered from 1 to {i}. 
Your task is to output 142 | the numbers of the images that contain the requested object, or at least some part of the requested 143 | object. {additional_info}If there are no individual images that satisfy this condition, output 0. 144 | """.replace("\n", ""), 145 | }, 146 | { 147 | "role": "user", 148 | "content": [ 149 | *image_content, 150 | { 151 | "type": "text", 152 | "text": f"{task_text}. Only output numbers separated by commas and nothing else. " 153 | f"Output 0 if there are none.", 154 | }, 155 | ], 156 | }, 157 | ] 158 | 159 | response = client.chat.completions.create( 160 | model="gpt-4o", 161 | messages=messages, 162 | max_tokens=1024, 163 | temperature=0.0, 164 | ) 165 | 166 | message = response.choices[0].message 167 | message_text = message.content 168 | 169 | # check if 0 is in the message 170 | if "0" in message_text and "10" not in message_text: 171 | # Find the button by its ID 172 | verify_button = wd.find_element(By.ID, "recaptcha-verify-button") 173 | 174 | verify_button_text = verify_button.text 175 | 176 | # Click the button 177 | wd.execute_script("arguments[0].click();", verify_button) 178 | 179 | time.sleep(1) 180 | 181 | try: 182 | if self.verify_checkbox(wd): 183 | return "Success. Captcha solved." 184 | except Exception as e: 185 | print("Not checked") 186 | pass 187 | 188 | else: 189 | numbers = [ 190 | int(s.strip()) 191 | for s in message_text.split(",") 192 | if s.strip().isdigit() 193 | ] 194 | 195 | # Click the tiles based on the provided numbers 196 | for number in numbers: 197 | wd.execute_script("arguments[0].click();", tiles[number - 1]) 198 | time.sleep(0.5) 199 | 200 | time.sleep(3) 201 | 202 | if not continuous_task: 203 | # Find the button by its ID 204 | verify_button = wd.find_element(By.ID, "recaptcha-verify-button") 205 | 206 | verify_button_text = verify_button.text 207 | 208 | # Click the button 209 | wd.execute_script("arguments[0].click();", verify_button) 210 | 211 | try: 212 | if self.verify_checkbox(wd): 213 | return "Success. Captcha solved." 214 | except Exception as e: 215 | pass 216 | else: 217 | continue 218 | 219 | if "verify" in verify_button_text.lower(): 220 | attempts += 1 221 | 222 | wd = remove_highlight_and_labels(wd) 223 | 224 | wd.switch_to.default_content() 225 | 226 | # close captcha 227 | try: 228 | element = WebDriverWait(wd, 3).until( 229 | presence_of_element_located((By.XPATH, "//iframe[@title='reCAPTCHA']")) 230 | ) 231 | 232 | wd.execute_script( 233 | f"document.elementFromPoint({element.location['x']}, {element.location['y']-10}).click();" 234 | ) 235 | except Exception as e: 236 | print(e) 237 | pass 238 | 239 | return "Could not solve captcha." 
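    # Helper for the solving loop above: re-enters the reCAPTCHA anchor iframe and
    # reports whether the checkbox is now checked; if not, it switches back into the
    # challenge iframe so another round of tile selection can run.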
240 | 
241 |     def verify_checkbox(self, wd):
242 |         wd.switch_to.default_content()
243 | 
244 |         try:
245 |             WebDriverWait(wd, 10).until(
246 |                 frame_to_be_available_and_switch_to_it(
247 |                     (By.XPATH, "//iframe[@title='reCAPTCHA']")
248 |                 )
249 |             )
250 | 
251 |             WebDriverWait(wd, 5).until(
252 |                 lambda d: d.find_element(
253 |                     By.CLASS_NAME, "recaptcha-checkbox"
254 |                 ).get_attribute("aria-checked")
255 |                 == "true"
256 |             )
257 | 
258 |             return True
259 |         except Exception:
260 |             wd.switch_to.default_content()
261 | 
262 |             WebDriverWait(wd, 10).until(
263 |                 frame_to_be_available_and_switch_to_it(
264 |                     (
265 |                         By.XPATH,
266 |                         "//iframe[@title='recaptcha challenge expires in two minutes']",
267 |                     )
268 |                 )
269 |             )
270 | 
271 |             return False
272 | --------------------------------------------------------------------------------
/src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/WebPageSummarizer.py:
--------------------------------------------------------------------------------
1 | from agency_swarm.tools import BaseTool
2 | from selenium.webdriver.common.by import By
3 | 
4 | from .util import get_web_driver, set_web_driver
5 | 
6 | 
7 | class WebPageSummarizer(BaseTool):
8 |     """
9 |     This tool summarizes the content of the current web page, extracting the main points and providing a concise summary.
10 |     """
11 | 
12 |     def run(self):
13 |         from agency_swarm import get_openai_client
14 | 
15 |         wd = get_web_driver()
16 |         client = get_openai_client()
17 | 
18 |         content = wd.find_element(By.TAG_NAME, "body").text
19 | 
20 |         # Keep only the first 10,000 whitespace-separated words to bound the prompt size
21 |         content = " ".join(content.split()[:10000])
22 | 
23 |         completion = client.chat.completions.create(
24 |             model="gpt-3.5-turbo",
25 |             messages=[
26 |                 {
27 |                     "role": "system",
28 |                     "content": "Your task is to summarize the content of the provided webpage. 
The summary should be concise and informative, capturing the main points and takeaways of the page.", 29 | }, 30 | { 31 | "role": "user", 32 | "content": "Summarize the content of the following webpage:\n\n" 33 | + content, 34 | }, 35 | ], 36 | temperature=0.0, 37 | ) 38 | 39 | return completion.choices[0].message.content 40 | 41 | 42 | if __name__ == "__main__": 43 | wd = get_web_driver() 44 | wd.get("https://en.wikipedia.org/wiki/Python_(programming_language)") 45 | set_web_driver(wd) 46 | tool = WebPageSummarizer() 47 | print(tool.run()) 48 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/__init__.py: -------------------------------------------------------------------------------- 1 | from .ClickElement import ClickElement 2 | from .ExportFile import ExportFile 3 | from .GoBack import GoBack 4 | from .ReadURL import ReadURL 5 | from .Scroll import Scroll 6 | from .SelectDropdown import SelectDropdown 7 | from .SendKeys import SendKeys 8 | from .SolveCaptcha import SolveCaptcha 9 | from .WebPageSummarizer import WebPageSummarizer 10 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/util/__init__.py: -------------------------------------------------------------------------------- 1 | from .get_b64_screenshot import get_b64_screenshot 2 | from .highlights import highlight_elements_with_labels, remove_highlight_and_labels 3 | from .selenium import get_web_driver, set_web_driver 4 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/util/get_b64_screenshot.py: -------------------------------------------------------------------------------- 1 | def get_b64_screenshot(wd, element=None): 2 | if element: 3 | screenshot_b64 = element.screenshot_as_base64 4 | else: 5 | screenshot_b64 = wd.get_screenshot_as_base64() 6 | 7 | return screenshot_b64 8 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/util/highlights.py: -------------------------------------------------------------------------------- 1 | def highlight_elements_with_labels(driver, selector): 2 | """ 3 | This function highlights clickable elements like buttons, links, and certain divs and spans 4 | that match the given CSS selector on the webpage with a red border and ensures that labels are visible and positioned 5 | correctly within the viewport. 6 | 7 | :param driver: Instance of Selenium WebDriver. 8 | :param selector: CSS selector for the elements to be highlighted. 
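    :return: The same WebDriver instance; matching visible elements receive a red
        border and numbered labels appended to document.body.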
9 | """ 10 | script = f""" 11 | // Helper function to check if an element is visible 12 | function isElementVisible(element) {{ 13 | var rect = element.getBoundingClientRect(); 14 | if (rect.width <= 0 || rect.height <= 0 || 15 | rect.top >= (window.innerHeight || document.documentElement.clientHeight) || 16 | rect.bottom <= 0 || 17 | rect.left >= (window.innerWidth || document.documentElement.clientWidth) || 18 | rect.right <= 0) {{ 19 | return false; 20 | }} 21 | // Check if any parent element is hidden, which would hide this element as well 22 | var parent = element; 23 | while (parent) {{ 24 | var style = window.getComputedStyle(parent); 25 | if (style.display === 'none' || style.visibility === 'hidden') {{ 26 | return false; 27 | }} 28 | parent = parent.parentElement; 29 | }} 30 | return true; 31 | }} 32 | 33 | // Remove previous labels and styles if they exist 34 | document.querySelectorAll('.highlight-label').forEach(function(label) {{ 35 | label.remove(); 36 | }}); 37 | document.querySelectorAll('.highlighted-element').forEach(function(element) {{ 38 | element.classList.remove('highlighted-element'); 39 | element.removeAttribute('data-highlighted'); 40 | }}); 41 | 42 | // Inject custom style for highlighting elements 43 | var styleElement = document.getElementById('highlight-style'); 44 | if (!styleElement) {{ 45 | styleElement = document.createElement('style'); 46 | styleElement.id = 'highlight-style'; 47 | document.head.appendChild(styleElement); 48 | }} 49 | styleElement.textContent = ` 50 | .highlighted-element {{ 51 | border: 2px solid red !important; 52 | position: relative; 53 | box-sizing: border-box; 54 | }} 55 | .highlight-label {{ 56 | position: absolute; 57 | z-index: 2147483647; 58 | background: yellow; 59 | color: black; 60 | font-size: 25px; 61 | padding: 3px 5px; 62 | border: 1px solid black; 63 | border-radius: 3px; 64 | white-space: nowrap; 65 | box-shadow: 0px 0px 2px #000; 66 | top: -25px; 67 | left: 0; 68 | display: none; 69 | }} 70 | `; 71 | 72 | // Function to create and append a label to the body 73 | function createAndAdjustLabel(element, index) {{ 74 | if (!isElementVisible(element)) return; 75 | 76 | element.classList.add('highlighted-element'); 77 | var label = document.createElement('div'); 78 | label.className = 'highlight-label'; 79 | label.textContent = index.toString(); 80 | label.style.display = 'block'; // Make the label visible 81 | 82 | // Calculate label position 83 | var rect = element.getBoundingClientRect(); 84 | var top = rect.top + window.scrollY - 25; // Position label above the element 85 | var left = rect.left + window.scrollX; 86 | 87 | label.style.top = top + 'px'; 88 | label.style.left = left + 'px'; 89 | 90 | document.body.appendChild(label); // Append the label to the body 91 | }} 92 | 93 | // Select all clickable elements and apply the styles 94 | var allElements = document.querySelectorAll('{selector}'); 95 | var index = 1; 96 | allElements.forEach(function(element) {{ 97 | // Check if the element is not already highlighted and is visible 98 | if (!element.dataset.highlighted && isElementVisible(element)) {{ 99 | element.dataset.highlighted = 'true'; 100 | createAndAdjustLabel(element, index++); 101 | }} 102 | }}); 103 | """ 104 | 105 | driver.execute_script(script) 106 | 107 | return driver 108 | 109 | 110 | def remove_highlight_and_labels(driver): 111 | """ 112 | This function removes all red borders and labels from the webpage elements, 113 | reversing the changes made by the highlight functions using Selenium WebDriver. 
114 | 115 | :param driver: Instance of Selenium WebDriver. 116 | """ 117 | selector = ( 118 | 'a, button, input, textarea, div[onclick], div[role="button"], div[tabindex], span[onclick], ' 119 | 'span[role="button"], span[tabindex]' 120 | ) 121 | script = f""" 122 | // Remove all labels 123 | document.querySelectorAll('.highlight-label').forEach(function(label) {{ 124 | label.remove(); 125 | }}); 126 | 127 | // Remove the added style for red borders 128 | var highlightStyle = document.getElementById('highlight-style'); 129 | if (highlightStyle) {{ 130 | highlightStyle.remove(); 131 | }} 132 | 133 | // Remove inline styles added by highlighting function 134 | document.querySelectorAll('{selector}').forEach(function(element) {{ 135 | element.style.border = ''; 136 | }}); 137 | """ 138 | 139 | driver.execute_script(script) 140 | 141 | return driver 142 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/BrowsingAgent/tools/util/selenium.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | wd = None 4 | 5 | selenium_config = { 6 | "chrome_profile_path": None, 7 | "headless": False, 8 | "full_page_screenshot": True, 9 | } 10 | 11 | 12 | def get_web_driver(): 13 | print("Initializing WebDriver...") 14 | try: 15 | from selenium import webdriver 16 | from selenium.webdriver.chrome.service import Service as ChromeService 17 | 18 | print("Selenium imported successfully.") 19 | except ImportError: 20 | print("Selenium not installed. Please install it with pip install selenium") 21 | raise ImportError 22 | 23 | try: 24 | from webdriver_manager.chrome import ChromeDriverManager 25 | 26 | print("webdriver_manager imported successfully.") 27 | except ImportError: 28 | print( 29 | "webdriver_manager not installed. Please install it with pip install webdriver-manager" 30 | ) 31 | raise ImportError 32 | 33 | try: 34 | from selenium_stealth import stealth 35 | 36 | print("selenium_stealth imported successfully.") 37 | except ImportError: 38 | print( 39 | "selenium_stealth not installed. Please install it with pip install selenium-stealth" 40 | ) 41 | raise ImportError 42 | 43 | global wd, selenium_config 44 | 45 | if wd: 46 | print("Returning existing WebDriver instance.") 47 | return wd 48 | 49 | chrome_profile_path = selenium_config.get("chrome_profile_path", None) 50 | profile_directory = None 51 | user_data_dir = None 52 | if isinstance(chrome_profile_path, str) and os.path.exists(chrome_profile_path): 53 | profile_directory = ( 54 | os.path.split(chrome_profile_path)[-1].strip("\\").rstrip("/") 55 | ) 56 | user_data_dir = os.path.split(chrome_profile_path)[0].strip("\\").rstrip("/") 57 | print(f"Using Chrome profile: {profile_directory}") 58 | print(f"Using Chrome user data dir: {user_data_dir}") 59 | print(f"Using Chrome profile path: {chrome_profile_path}") 60 | 61 | chrome_options = webdriver.ChromeOptions() 62 | print("ChromeOptions initialized.") 63 | 64 | chrome_driver_path = "/usr/bin/chromedriver" 65 | if not os.path.exists(chrome_driver_path): 66 | print( 67 | "ChromeDriver not found at /usr/bin/chromedriver. Installing using webdriver_manager." 
68 | ) 69 | chrome_driver_path = ChromeDriverManager().install() 70 | else: 71 | print(f"ChromeDriver found at {chrome_driver_path}.") 72 | 73 | if selenium_config.get("headless", False): 74 | chrome_options.add_argument("--headless") 75 | print("Headless mode enabled.") 76 | if selenium_config.get("full_page_screenshot", False): 77 | chrome_options.add_argument("--start-maximized") 78 | print("Full page screenshot mode enabled.") 79 | else: 80 | chrome_options.add_argument("--window-size=1920,1080") 81 | print("Window size set to 1920,1080.") 82 | 83 | chrome_options.add_argument("--no-sandbox") 84 | chrome_options.add_argument("--disable-gpu") 85 | chrome_options.add_argument("--disable-dev-shm-usage") 86 | chrome_options.add_argument("--remote-debugging-port=9222") 87 | chrome_options.add_argument("--disable-extensions") 88 | chrome_options.add_argument("--disable-popup-blocking") 89 | chrome_options.add_argument("--ignore-certificate-errors") 90 | chrome_options.add_argument("--disable-blink-features=AutomationControlled") 91 | chrome_options.add_argument("--disable-web-security") 92 | chrome_options.add_argument("--allow-running-insecure-content") 93 | chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"]) 94 | chrome_options.add_experimental_option("useAutomationExtension", False) 95 | print("Chrome options configured.") 96 | 97 | if user_data_dir and profile_directory: 98 | chrome_options.add_argument(f"user-data-dir={user_data_dir}") 99 | chrome_options.add_argument(f"profile-directory={profile_directory}") 100 | print( 101 | f"Using user data dir: {user_data_dir} and profile directory: {profile_directory}" 102 | ) 103 | 104 | try: 105 | wd = webdriver.Chrome( 106 | service=ChromeService(chrome_driver_path), options=chrome_options 107 | ) 108 | print("WebDriver initialized successfully.") 109 | if wd.capabilities["chrome"]["userDataDir"]: 110 | print(f"Profile path in use: {wd.capabilities['chrome']['userDataDir']}") 111 | except Exception as e: 112 | print(f"Error initializing WebDriver: {e}") 113 | raise e 114 | 115 | if not selenium_config.get("chrome_profile_path", None): 116 | stealth( 117 | wd, 118 | languages=["en-US", "en"], 119 | vendor="Google Inc.", 120 | platform="Win32", 121 | webgl_vendor="Intel Inc.", 122 | renderer="Intel Iris OpenGL Engine", 123 | fix_hairline=True, 124 | ) 125 | print("Stealth mode configured.") 126 | 127 | wd.implicitly_wait(3) 128 | print("Implicit wait set to 3 seconds.") 129 | 130 | return wd 131 | 132 | 133 | def set_web_driver(new_wd): 134 | # remove all popups 135 | js_script = """ 136 | var popUpSelectors = ['modal', 'popup', 'overlay', 'dialog']; // Add more selectors that are commonly used for pop-ups 137 | popUpSelectors.forEach(function(selector) { 138 | var elements = document.querySelectorAll(selector); 139 | elements.forEach(function(element) { 140 | // You can choose to hide or remove; here we're removing the element 141 | element.parentNode.removeChild(element); 142 | }); 143 | }); 144 | """ 145 | 146 | new_wd.execute_script(js_script) 147 | 148 | # Close LinkedIn specific popups 149 | if "linkedin.com" in new_wd.current_url: 150 | linkedin_js_script = """ 151 | var linkedinSelectors = ['div.msg-overlay-list-bubble', 'div.ml4.msg-overlay-list-bubble__tablet-height']; 152 | linkedinSelectors.forEach(function(selector) { 153 | var elements = document.querySelectorAll(selector); 154 | elements.forEach(function(element) { 155 | element.parentNode.removeChild(element); 156 | }); 157 | }); 158 | """ 159 | 
new_wd.execute_script(linkedin_js_script) 160 | 161 | new_wd.execute_script("document.body.style.zoom='1.2'") 162 | 163 | global wd 164 | wd = new_wd 165 | 166 | 167 | def set_selenium_config(config): 168 | global selenium_config 169 | selenium_config = config 170 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/agency.py: -------------------------------------------------------------------------------- 1 | from agency_swarm import Agency 2 | 3 | from .AnalystAgent.AnalystAgent import AnalystAgent 4 | from .BrowsingAgent.BrowsingAgent import BrowsingAgent 5 | 6 | 7 | def create_agency(): 8 | browsing_agent = BrowsingAgent() 9 | analyst_agent = AnalystAgent() 10 | 11 | agency = Agency( 12 | [ 13 | analyst_agent, 14 | [analyst_agent, browsing_agent], 15 | ], 16 | shared_instructions="agency_manifesto.md", 17 | temperature=0.0, 18 | max_prompt_tokens=25000, 19 | async_mode="threading", 20 | ) 21 | 22 | return agency 23 | 24 | 25 | agency = create_agency() 26 | 27 | 28 | if __name__ == "__main__": 29 | agency.run_demo() 30 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/ResearchAgency/agency_manifesto.md: -------------------------------------------------------------------------------- 1 | # Research Agency Manifesto 2 | 3 | The Research Agency leverages advanced AI-driven web browsing capabilities to gather, analyze, and synthesize information from diverse online sources. Our mission is to empower users with timely, relevant, and insightful information, supporting informed decision-making and expanding knowledge on specific topics. 4 | 5 | Our Web Browsing Agent operates in a dynamic digital environment, focusing on: 6 | 7 | 1. Efficient navigation and access to a wide range of online resources 8 | 2. Intelligent analysis of information to extract key insights 9 | 3. Effective synthesis of data from multiple sources to provide comprehensive understanding 10 | 11 | We are committed to delivering high-quality, actionable information that directly addresses user inquiries and enhances their decision-making processes. 
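For reference, the positional list passed to `Agency` in `agency.py` above is the communication chart: the first top-level entry is the entry point that receives user requests, and each inner `[sender, recipient]` pair opens a one-way messaging flow. A minimal sketch of extending the chart with an additional agent (the `ReportAgent` name and import are illustrative, not part of this repo):

```python
from agency_swarm import Agency

from .AnalystAgent.AnalystAgent import AnalystAgent
from .BrowsingAgent.BrowsingAgent import BrowsingAgent
from .ReportAgent.ReportAgent import ReportAgent  # hypothetical new agent

analyst_agent = AnalystAgent()
browsing_agent = BrowsingAgent()
report_agent = ReportAgent()

agency = Agency(
    [
        analyst_agent,  # entry point: communicates with the user
        [analyst_agent, browsing_agent],  # analyst can message the browsing agent
        [analyst_agent, report_agent],  # analyst can message the new agent
    ],
    shared_instructions="agency_manifesto.md",
    temperature=0.0,
    max_prompt_tokens=25000,
    async_mode="threading",
)
```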
12 | -------------------------------------------------------------------------------- /src/voice_assistant/agencies/__init__.py: -------------------------------------------------------------------------------- 1 | import importlib 2 | import os 3 | 4 | from agency_swarm import Agency 5 | 6 | 7 | def load_agencies() -> dict[str, Agency]: 8 | agencies = {} 9 | current_dir = os.path.dirname(os.path.abspath(__file__)) 10 | 11 | for agency_folder in os.listdir(current_dir): 12 | agency_path = os.path.join(current_dir, agency_folder) 13 | if os.path.isdir(agency_path) and agency_folder != "__pycache__": 14 | try: 15 | agency_module = importlib.import_module( 16 | f"voice_assistant.agencies.{agency_folder}.agency" 17 | ) 18 | agencies[agency_folder] = getattr(agency_module, "agency") 19 | except (ImportError, AttributeError) as e: 20 | print(f"Error loading agency {agency_folder}: {e}") 21 | 22 | return agencies 23 | 24 | 25 | # Load all agencies 26 | AGENCIES: dict[str, Agency] = load_agencies() 27 | 28 | AGENCIES_AND_AGENTS_STRING = "\n".join( 29 | f"Agency '{agency_name}' has the following agents: {', '.join(agent.name for agent in agency.agents)}" 30 | for agency_name, agency in AGENCIES.items() 31 | ) 32 | print("Available Agencies and Agents:\n", AGENCIES_AND_AGENTS_STRING) # Debug print 33 | -------------------------------------------------------------------------------- /src/voice_assistant/audio.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import logging 3 | 4 | import pyaudio 5 | 6 | from voice_assistant.config import CHANNELS, FORMAT, RATE 7 | 8 | logger = logging.getLogger(__name__) 9 | 10 | 11 | class AudioPlayer: 12 | def __init__(self): 13 | self.p = pyaudio.PyAudio() 14 | self.stream = self.p.open( 15 | format=FORMAT, channels=CHANNELS, rate=RATE, output=True, start=False 16 | ) 17 | self.is_playing = False 18 | 19 | async def play_audio_chunk(self, audio_chunk: bytes, visual_interface): 20 | if not self.is_playing: 21 | self.stream.start_stream() 22 | self.is_playing = True 23 | visual_interface.set_assistant_speaking(True) 24 | 25 | self.stream.write(audio_chunk) 26 | 27 | # Update energy for visualization 28 | visual_interface.process_audio_data(audio_chunk) 29 | 30 | # Allow other tasks to run 31 | await asyncio.sleep(0) 32 | 33 | async def stop_playback(self, visual_interface): 34 | if self.is_playing: 35 | # Add a small delay of silence at the end 36 | silence_duration = 0.2 # 200ms 37 | silence_frames = int(RATE * silence_duration) 38 | silence = b"\x00" * (silence_frames * CHANNELS * 2) 39 | self.stream.write(silence) 40 | 41 | await asyncio.sleep(0.5) 42 | 43 | self.stream.stop_stream() 44 | self.is_playing = False 45 | visual_interface.set_assistant_speaking(False) 46 | logger.debug("Audio playback completed") 47 | 48 | def close(self): 49 | self.stream.close() 50 | self.p.terminate() 51 | 52 | 53 | audio_player = AudioPlayer() 54 | -------------------------------------------------------------------------------- /src/voice_assistant/config.py: -------------------------------------------------------------------------------- 1 | # src/voice_assistant/config.py 2 | import json 3 | import os 4 | 5 | import pyaudio 6 | from dotenv import load_dotenv 7 | 8 | # Load environment variables 9 | load_dotenv() 10 | 11 | # Constants 12 | PREFIX_PADDING_MS = 300 13 | SILENCE_THRESHOLD = 0.5 14 | SILENCE_DURATION_MS = 400 15 | RUN_TIME_TABLE_LOG_JSON = "runtime_time_table.jsonl" 16 | CHUNK = 1024 17 | FORMAT = 
pyaudio.paInt16 18 | CHANNELS = 1 19 | RATE = 24000 20 | 21 | # Load personalization settings 22 | PERSONALIZATION_FILE = os.getenv("PERSONALIZATION_FILE", "./personalization.json") 23 | with open(PERSONALIZATION_FILE, "r") as f: 24 | personalization = json.load(f) 25 | 26 | AI_ASSISTANT_NAME = personalization.get("ai_assistant_name", "Assistant") 27 | USER_NAME = personalization.get("user_name", "User") 28 | 29 | # Load assistant instructions from personalization file 30 | SESSION_INSTRUCTIONS = personalization.get("assistant_instructions", "").format( 31 | ai_assistant_name=AI_ASSISTANT_NAME, user_name=USER_NAME 32 | ) 33 | 34 | # Check for required environment variables 35 | REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "PERSONALIZATION_FILE", "SCRATCH_PAD_DIR"] 36 | MISSING_VARS = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)] 37 | if MISSING_VARS: 38 | raise EnvironmentError( 39 | f"Missing required environment variables: {', '.join(MISSING_VARS)}" 40 | ) 41 | 42 | SCRATCH_PAD_DIR = os.getenv("SCRATCH_PAD_DIR", "./scratchpad") 43 | os.makedirs(SCRATCH_PAD_DIR, exist_ok=True) 44 | -------------------------------------------------------------------------------- /src/voice_assistant/icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/VRSEN/agency-voice-interface/2d9d39ce02d9cb9628e8de79b3543fe05885ad42/src/voice_assistant/icon.png -------------------------------------------------------------------------------- /src/voice_assistant/main.py: -------------------------------------------------------------------------------- 1 | # src/voice_assistant/main.py 2 | import asyncio 3 | import json 4 | import logging 5 | import os 6 | 7 | import pygame 8 | import websockets 9 | from websockets.exceptions import ConnectionClosedError 10 | 11 | from voice_assistant.config import ( 12 | PREFIX_PADDING_MS, 13 | SESSION_INSTRUCTIONS, 14 | SILENCE_DURATION_MS, 15 | SILENCE_THRESHOLD, 16 | ) 17 | from voice_assistant.microphone import AsyncMicrophone 18 | from voice_assistant.tools import TOOL_SCHEMAS 19 | from voice_assistant.utils import base64_encode_audio 20 | from voice_assistant.utils.log_utils import log_ws_event 21 | from voice_assistant.visual_interface import ( 22 | VisualInterface, 23 | run_visual_interface, 24 | ) 25 | from voice_assistant.websocket_handler import process_ws_messages 26 | 27 | # Set up logging 28 | logging.basicConfig( 29 | level=logging.INFO, 30 | format="%(asctime)s.%(msecs)03d - %(levelname)s - %(message)s", 31 | datefmt="%H:%M:%S", 32 | ) 33 | logger = logging.getLogger(__name__) 34 | 35 | 36 | async def realtime_api(): 37 | while True: 38 | try: 39 | api_key = os.getenv("OPENAI_API_KEY") 40 | if not api_key: 41 | logger.error("Please set the OPENAI_API_KEY in your .env file.") 42 | return 43 | 44 | exit_event = asyncio.Event() 45 | 46 | url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01" 47 | headers = { 48 | "Authorization": f"Bearer {api_key}", 49 | "OpenAI-Beta": "realtime=v1", 50 | } 51 | 52 | mic = AsyncMicrophone() 53 | visual_interface = VisualInterface() 54 | 55 | async with websockets.connect(url, extra_headers=headers) as websocket: 56 | logger.info("Connected to the server.") 57 | # Initialize the session with voice capabilities and tools 58 | session_update = { 59 | "type": "session.update", 60 | "session": { 61 | "modalities": ["text", "audio"], 62 | "instructions": SESSION_INSTRUCTIONS, 63 | "voice": "shimmer", 64 | "input_audio_format": "pcm16", 65 | 
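                    # Server-side VAD ("turn_detection" below) lets the API decide when a
                    # user turn ends: "threshold" is the speech-detection sensitivity,
                    # "prefix_padding_ms" keeps audio captured just before speech starts,
                    # and "silence_duration_ms" is how long a pause must last before the
                    # turn is considered finished.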
"output_audio_format": "pcm16", 66 | "turn_detection": { 67 | "type": "server_vad", 68 | "threshold": SILENCE_THRESHOLD, 69 | "prefix_padding_ms": PREFIX_PADDING_MS, 70 | "silence_duration_ms": SILENCE_DURATION_MS, 71 | }, 72 | "tools": TOOL_SCHEMAS, 73 | }, 74 | } 75 | log_ws_event("outgoing", session_update) 76 | await websocket.send(json.dumps(session_update)) 77 | 78 | ws_task = asyncio.create_task( 79 | process_ws_messages(websocket, mic, visual_interface) 80 | ) 81 | visual_task = asyncio.create_task( 82 | run_visual_interface(visual_interface) 83 | ) 84 | 85 | logger.info( 86 | "Conversation started. Speak freely, and the assistant will respond." 87 | ) 88 | mic.start_recording() 89 | logger.info("Recording started. Listening for speech...") 90 | 91 | try: 92 | while not exit_event.is_set(): 93 | await asyncio.sleep(0.01) # Small delay to reduce CPU usage 94 | if not mic.is_receiving: 95 | audio_data = mic.get_audio_data() 96 | if audio_data: 97 | base64_audio = base64_encode_audio(audio_data) 98 | if base64_audio: 99 | audio_event = { 100 | "type": "input_audio_buffer.append", 101 | "audio": base64_audio, 102 | } 103 | log_ws_event("outgoing", audio_event) 104 | await websocket.send(json.dumps(audio_event)) 105 | # Update energy for visualization 106 | visual_interface.process_audio_data(audio_data) 107 | else: 108 | logger.debug("No audio data to send") 109 | except KeyboardInterrupt: 110 | logger.info("Keyboard interrupt received. Closing the connection.") 111 | except Exception as e: 112 | logger.exception( 113 | f"An unexpected error occurred in the main loop: {e}" 114 | ) 115 | finally: 116 | exit_event.set() 117 | mic.stop_recording() 118 | mic.close() 119 | await websocket.close() 120 | visual_interface.set_active(False) 121 | 122 | # Wait for the WebSocket processing task to complete 123 | try: 124 | await ws_task 125 | await visual_task 126 | except Exception as e: 127 | logging.exception(f"Error in WebSocket processing task: {e}") 128 | 129 | # If execution reaches here without exceptions, exit the loop 130 | break 131 | except ConnectionClosedError as e: 132 | if "keepalive ping timeout" in str(e): 133 | logging.warning( 134 | "WebSocket connection lost due to keepalive ping timeout. Reconnecting..." 
135 | ) 136 | await asyncio.sleep(1) # Wait before reconnecting 137 | continue # Retry the connection 138 | logging.exception("WebSocket connection closed unexpectedly.") 139 | break # Exit the loop on other connection errors 140 | except Exception as e: 141 | logging.exception(f"An unexpected error occurred: {e}") 142 | break # Exit the loop on unexpected exceptions 143 | finally: 144 | if "mic" in locals(): 145 | mic.stop_recording() 146 | mic.close() 147 | if "websocket" in locals(): 148 | await websocket.close() 149 | pygame.quit() 150 | 151 | 152 | async def main_async(): 153 | await realtime_api() 154 | 155 | 156 | def main(): 157 | try: 158 | asyncio.run(main_async()) 159 | except KeyboardInterrupt: 160 | logger.info("Program terminated by user") 161 | except Exception as e: 162 | logger.exception(f"An unexpected error occurred: {e}") 163 | 164 | 165 | if __name__ == "__main__": 166 | print("Press Ctrl+C to exit the program.") 167 | main() 168 | -------------------------------------------------------------------------------- /src/voice_assistant/microphone.py: -------------------------------------------------------------------------------- 1 | # src/voice_assistant/microphone.py 2 | import logging 3 | import queue 4 | from typing import Optional 5 | 6 | import pyaudio 7 | 8 | from voice_assistant.config import CHANNELS, CHUNK, FORMAT, RATE 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class AsyncMicrophone: 14 | def __init__(self): 15 | self.p = pyaudio.PyAudio() 16 | self.stream = self.p.open( 17 | format=FORMAT, 18 | channels=CHANNELS, 19 | rate=RATE, 20 | input=True, 21 | frames_per_buffer=CHUNK, 22 | stream_callback=self.callback, 23 | ) 24 | self.queue = queue.Queue() 25 | self.is_recording = False 26 | self.is_receiving = False 27 | logger.info("AsyncMicrophone initialized") 28 | 29 | def callback(self, in_data, frame_count, time_info, status): 30 | if self.is_recording and not self.is_receiving: 31 | self.queue.put(in_data) 32 | return (None, pyaudio.paContinue) 33 | 34 | def start_recording(self): 35 | self.is_recording = True 36 | logger.info("Started recording") 37 | 38 | def stop_recording(self): 39 | self.is_recording = False 40 | logger.info("Stopped recording") 41 | 42 | def start_receiving(self): 43 | self.is_receiving = True 44 | self.is_recording = False 45 | logger.info("Started receiving assistant response") 46 | 47 | def stop_receiving(self): 48 | self.is_receiving = False 49 | logger.info("Stopped receiving assistant response") 50 | 51 | def get_audio_data(self) -> Optional[bytes]: 52 | data = b"" 53 | while not self.queue.empty(): 54 | data += self.queue.get() 55 | return data if data else None 56 | 57 | def close(self): 58 | self.stream.stop_stream() 59 | self.stream.close() 60 | self.p.terminate() 61 | logger.info("AsyncMicrophone closed") 62 | -------------------------------------------------------------------------------- /src/voice_assistant/models.py: -------------------------------------------------------------------------------- 1 | # src/voice_assistant/models.py 2 | from enum import StrEnum 3 | 4 | from pydantic import BaseModel 5 | 6 | 7 | class ModelName(StrEnum): 8 | BASE_MODEL = "gpt-4o" 9 | FAST_MODEL = "gpt-4o-mini" 10 | REASONING_MODEL_LARGE = "o1-preview" 11 | REASONING_MODEL_SMALL = "o1-mini" 12 | 13 | 14 | class WebUrl(BaseModel): 15 | url: str 16 | 17 | 18 | class CreateFileResponse(BaseModel): 19 | file_content: str 20 | file_name: str 21 | 22 | 23 | class FileSelectionResponse(BaseModel): 24 | file: str 25 | model: 
ModelName = ModelName.BASE_MODEL
26 | 
27 | 
28 | class FileUpdateResponse(BaseModel):
29 |     updates: str
30 | 
31 | 
32 | class FileDeleteResponse(BaseModel):
33 |     file: str
34 |     force_delete: bool
35 | 
--------------------------------------------------------------------------------
/src/voice_assistant/tests/test_realtime_connection.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import os
3 | 
4 | import websockets
5 | from dotenv import load_dotenv
6 | 
7 | # Load environment variables from .env file
8 | load_dotenv()
9 | 
10 | 
11 | async def test_realtime_api_connection():
12 |     # Retrieve your API key from the environment variables
13 |     api_key = os.getenv("OPENAI_API_KEY")
14 |     if not api_key:
15 |         print("Please set the OPENAI_API_KEY environment variable in your .env file.")
16 |         return
17 | 
18 |     # Define the WebSocket URL with the appropriate model
19 |     # Update the realtime model snapshot below if a newer one is available
20 |     url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
21 | 
22 |     # Set the required headers
23 |     headers = {
24 |         "Authorization": f"Bearer {api_key}",
25 |         "OpenAI-Beta": "realtime=v1",
26 |     }
27 | 
28 |     # Attempt to establish the WebSocket connection
29 |     try:
30 |         async with websockets.connect(url, extra_headers=headers):
31 |             print("Connected to the server.")
32 |     except websockets.InvalidStatusCode as e:
33 |         print(f"Failed to connect: {e}")
34 |         if e.status_code == 403:
35 |             print("HTTP 403 Forbidden: Access denied.")
36 |             print("You may not have access to the Realtime API.")
37 |         else:
38 |             print(f"HTTP {e.status_code}")
39 |     except Exception as e:
40 |         print(f"An unexpected error occurred: {e}")
41 | 
42 | 
43 | if __name__ == "__main__":
44 |     asyncio.run(test_realtime_api_connection())
45 | 
--------------------------------------------------------------------------------
/src/voice_assistant/tools/CreateFile.py:
--------------------------------------------------------------------------------
1 | import os
2 | 
3 | from agency_swarm.tools import BaseTool
4 | from dotenv import load_dotenv
5 | from pydantic import Field
6 | 
7 | from voice_assistant.config import SCRATCH_PAD_DIR
8 | from voice_assistant.models import CreateFileResponse
9 | from voice_assistant.utils.decorators import timeit_decorator
10 | from voice_assistant.utils.llm_utils import get_structured_output_completion
11 | 
12 | load_dotenv()
13 | 
14 | 
15 | class CreateFile(BaseTool):
16 |     """A tool for creating a new file with generated content based on a prompt."""
17 | 
18 |     file_name: str = Field(..., description="The name of the file to be created.")
19 |     prompt: str = Field(
20 |         ..., description="The prompt to generate content for the new file."
21 |     )
22 | 
23 |     async def run(self):
24 |         result = await create_file(self.file_name, self.prompt)
25 |         return str(result)
26 | 
27 | 
28 | @timeit_decorator
29 | async def create_file(file_name: str, prompt: str) -> dict:
30 |     file_path = os.path.join(SCRATCH_PAD_DIR, file_name)
31 | 
32 |     if os.path.exists(file_path):
33 |         return {"status": "File already exists"}
34 | 
35 |     prompt_structure = f"""
36 | 
37 |     Generate content for a new file based on the user's prompt and the file name.
38 | 
39 | 
40 | 
41 |     Based on the user's prompt and the file name, generate content for a new file.
42 | The file name is: {file_name} 43 | Use the following prompt to generate the content: {prompt} 44 | 45 | """ 46 | 47 | response = await get_structured_output_completion( 48 | prompt_structure, CreateFileResponse 49 | ) 50 | 51 | with open(file_path, "w") as f: 52 | f.write(response.file_content) 53 | 54 | return {"status": "File created", "file_name": response.file_name} 55 | 56 | 57 | if __name__ == "__main__": 58 | import asyncio 59 | 60 | tool = CreateFile(file_name="test.txt", prompt="Write a short story about a robot.") 61 | 62 | print(asyncio.run(tool.run())) 63 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/DeleteFile.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from agency_swarm.tools import BaseTool 4 | from dotenv import load_dotenv 5 | from pydantic import Field 6 | 7 | from voice_assistant.config import SCRATCH_PAD_DIR 8 | from voice_assistant.models import FileDeleteResponse 9 | from voice_assistant.utils.decorators import timeit_decorator 10 | from voice_assistant.utils.llm_utils import get_structured_output_completion 11 | 12 | load_dotenv() 13 | 14 | 15 | class DeleteFile(BaseTool): 16 | """A tool for deleting a file based on a prompt.""" 17 | 18 | prompt: str = Field(..., description="The prompt to identify which file to delete.") 19 | force_delete: bool = Field( 20 | False, description="Whether to force delete the file without confirmation." 21 | ) 22 | 23 | async def run(self): 24 | result = await delete_file(self.prompt, self.force_delete) 25 | return str(result) 26 | 27 | 28 | @timeit_decorator 29 | async def delete_file(prompt: str, force_delete: bool = False) -> dict: 30 | available_files = os.listdir(SCRATCH_PAD_DIR) 31 | 32 | # Select file to delete based on user prompt 33 | file_delete_response = await get_structured_output_completion( 34 | create_file_selection_prompt(available_files, prompt), FileDeleteResponse 35 | ) 36 | 37 | if not file_delete_response.file: 38 | return {"status": "No matching file found"} 39 | 40 | file_path = os.path.join(SCRATCH_PAD_DIR, file_delete_response.file) 41 | 42 | if not os.path.exists(file_path): 43 | return {"status": "File does not exist", "file_name": file_delete_response.file} 44 | 45 | if not force_delete: 46 | return { 47 | "status": "Confirmation required", 48 | "file_name": file_delete_response.file, 49 | "message": f"Are you sure you want to delete '{file_delete_response.file}'? Say force delete if you want to delete.", 50 | } 51 | 52 | os.remove(file_path) 53 | return {"status": "File deleted", "file_name": file_delete_response.file} 54 | 55 | 56 | def create_file_selection_prompt(available_files, user_prompt): 57 | return f""" 58 | 59 | Select a file from the available files to delete. 60 | 61 | 62 | 63 | Based on the user's prompt and the list of available files, infer which file the user wants to delete. 64 | If no file matches, return an empty string for 'file'. 
65 | 66 | 67 | 68 | {", ".join(available_files)} 69 | 70 | 71 | 72 | {user_prompt} 73 | 74 | """ 75 | 76 | 77 | if __name__ == "__main__": 78 | import asyncio 79 | 80 | tool = DeleteFile(prompt="Delete the test file", force_delete=True) 81 | print(asyncio.run(tool.run())) 82 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/DraftGmail.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import base64 3 | import os 4 | from datetime import datetime 5 | from email.mime.text import MIMEText 6 | from typing import Any, Dict, Optional 7 | 8 | from agency_swarm.tools import BaseTool 9 | from pydantic import Field, PrivateAttr 10 | 11 | from voice_assistant.utils.google_services_utils import GoogleServicesUtils 12 | 13 | 14 | class DraftGmail(BaseTool): 15 | """A tool to draft an email. Either reply_to_id or recipient must be provided.""" 16 | 17 | subject: Optional[str] = Field(None, description="Subject of the email") 18 | content: str = Field(..., description="Content of the email") 19 | recipient: Optional[str] = Field( 20 | None, 21 | description="Recipient of the email. If not provided, the email will be sent to the recipient in the reply_to_id", 22 | ) 23 | reply_to_id: Optional[str] = Field(None, description="ID of the email to reply to") 24 | _service: Optional[GoogleServicesUtils] = PrivateAttr(None) 25 | 26 | async def run(self) -> Dict[str, Any]: 27 | self._service = await GoogleServicesUtils.authenticate_service("gmail") 28 | return await self.draft_email() 29 | 30 | async def draft_email(self) -> Dict[str, Any]: 31 | try: 32 | message = await asyncio.to_thread(self._create_message) 33 | draft = await asyncio.to_thread( 34 | lambda: self._service.users() 35 | .drafts() 36 | .create(userId="me", body={"message": message}) 37 | .execute() 38 | ) 39 | return { 40 | "draft_id": draft["id"], 41 | "message": "Email draft created successfully", 42 | "drafted_at": datetime.utcnow().isoformat(), 43 | } 44 | except Exception as e: 45 | return {"error": str(e), "message": "Failed to create email draft"} 46 | 47 | def _create_message(self) -> Dict[str, Any]: 48 | message = MIMEText(self.content) 49 | thread_id = None 50 | 51 | if self.reply_to_id: 52 | original_message = ( 53 | self._service.users() 54 | .messages() 55 | .get(userId="me", id=self.reply_to_id, format="full") 56 | .execute() 57 | ) 58 | thread_id = original_message.get("threadId") 59 | if not thread_id: 60 | raise ValueError("Original message does not have a threadId.") 61 | 62 | headers = original_message["payload"]["headers"] 63 | original_subject = next( 64 | (header["value"] for header in headers if header["name"] == "Subject"), 65 | "No Subject", 66 | ) 67 | original_from = next( 68 | (header["value"] for header in headers if header["name"] == "From"), 69 | "Unknown", 70 | ) 71 | message["to"] = original_from 72 | message["subject"] = f"Re: {original_subject}" 73 | message["In-Reply-To"] = self.reply_to_id 74 | message["References"] = self.reply_to_id 75 | else: 76 | if self.recipient is None: 77 | raise ValueError("Recipient is required for new emails") 78 | 79 | if self.subject is None: 80 | raise ValueError("Subject is required for new emails") 81 | 82 | message["to"] = self.recipient 83 | message["subject"] = self.subject 84 | 85 | message["from"] = os.getenv("EMAIL_SENDER", "sender@example.com") 86 | raw_message = base64.urlsafe_b64encode(message.as_bytes()).decode("utf-8") 87 | return {"raw": raw_message, 
"threadId": thread_id} 88 | 89 | 90 | if __name__ == "__main__": 91 | import asyncio 92 | 93 | async def main(): 94 | # Example usage for a new email 95 | tool = DraftGmail( 96 | subject="Important Meeting", 97 | content="Hello,\n\nThis is a draft email for our upcoming meeting.\n\nBest regards,\nYour Name", 98 | recipient="recipient@example.com", 99 | ) 100 | result = await tool.run() 101 | print("New email draft:", result) 102 | 103 | # Example usage for a reply 104 | reply_tool = DraftGmail( 105 | content="Thank you for your email. I'll review the draft and get back to you soon.", 106 | reply_to_id="1929188e90b212c3", # Replace with an actual email ID 107 | ) 108 | reply_result = await reply_tool.run() 109 | print("Reply draft:", reply_result) 110 | 111 | asyncio.run(main()) 112 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/FetchDailyMeetingSchedule.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import logging 3 | from datetime import UTC, datetime 4 | 5 | from agency_swarm.tools import BaseTool 6 | from dotenv import load_dotenv 7 | from pydantic import Field 8 | 9 | from voice_assistant.utils.google_services_utils import GoogleServicesUtils 10 | 11 | load_dotenv() 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | class FetchDailyMeetingSchedule(BaseTool): 17 | """A tool to fetch and format the user's daily meeting schedule from Google Calendar.""" 18 | 19 | date: str = Field( 20 | default_factory=lambda: datetime.now(UTC).strftime("%Y-%m-%d"), 21 | description="The date for which to fetch the meeting schedule. Defaults to today if not provided.", 22 | ) 23 | 24 | async def run(self) -> str: 25 | try: 26 | meetings = await self.fetch_meetings(self.date) 27 | formatted_meetings = self.format_meetings(meetings) 28 | return formatted_meetings 29 | except Exception as e: 30 | logger.error(f"Error in FetchDailyMeetingSchedule: {str(e)}") 31 | return f"An error occurred while fetching the meeting schedule: {str(e)}" 32 | 33 | async def fetch_meetings(self, date) -> list[dict]: 34 | service = await GoogleServicesUtils.authenticate_service("calendar") 35 | events_result = await asyncio.to_thread( 36 | service.events() 37 | .list( 38 | calendarId="primary", 39 | timeMin=f"{date}T00:00:00Z", 40 | timeMax=f"{date}T23:59:59Z", 41 | singleEvents=True, 42 | orderBy="startTime", 43 | ) 44 | .execute 45 | ) 46 | return events_result.get("items", []) 47 | 48 | def format_meetings(self, meetings) -> str: 49 | formatted = [] 50 | for meeting in meetings: 51 | start_time = datetime.fromisoformat( 52 | meeting["start"].get("dateTime", meeting["start"].get("date")) 53 | ) 54 | end_time = datetime.fromisoformat( 55 | meeting["end"].get("dateTime", meeting["end"].get("date")) 56 | ) 57 | 58 | formatted_meeting = f"{start_time.strftime('%I:%M %p')} - {end_time.strftime('%I:%M %p')}: {meeting.get('summary', 'Untitled Event')}" 59 | 60 | if meeting.get("location"): 61 | formatted_meeting += f" | Location: {meeting['location']}" 62 | 63 | if meeting.get("description"): 64 | description = meeting["description"].split("\n")[0] 65 | formatted_meeting += f" | Description: {description}" 66 | 67 | formatted.append(formatted_meeting) 68 | 69 | if not formatted: 70 | return "No meetings scheduled for today." 
71 | 72 | return "Today's Agenda:\n" + "\n".join(formatted) 73 | 74 | 75 | if __name__ == "__main__": 76 | tool = FetchDailyMeetingSchedule() 77 | result = asyncio.run(tool.run()) 78 | print(result) 79 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/GetCurrentDateTime.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | from datetime import datetime 3 | 4 | from agency_swarm.tools import BaseTool 5 | 6 | 7 | class GetCurrentDateTime(BaseTool): 8 | """A tool to get the current date, time, and day of the week.""" 9 | 10 | async def run(self) -> str: 11 | return datetime.now().strftime("%A, %Y-%m-%d %H:%M:%S") 12 | 13 | 14 | if __name__ == "__main__": 15 | tool = GetCurrentDateTime() 16 | print(asyncio.run(tool.run())) 17 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/GetGmailSummary.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import base64 3 | import logging 4 | import re 5 | from datetime import datetime, timedelta 6 | from typing import List, Optional 7 | 8 | from agency_swarm.tools import BaseTool 9 | from dotenv import load_dotenv 10 | from pydantic import Field, PrivateAttr 11 | 12 | from voice_assistant.models import ModelName 13 | from voice_assistant.utils.google_services_utils import GoogleServicesUtils 14 | from voice_assistant.utils.llm_utils import get_model_completion 15 | 16 | logger = logging.getLogger(__name__) 17 | 18 | load_dotenv() 19 | 20 | 21 | class GetGmailSummary(BaseTool): 22 | """A tool to summarize unread Gmail messages from the last two days.""" 23 | 24 | max_results: int = Field( 25 | default=10, 26 | description="Maximum number of unread emails to fetch. Defaults to 10.", 27 | ) 28 | _service: Optional[GoogleServicesUtils] = PrivateAttr(None) 29 | 30 | async def run(self) -> str: 31 | """ 32 | Main execution method to fetch and summarize unread Gmail messages. 33 | """ 34 | logger.info("Starting Gmail authentication.") 35 | self._service = await GoogleServicesUtils.authenticate_service("gmail") 36 | 37 | logger.info("Fetching unread messages.") 38 | messages = await self._fetch_unread_messages() 39 | 40 | if not messages: 41 | logger.info("No unread messages found.") 42 | return "No unread Gmail messages found in the last two days." 43 | 44 | logger.info("Summarizing messages using GPT-4o-mini.") 45 | summary = await self._summarize_messages_with_gpt(messages) 46 | 47 | logger.info("Gmail summary completed.") 48 | return summary 49 | 50 | async def _fetch_unread_messages(self) -> List[dict]: 51 | """ 52 | Fetch unread messages from the last two days. 
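        Builds a Gmail query of the form "is:unread after:YYYY/MM/DD" and fetches each match in full format.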
53 | """ 54 | two_days_ago = (datetime.now() - timedelta(days=2)).strftime("%Y/%m/%d") 55 | query = f"is:unread after:{two_days_ago}" 56 | logger.info(f"Executing query: {query}") 57 | 58 | results = await asyncio.to_thread( 59 | lambda: self._service.users() 60 | .messages() 61 | .list(userId="me", q=query, maxResults=self.max_results) 62 | .execute() 63 | ) 64 | 65 | messages = results.get("messages", []) 66 | full_messages = [] 67 | logger.info(f"Number of messages fetched: {len(messages)}") 68 | 69 | for message in messages: 70 | msg = await asyncio.to_thread( 71 | lambda: self._service.users() 72 | .messages() 73 | .get(userId="me", id=message["id"], format="full") 74 | .execute() 75 | ) 76 | msg["id"] = message["id"] 77 | full_messages.append(msg) 78 | 79 | logger.info("All messages fetched successfully.") 80 | return full_messages 81 | 82 | async def _summarize_messages_with_gpt(self, messages: List[dict]) -> str: 83 | """ 84 | Summarize the given messages using GPT model. 85 | """ 86 | full_texts = [] 87 | for msg in messages: 88 | email_data = self._extract_email_data(msg) 89 | full_texts.append(self._format_email_text(email_data)) 90 | 91 | prompt = ( 92 | "Please provide a summary of the following emails. " 93 | "For each email, include the email ID, subject, sender, date, " 94 | "and a brief summary of the content without too many details.\n\n" 95 | ) 96 | summary = await get_model_completion( 97 | prompt + "\n\n".join(full_texts), 98 | ModelName.FAST_MODEL, 99 | ) 100 | return summary 101 | 102 | def _extract_email_data(self, msg: dict) -> dict: 103 | """ 104 | Extract relevant data from an email message. 105 | """ 106 | payload = msg["payload"] 107 | headers = payload.get("headers", []) 108 | return { 109 | "id": msg.get("id", "Unknown ID"), 110 | "subject": next( 111 | (h["value"] for h in headers if h["name"] == "Subject"), "No Subject" 112 | ), 113 | "from": next( 114 | (h["value"] for h in headers if h["name"] == "From"), "Unknown Sender" 115 | ), 116 | "date": next( 117 | (h["value"] for h in headers if h["name"] == "Date"), "Unknown Date" 118 | ), 119 | "body": self._extract_body(payload), 120 | } 121 | 122 | def _format_email_text(self, email_data: dict) -> str: 123 | """ 124 | Format email data into a string representation. 125 | """ 126 | return ( 127 | f"Email ID: {email_data['id']}\n" 128 | f"From: {email_data['from']}\n" 129 | f"Date: {email_data['date']}\n" 130 | f"Subject: {email_data['subject']}\n" 131 | f"Body: {email_data['body']}\n" 132 | ) 133 | 134 | def _extract_body(self, payload: dict) -> str: 135 | """ 136 | Extract the body from an email payload, handling various MIME types and nested parts. 137 | """ 138 | if "parts" in payload: 139 | body = self._recursive_extract(payload["parts"]) 140 | if body: 141 | return body 142 | 143 | # Fallback to the main body if no parts are found 144 | data = payload.get("body", {}).get("data", "") 145 | if data: 146 | try: 147 | decoded_body = base64.urlsafe_b64decode(data).decode("utf-8") 148 | return self._remove_links(decoded_body) 149 | except Exception as e: 150 | logger.error(f"Error decoding main body: {e}") 151 | return "No body content" 152 | 153 | def _recursive_extract(self, parts: List[dict]) -> str: 154 | """ 155 | Recursively extract the body from email parts. 
156 | """ 157 | for part in parts: 158 | mime_type = part.get("mimeType", "") 159 | body = part.get("body", {}) 160 | data = body.get("data", "") 161 | 162 | if data and mime_type in ["text/plain", "text/html"]: 163 | try: 164 | decoded_body = base64.urlsafe_b64decode(data).decode("utf-8") 165 | return self._remove_links(decoded_body) 166 | except Exception as e: 167 | logger.error(f"Error decoding {mime_type} part: {e}") 168 | elif "parts" in part: 169 | result = self._recursive_extract(part["parts"]) 170 | if result: 171 | return result 172 | return "" 173 | 174 | def _remove_links(self, text: str) -> str: 175 | """ 176 | Remove URLs from the given text. 177 | """ 178 | url_pattern = re.compile(r"http\S+|www\.\S+") 179 | cleaned_text = url_pattern.sub("", text) 180 | logger.debug("Removed links from the email body.") 181 | return cleaned_text 182 | 183 | 184 | if __name__ == "__main__": 185 | 186 | async def main(): 187 | tool = GetGmailSummary(max_results=5) 188 | result = await tool.run() 189 | print(result) 190 | 191 | asyncio.run(main()) 192 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/GetResponse.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import logging 3 | from typing import Any, Optional 4 | 5 | from agency_swarm import Agency, get_openai_client 6 | from agency_swarm.threads import Thread 7 | from agency_swarm.tools import BaseTool 8 | from openai import OpenAI 9 | from pydantic import Field, PrivateAttr, field_validator 10 | 11 | from voice_assistant.agencies import AGENCIES, AGENCIES_AND_AGENTS_STRING 12 | from voice_assistant.utils.decorators import timeit_decorator 13 | 14 | logger = logging.getLogger(__name__) 15 | 16 | 17 | class GetResponse(BaseTool): 18 | """ 19 | Checks the status of a task or retrieves the response from a specific agent within a specified agency. 20 | 21 | Use this tool after initiating a long-running task with 'SendMessageAsync'. 22 | Use the same parameters you used with 'SendMessageAsync' to check if the task is completed. 23 | If the task is completed, this tool will return the agent's response. 24 | If the task is still in progress, it will inform you accordingly. 25 | 26 | Available Agencies and Agents: 27 | {agency_agents} 28 | """ 29 | 30 | agency_name: str = Field(..., description="The name of the agency.") 31 | agent_name: Optional[str] = Field( 32 | None, description="The name of the agent, or None to use the default agent." 33 | ) 34 | _client: OpenAI = PrivateAttr() 35 | 36 | def __init__(self, **kwargs): 37 | super().__init__(**kwargs) 38 | self._client = get_openai_client() 39 | 40 | @field_validator("agency_name", mode="before") 41 | def validate_agency_name(cls, value: str) -> str: 42 | if value not in AGENCIES: 43 | available = ", ".join(AGENCIES.keys()) 44 | raise ValueError( 45 | f"Agency '{value}' not found. Available agencies: {available}" 46 | ) 47 | return value 48 | 49 | @field_validator("agent_name", mode="before") 50 | def validate_agent_name(cls, value: Optional[str]) -> Optional[str]: 51 | if value: 52 | agent_names = [ 53 | agent.name for agency in AGENCIES.values() for agent in agency.agents 54 | ] 55 | if value not in agent_names: 56 | available = ", ".join(agent_names) 57 | raise ValueError( 58 | f"Agent '{value}' not found. 
Available agents: {available}" 59 | ) 60 | return value 61 | 62 | @timeit_decorator 63 | async def run(self) -> str: 64 | """ 65 | Executes the GetResponse tool to check task status or retrieve agent response. 66 | 67 | Returns: 68 | str: The result message based on the task status. 69 | """ 70 | agency: Agency = AGENCIES.get(self.agency_name) 71 | 72 | # Determine the thread based on agent_name 73 | if not self.agent_name or self.agent_name == agency.ceo.name: 74 | thread = agency.main_thread 75 | else: 76 | thread = agency.agents_and_threads.get(agency.ceo.name, {}).get( 77 | self.agent_name 78 | ) 79 | 80 | if not thread: 81 | return f"Error: No thread found between '{agency.ceo.name}' and '{self.agent_name}'" 82 | if not thread.thread or not thread.id: 83 | return f"Error: Thread between '{agency.ceo.name}' and '{self.agent_name}' is not initialized" 84 | 85 | run = await asyncio.to_thread(self._get_last_run, thread) 86 | 87 | if not run: 88 | return ( 89 | "System Notification: 'Agent is ready to receive a message. " 90 | "Please send a message with the 'SendMessageAsync' tool.'" 91 | ) 92 | 93 | if run.status in ["queued", "in_progress", "requires_action"]: 94 | return ( 95 | "System Notification: 'Task is not completed yet. Please tell the user to wait " 96 | "and try again later.'" 97 | ) 98 | 99 | if run.status == "failed": 100 | return ( 101 | f"System Notification: 'Agent run failed with error: {run.last_error.message}. " 102 | "You may send another message with the 'SendMessageAsync' tool.'" 103 | ) 104 | 105 | messages = await asyncio.to_thread( 106 | self._client.beta.threads.messages.list, thread_id=thread.id, order="desc" 107 | ) 108 | 109 | if messages.data and messages.data[0].content: 110 | response_text = messages.data[0].content[0].text.value 111 | return f"{self.agent_name}'s Response: '{response_text}'" 112 | else: 113 | return "System Notification: 'No response found from the agent.'" 114 | 115 | def _get_last_run(self, thread: Thread) -> Optional[Any]: 116 | runs = self._client.beta.threads.runs.list( 117 | thread_id=thread.id, 118 | order="desc", 119 | ) 120 | return runs.data[0] if runs.data else None 121 | 122 | 123 | # Dynamically update the class docstring with the list of agencies and their agents 124 | GetResponse.__doc__ = GetResponse.__doc__.format( 125 | agency_agents=AGENCIES_AND_AGENTS_STRING 126 | ) 127 | 128 | 129 | if __name__ == "__main__": 130 | 131 | async def main(): 132 | # Example usage for a specific thread 133 | tool = GetResponse( 134 | agency_name="ResearchAgency", 135 | agent_name="BrowsingAgent", 136 | ) 137 | response = await tool.run() 138 | print(response) 139 | 140 | asyncio.run(main()) 141 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/GetScreenDescription.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import base64 3 | import io 4 | import os 5 | import tempfile 6 | 7 | import aiohttp 8 | from agency_swarm.tools import BaseTool 9 | from dotenv import load_dotenv 10 | from PIL import Image 11 | from pydantic import Field 12 | 13 | from voice_assistant.models import ModelName 14 | from voice_assistant.utils.decorators import timeit_decorator 15 | 16 | load_dotenv() 17 | 18 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") 19 | 20 | 21 | class GetScreenDescription(BaseTool): 22 | """Get a text description of the user's active window.""" 23 | 24 | prompt: str = Field(..., description="Prompt to analyze the 
screenshot") 25 | 26 | async def run(self) -> str: 27 | """Execute the screen description tool.""" 28 | screenshot_path = await self.take_screenshot() 29 | 30 | try: 31 | file_content = await asyncio.to_thread(self._read_file, screenshot_path) 32 | resized_content = await asyncio.to_thread(self._resize_image, file_content) 33 | encoded_image = base64.b64encode(resized_content).decode("utf-8") 34 | analysis = await self.analyze_image(encoded_image) 35 | finally: 36 | asyncio.create_task(asyncio.to_thread(os.remove, screenshot_path)) 37 | 38 | return analysis 39 | 40 | @timeit_decorator 41 | async def take_screenshot(self) -> str: 42 | """Capture a screenshot of the active window.""" 43 | with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp_file: 44 | screenshot_path = tmp_file.name 45 | 46 | bounds = await self._get_active_window_bounds() 47 | if not bounds: 48 | raise RuntimeError("Unable to retrieve the active window bounds.") 49 | 50 | x, y, width, height = bounds 51 | 52 | process = await asyncio.create_subprocess_exec( 53 | "screencapture", 54 | "-R", 55 | f"{x},{y},{width},{height}", 56 | screenshot_path, 57 | stdout=asyncio.subprocess.PIPE, 58 | stderr=asyncio.subprocess.PIPE, 59 | ) 60 | 61 | stdout, stderr = await process.communicate() 62 | 63 | if process.returncode != 0: 64 | raise RuntimeError(f"screencapture failed: {stderr.decode().strip()}") 65 | 66 | if not os.path.exists(screenshot_path): 67 | raise FileNotFoundError(f"Screenshot was not created at {screenshot_path}") 68 | 69 | return screenshot_path 70 | 71 | async def _get_active_window_bounds(self) -> tuple: 72 | """Retrieve the bounds of the active window.""" 73 | script = """ 74 | tell application "System Events" 75 | set frontApp to first application process whose frontmost is true 76 | tell frontApp 77 | try 78 | set win to front window 79 | set {x, y} to position of win 80 | set {w, h} to size of win 81 | return {x, y, w, h} 82 | on error 83 | return {} 84 | end try 85 | end tell 86 | end tell 87 | """ 88 | process = await asyncio.create_subprocess_exec( 89 | "osascript", 90 | "-e", 91 | script, 92 | stdout=asyncio.subprocess.PIPE, 93 | stderr=asyncio.subprocess.PIPE, 94 | ) 95 | 96 | stdout, stderr = await process.communicate() 97 | 98 | if process.returncode != 0: 99 | return None 100 | 101 | output = stdout.decode().strip() 102 | if not output: 103 | return None 104 | 105 | try: 106 | bounds = eval(output) 107 | return bounds if isinstance(bounds, tuple) and len(bounds) == 4 else None 108 | except Exception as e: 109 | print(f"Error parsing bounds: {e}") 110 | return None 111 | 112 | @timeit_decorator 113 | async def analyze_image(self, base64_image: str) -> str: 114 | """Send the encoded image and prompt to the OpenAI API for analysis.""" 115 | headers = { 116 | "Content-Type": "application/json", 117 | "Authorization": f"Bearer {OPENAI_API_KEY}", 118 | } 119 | 120 | payload = { 121 | "model": ModelName.FAST_MODEL, 122 | "messages": [ 123 | { 124 | "role": "system", 125 | "content": "You are an expert at analyzing screenshots and describing their content. Your output should be a concise and informative description of the screenshot, focusing on the aspects mentioned in the user's prompt. 
Pay close attention to the specific questions or elements the user is asking about.",
126 |                 },
127 |                 {
128 |                     "role": "user",
129 |                     "content": [
130 |                         {
131 |                             "type": "text",
132 |                             "text": f"Analyze this screenshot, paying particular attention to the following prompt: {self.prompt}",
133 |                         },
134 |                         {
135 |                             "type": "image_url",
136 |                             "image_url": {
137 |                                 "url": f"data:image/png;base64,{base64_image}"
138 |                             },
139 |                         },
140 |                     ],
141 |                 },
142 |             ],
143 |             "max_tokens": 500,
144 |         }
145 | 
146 |         async with aiohttp.ClientSession() as session:
147 |             async with session.post(
148 |                 "https://api.openai.com/v1/chat/completions",
149 |                 headers=headers,
150 |                 json=payload,
151 |             ) as response:
152 |                 if response.status != 200:
153 |                     error = await response.text()
154 |                     raise RuntimeError(f"OpenAI API error: {error}")
155 |                 result = await response.json()
156 |                 return result["choices"][0]["message"]["content"]
157 | 
158 |     def _read_file(self, path: str) -> bytes:
159 |         """Read and return the content of a file."""
160 |         with open(path, "rb") as image_file:
161 |             return image_file.read()
162 | 
163 |     def _resize_image(self, image_data: bytes) -> bytes:
164 |         """Resize the image to reduce payload size while preserving aspect ratio."""
165 |         with Image.open(io.BytesIO(image_data)) as img:
166 |             img.thumbnail((1600, 1200), Image.LANCZOS)  # LANCZOS replaces ANTIALIAS, which was removed in Pillow 10
167 |             with io.BytesIO() as output:
168 |                 img.save(output, format="PNG")
169 |                 return output.getvalue()
170 | 
171 | 
172 | if __name__ == "__main__":
173 | 
174 |     async def test_tool():
175 |         tool = GetScreenDescription(
176 |             prompt="What do you see in this screenshot? Describe the main elements."
177 |         )
178 |         try:
179 |             result = await tool.run()
180 |             print(result)
181 |         except Exception as e:
182 |             print(f"Error during test: {e}")
183 | 
184 |     asyncio.run(test_tool())
185 | 
--------------------------------------------------------------------------------
/src/voice_assistant/tools/OpenBrowser.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import json
3 | import logging
4 | import os
5 | import webbrowser
6 | from concurrent.futures import ThreadPoolExecutor
7 | from enum import Enum
8 | 
9 | from agency_swarm.tools import BaseTool
10 | from pydantic import Field
11 | 
12 | from voice_assistant.models import WebUrl
13 | from voice_assistant.utils.decorators import timeit_decorator
14 | 
15 | logger = logging.getLogger(__name__)
16 | 
17 | with open(os.getenv("PERSONALIZATION_FILE")) as f:
18 |     personalization = json.load(f)
19 | browser = personalization["browser"]
20 | 
21 | 
22 | class OpenBrowser(BaseTool):
23 |     """Open a browser with a specified URL."""
24 | 
25 |     chain_of_thought: str = Field(
26 |         ..., description="Step-by-step thought process to determine the URL to open."
27 | ) 28 | url: str = Field(..., description="The URL to open") 29 | 30 | @timeit_decorator 31 | async def run(self): 32 | if self.url: 33 | logger.info(f"📖 open_browser() Opening URL: {self.url}") 34 | loop = asyncio.get_running_loop() 35 | with ThreadPoolExecutor() as pool: 36 | await loop.run_in_executor(pool, webbrowser.get(browser).open, self.url) 37 | return {"status": "Browser opened", "url": self.url} 38 | return {"status": "No URL found"} 39 | 40 | 41 | if __name__ == "__main__": 42 | tool = OpenBrowser( 43 | chain_of_thought="I want to open my favorite website", 44 | url="https://www.linkedin.com", 45 | ) 46 | asyncio.run(tool.run()) 47 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/SendMessage.py: -------------------------------------------------------------------------------- 1 | """ 2 | This tool allows you to send a message to a specific agent within a specified agency and receive a response. 3 | 4 | To use this tool, provide the message you want to send, the name of the agency to which the agent belongs, and optionally the name of the agent to whom the message should be sent. If the agent name is not specified, the message will be sent to the default agent for that agency. 5 | """ 6 | 7 | import asyncio 8 | 9 | from agency_swarm.tools import BaseTool 10 | from pydantic import Field 11 | 12 | from voice_assistant.agencies import AGENCIES, AGENCIES_AND_AGENTS_STRING 13 | from voice_assistant.utils.decorators import timeit_decorator 14 | 15 | 16 | class SendMessage(BaseTool): 17 | """ 18 | Sends a message to a specific agent within a specified agency and waits for an immediate response. 19 | 20 | Use this tool for direct, synchronous communication with agents for tasks that can be completed quickly. 21 | The agent processes the message and returns a response immediately. 22 | If 'agent_name' is not provided, the message is sent to the main agent in the agency. 23 | 24 | To continue the dialogue, invoke this tool again with your follow-up message. 25 | Note: You are responsible for relaying the agent's responses back to the user. 26 | Do not send more than one message at a time. 27 | 28 | Available Agencies and Agents: 29 | {agency_agents} 30 | """ 31 | 32 | message: str = Field(..., description="The message to be sent.") 33 | agency_name: str = Field( 34 | ..., description="The name of the agency to send the message to." 35 | ) 36 | agent_name: str | None = Field( 37 | None, 38 | description="The name of the agent to send the message to, or None to use the default agent.", 39 | ) 40 | 41 | def __init__(self, **data): 42 | super().__init__(**data) 43 | 44 | @timeit_decorator 45 | async def run(self) -> str: 46 | result = await self._send_message() 47 | return str(result) 48 | 49 | async def _send_message(self) -> str: 50 | agency = AGENCIES.get(self.agency_name) 51 | if agency: 52 | recipient_agent = None 53 | if self.agent_name: 54 | recipient_agent = next( 55 | (agent for agent in agency.agents if agent.name == self.agent_name), 56 | None, 57 | ) 58 | if not recipient_agent: 59 | return f"Agent '{self.agent_name}' not found in agency '{self.agency_name}'. 
Available agents: {', '.join(agent.name for agent in agency.agents)}" 60 | else: 61 | recipient_agent = None 62 | 63 | response = await asyncio.to_thread( 64 | agency.get_completion, 65 | message=self.message, 66 | recipient_agent=recipient_agent, 67 | ) 68 | return response 69 | else: 70 | return f"Agency '{self.agency_name}' not found" 71 | 72 | 73 | # Dynamically update the class docstring with the list of agencies and their agents 74 | SendMessage.__doc__ = SendMessage.__doc__.format( 75 | agency_agents=AGENCIES_AND_AGENTS_STRING 76 | ) 77 | 78 | 79 | if __name__ == "__main__": 80 | tool = SendMessage( 81 | message="Hello, how are you?", 82 | agency_name="ResearchAgency", 83 | agent_name="BrowsingAgent", 84 | ) 85 | print(asyncio.run(tool.run())) 86 | 87 | tool = SendMessage( 88 | message="Hello, how are you?", 89 | agency_name="ResearchAgency", 90 | agent_name=None, 91 | ) 92 | print(asyncio.run(tool.run())) 93 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/SendMessageAsync.py: -------------------------------------------------------------------------------- 1 | """ 2 | This tool allows you to send a message to a specific agent within a specified agency without waiting for a response. 3 | 4 | To use this tool, provide the message you want to send, the name of the agency to which the agent belongs, and optionally the name of the agent to whom the message should be sent. If the agent name is not specified, the message will be sent to the default agent for that agency. 5 | """ 6 | 7 | import asyncio 8 | import logging 9 | 10 | from agency_swarm.agency import Agency 11 | from agency_swarm.threads import Thread 12 | from agency_swarm.threads.thread_async import ThreadAsync 13 | from agency_swarm.tools import BaseTool 14 | from pydantic import Field 15 | 16 | from voice_assistant.agencies import AGENCIES, AGENCIES_AND_AGENTS_STRING 17 | from voice_assistant.utils.decorators import timeit_decorator 18 | 19 | logger = logging.getLogger(__name__) 20 | 21 | 22 | class SendMessageAsync(BaseTool): 23 | """ 24 | Sends a message to a specific agent within a specified agency without waiting for an immediate response. 25 | 26 | Use this tool to initiate long-running tasks asynchronously. 27 | After sending the message, you can use the 'GetResponse' tool with the same 'agency_name' and 'agent_name' values to check the status or retrieve the agent's response. 28 | This allows you to perform other tasks or interact with the user while the agent processes the request. 29 | 30 | Available Agencies and Agents: 31 | {agency_agents} 32 | """ 33 | 34 | message: str = Field(..., description="The message to be sent.") 35 | agency_name: str = Field( 36 | ..., description="The name of the agency to send the message to." 
37 |     )
38 |     agent_name: str | None = Field(
39 |         None,
40 |         description="The name of the agent to send the message to, or None to use the default agent.",
41 |     )
42 | 
43 |     @timeit_decorator
44 |     async def run(self) -> str:
45 |         result = await self.send_message()
46 |         return str(result)
47 | 
48 |     async def send_message(self) -> str:
49 |         agency: Agency | None = AGENCIES.get(self.agency_name)
50 |         if not agency:
51 |             return f"Agency '{self.agency_name}' not found"
52 |         recipient_agent = None  # stays None when the default (CEO) agent handles the message, avoiding a NameError below
53 |         if not self.agent_name or self.agent_name == agency.ceo.name:
54 |             thread: Thread = agency.main_thread
55 |         else:
56 |             recipient_agent = next(
57 |                 (agent for agent in agency.agents if agent.name == self.agent_name),
58 |                 None,
59 |             )
60 |             if not recipient_agent:
61 |                 return f"Agent '{self.agent_name}' not found in agency '{self.agency_name}'. Available agents: {', '.join(agent.name for agent in agency.agents)}"
62 | 
63 |             thread: Thread = agency.agents_and_threads.get(agency.ceo.name, {}).get(
64 |                 self.agent_name
65 |             )
66 | 
67 |         if isinstance(thread, ThreadAsync):
68 |             return await asyncio.to_thread(
69 |                 thread.get_completion_async,
70 |                 message=self.message,
71 |                 recipient_agent=recipient_agent,
72 |             )
73 |         else:
74 |             await asyncio.to_thread(
75 |                 thread.get_completion,
76 |                 message=self.message,
77 |                 recipient_agent=recipient_agent,
78 |             )
79 |             return "Message sent asynchronously. Use 'GetResponse' to check status."
80 | 
81 | 
82 | # Dynamically update the class docstring with the list of agencies and their agents
83 | SendMessageAsync.__doc__ = SendMessageAsync.__doc__.format(
84 |     agency_agents=AGENCIES_AND_AGENTS_STRING
85 | )
86 | 
87 | 
88 | if __name__ == "__main__":
89 |     tool = SendMessageAsync(
90 |         message="Write a long paragraph about the history of the internet.",
91 |         agency_name="ResearchAgency",
92 |         agent_name="BrowsingAgent",
93 |     )
94 |     print(asyncio.run(tool.run()))
95 | 
96 |     tool = SendMessageAsync(
97 |         message="Write a long paragraph about the history of the internet.",
98 |         agency_name="ResearchAgency",
99 |         agent_name=None,
100 |     )
101 |     print(asyncio.run(tool.run()))
102 | 
--------------------------------------------------------------------------------
/src/voice_assistant/tools/UpdateFile.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | 
4 | from agency_swarm.tools import BaseTool
5 | from dotenv import load_dotenv
6 | from pydantic import Field
7 | 
8 | from voice_assistant.config import SCRATCH_PAD_DIR
9 | from voice_assistant.models import FileSelectionResponse, ModelName
10 | from voice_assistant.utils.decorators import timeit_decorator
11 | from voice_assistant.utils.llm_utils import (
12 |     get_structured_output_completion,
13 |     parse_chat_completion,
14 | )
15 | 
16 | load_dotenv()
17 | 
18 | 
19 | class UpdateFile(BaseTool):
20 |     """A tool for updating the content of a file based on a prompt."""
21 | 
22 |     prompt: str = Field(
23 |         ...,
24 |         description="The prompt to identify which file to update and how to update it.",
25 |     )
26 | 
27 |     async def run(self):
28 |         result = await update_file(self.prompt)
29 |         if "model_used" in result and isinstance(result["model_used"], ModelName):
30 |             result["model_used"] = result["model_used"].value
31 |         return str(result)
32 | 
33 | 
34 | @timeit_decorator
35 | async def update_file(prompt: str) -> dict:
36 |     available_files = os.listdir(SCRATCH_PAD_DIR)
37 |     available_model_map = {model.value: model.name for model in ModelName}
38 | 
39 |     file_selection_response = await
get_structured_output_completion( 40 | create_file_selection_prompt( 41 | available_files, json.dumps(available_model_map), prompt 42 | ), 43 | FileSelectionResponse, 44 | ) 45 | 46 | if not file_selection_response.file: 47 | return {"status": "No matching file found"} 48 | 49 | selected_file = file_selection_response.file 50 | selected_model = file_selection_response.model or ModelName.BASE_MODEL 51 | file_path = os.path.join(SCRATCH_PAD_DIR, selected_file) 52 | 53 | with open(file_path, "r") as f: 54 | file_content = f.read() 55 | 56 | updated_content = await parse_chat_completion( 57 | create_file_update_prompt(selected_file, file_content, prompt), 58 | selected_model, 59 | ) 60 | 61 | with open(file_path, "w") as f: 62 | f.write(updated_content) 63 | 64 | return { 65 | "status": "File updated", 66 | "file_name": selected_file, 67 | "model_used": selected_model, 68 | } 69 | 70 | 71 | def create_file_selection_prompt(available_files, available_model_map, user_prompt): 72 | return f""" 73 | 74 | Select a file from the available files and choose the appropriate model based on the user's prompt. 75 | 76 | 77 | 78 | Based on the user's prompt and the list of available files, infer which file the user wants to update. 79 | Also, select the most appropriate model from the available models mapping. 80 | If the user does not specify a model, default to 'base_model'. 81 | If no file matches, return an empty string for 'file'. 82 | 83 | 84 | 85 | {", ".join(available_files)} 86 | 87 | 88 | 89 | {available_model_map} 90 | 91 | 92 | 93 | {user_prompt} 94 | 95 | """ 96 | 97 | 98 | def create_file_update_prompt(file_name, file_content, user_prompt): 99 | return f""" 100 | 101 | Update the content of the file based on the user's prompt. 102 | 103 | 104 | 105 | Based on the user's prompt and the file content, generate the updated content for the file. 106 | The file-name is the name of the file to update. 107 | The user's prompt describes the updates to make. 108 | Respond exclusively with the updates to the file and nothing else; they will be used to overwrite the file entirely using f.write(). 109 | Do not include any preamble or commentary or markdown formatting, just the raw updates. 110 | Be precise and accurate. 
111 | 112 | 113 | 114 | {file_name} 115 | 116 | 117 | 118 | {file_content} 119 | 120 | 121 | 122 | {user_prompt} 123 | 124 | """ 125 | 126 | 127 | if __name__ == "__main__": 128 | import asyncio 129 | 130 | tool = UpdateFile(prompt="Update the test file to include a paragraph about AI") 131 | print(asyncio.run(tool.run())) 132 | -------------------------------------------------------------------------------- /src/voice_assistant/tools/__init__.py: -------------------------------------------------------------------------------- 1 | import importlib 2 | import logging 3 | import os 4 | 5 | from agency_swarm.tools import BaseTool 6 | 7 | logger = logging.getLogger(__name__) 8 | 9 | 10 | def load_tools(): 11 | tools = [] 12 | current_dir = os.path.dirname(os.path.abspath(__file__)) 13 | for filename in os.listdir(current_dir): 14 | if filename.endswith(".py") and filename != "__init__.py": 15 | module_name = filename[:-3] 16 | module = importlib.import_module(f"voice_assistant.tools.{module_name}") 17 | for name, obj in module.__dict__.items(): 18 | if ( 19 | isinstance(obj, type) 20 | and issubclass(obj, BaseTool) 21 | and obj != BaseTool 22 | ): 23 | tools.append(obj) 24 | return tools 25 | 26 | 27 | def prepare_tool_schemas(): 28 | """Prepare the schemas for the tools.""" 29 | tool_schemas = [] 30 | for tool in TOOLS: 31 | tool_schema = {k: v for k, v in tool.openai_schema.items() if k != "strict"} 32 | tool_type = "function" if not hasattr(tool, "type") else tool.type 33 | tool_schemas.append({**tool_schema, "type": tool_type}) 34 | 35 | logger.debug("Tool Schemas:\n%s", tool_schemas) 36 | return tool_schemas 37 | 38 | 39 | # Load all tools 40 | TOOLS: list[BaseTool] = load_tools() 41 | TOOL_SCHEMAS = prepare_tool_schemas() 42 | -------------------------------------------------------------------------------- /src/voice_assistant/utils/__init__.py: -------------------------------------------------------------------------------- 1 | import base64 2 | 3 | 4 | def base64_encode_audio(audio_bytes: bytes) -> str: 5 | return base64.b64encode(audio_bytes).decode("utf-8") 6 | -------------------------------------------------------------------------------- /src/voice_assistant/utils/decorators.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import functools 3 | import time 4 | 5 | from voice_assistant.utils.log_utils import log_runtime 6 | 7 | 8 | def timeit_decorator(func): 9 | @functools.wraps(func) 10 | async def async_wrapper(*args, **kwargs): 11 | start_time = time.perf_counter() 12 | result = await func(*args, **kwargs) 13 | duration = round(time.perf_counter() - start_time, 4) 14 | if args and hasattr(args[0], "__class__"): 15 | class_name = args[0].__class__.__name__ 16 | log_runtime(f"{class_name}.{func.__name__}", duration) 17 | else: 18 | log_runtime(func.__name__, duration) 19 | return result 20 | 21 | @functools.wraps(func) 22 | def sync_wrapper(*args, **kwargs): 23 | start_time = time.perf_counter() 24 | result = func(*args, **kwargs) 25 | duration = round(time.perf_counter() - start_time, 4) 26 | if args and hasattr(args[0], "__class__"): 27 | class_name = args[0].__class__.__name__ 28 | log_runtime(f"{class_name}.{func.__name__}", duration) 29 | else: 30 | log_runtime(func.__name__, duration) 31 | return result 32 | 33 | return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper 34 | -------------------------------------------------------------------------------- 
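A minimal usage sketch of `timeit_decorator` from decorators.py above (not part of the repository): the decorator dispatches on `asyncio.iscoroutinefunction`, so the same decorator covers both sync and async callables. This assumes the package is importable and `RUN_TIME_TABLE_LOG_JSON` points at a writable path; `fetch_data` and `crunch` are hypothetical names.

```python
import asyncio

from voice_assistant.utils.decorators import timeit_decorator


@timeit_decorator
async def fetch_data() -> str:
    await asyncio.sleep(0.1)  # stand-in for real async I/O
    return "done"


@timeit_decorator
def crunch() -> int:
    return sum(range(1_000_000))


if __name__ == "__main__":
    print(asyncio.run(fetch_data()))  # logged roughly as: fetch_data() took 0.1 seconds
    print(crunch())
```
--------------------------------------------------------------------------------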
/src/voice_assistant/utils/google_services_utils.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import logging 3 | import os 4 | 5 | from dotenv import load_dotenv 6 | from google.auth.transport.requests import Request 7 | from google.oauth2.credentials import Credentials 8 | from google_auth_oauthlib.flow import InstalledAppFlow 9 | from googleapiclient.discovery import build 10 | 11 | load_dotenv() 12 | 13 | logger = logging.getLogger(__name__) 14 | 15 | 16 | class GoogleServicesUtils: 17 | """ 18 | Utility class for Gmail and Google Calendar authentication and service creation. 19 | """ 20 | 21 | SCOPES = [ 22 | "https://www.googleapis.com/auth/gmail.readonly", 23 | "https://www.googleapis.com/auth/gmail.compose", 24 | "https://www.googleapis.com/auth/calendar.readonly", 25 | ] 26 | 27 | SERVICE_API_VERSIONS = {"gmail": "v1", "calendar": "v3"} 28 | 29 | @staticmethod 30 | async def authenticate_service(service_name): 31 | """ 32 | Authenticates the user and returns a Gmail or Google Calendar service object. 33 | """ 34 | 35 | def authenticate(): 36 | creds = None 37 | token_path = "token.json" 38 | credentials_path = "credentials.json" 39 | 40 | if os.path.exists(token_path): 41 | creds = Credentials.from_authorized_user_file( 42 | token_path, GoogleServicesUtils.SCOPES 43 | ) 44 | logger.info(f"Loaded {service_name} credentials from token.json.") 45 | 46 | if not creds or not creds.valid: 47 | if creds and creds.expired and creds.refresh_token: 48 | logger.info(f"Refreshing expired {service_name} credentials.") 49 | creds.refresh(Request()) 50 | else: 51 | logger.info(f"Initiating new {service_name} authentication flow.") 52 | flow = InstalledAppFlow.from_client_secrets_file( 53 | credentials_path, GoogleServicesUtils.SCOPES 54 | ) 55 | creds = flow.run_local_server(port=8080) # Fixed port 56 | with open(token_path, "w") as token: 57 | token.write(creds.to_json()) 58 | logger.info(f"Saved new {service_name} credentials to token.json.") 59 | 60 | api_version = GoogleServicesUtils.SERVICE_API_VERSIONS.get(service_name) 61 | if api_version is None: 62 | raise ValueError(f"Unsupported service: {service_name}") 63 | 64 | return build(service_name, api_version, credentials=creds) 65 | 66 | try: 67 | service = await asyncio.to_thread(authenticate) 68 | logger.info( 69 | f"{service_name.capitalize()} service authenticated successfully." 70 | ) 71 | return service 72 | except Exception as e: 73 | logger.error(f"Failed to authenticate {service_name} service: {e}") 74 | raise e 75 | 76 | @staticmethod 77 | async def authenticate_gmail(): 78 | """ 79 | Authenticates the user and returns a Gmail service object. 80 | """ 81 | return await GoogleServicesUtils.authenticate_service("gmail") 82 | 83 | @staticmethod 84 | async def authenticate_calendar(): 85 | """ 86 | Authenticates the user and returns a Google Calendar service object. 
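        Thin wrapper around authenticate_service("calendar").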
87 | """ 88 | return await GoogleServicesUtils.authenticate_service("calendar") 89 | -------------------------------------------------------------------------------- /src/voice_assistant/utils/llm_utils.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import os 3 | 4 | import aiohttp 5 | import openai 6 | from pydantic import BaseModel 7 | 8 | from voice_assistant.models import ModelName 9 | 10 | API_KEY = os.getenv("OPENAI_API_KEY") 11 | OPENAI_CLIENT = openai.OpenAI(api_key=API_KEY) 12 | 13 | 14 | async def get_model_completion(prompt: str, model: ModelName) -> str: 15 | headers = { 16 | "Content-Type": "application/json", 17 | "Authorization": f"Bearer {API_KEY}", 18 | } 19 | 20 | payload = { 21 | "model": model.value, 22 | "messages": [ 23 | { 24 | "role": "user", 25 | "content": prompt, 26 | } 27 | ], 28 | } 29 | 30 | async with aiohttp.ClientSession() as session: 31 | async with session.post( 32 | "https://api.openai.com/v1/chat/completions", 33 | headers=headers, 34 | json=payload, 35 | ) as response: 36 | if response.status != 200: 37 | error = await response.text() 38 | raise RuntimeError(f"OpenAI API error: {error}") 39 | result = await response.json() 40 | return result["choices"][0]["message"]["content"] 41 | 42 | 43 | async def get_structured_output_completion( 44 | prompt: str, response_format: BaseModel 45 | ) -> BaseModel: 46 | completion = await asyncio.to_thread( 47 | OPENAI_CLIENT.beta.chat.completions.parse, 48 | model=ModelName.BASE_MODEL.value, 49 | messages=[{"role": "user", "content": prompt}], 50 | response_format=response_format, 51 | ) 52 | message = completion.choices[0].message 53 | if not message.parsed: 54 | raise ValueError(message.refusal) 55 | return message.parsed 56 | 57 | 58 | async def parse_chat_completion(prompt: str, model: ModelName) -> str: 59 | completion = await asyncio.to_thread( 60 | OPENAI_CLIENT.beta.chat.completions.parse, 61 | model=model.value, 62 | messages=[{"role": "user", "content": prompt}], 63 | ) 64 | return completion.choices[0].message.content 65 | -------------------------------------------------------------------------------- /src/voice_assistant/utils/log_utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | from datetime import datetime 4 | 5 | from voice_assistant.config import RUN_TIME_TABLE_LOG_JSON 6 | 7 | logger = logging.getLogger(__name__) 8 | 9 | 10 | def log_runtime(function_or_name: str, duration: float): 11 | time_record = { 12 | "timestamp": datetime.now().isoformat(), 13 | "function": function_or_name, 14 | "duration": f"{duration:.4f}", 15 | } 16 | with open(RUN_TIME_TABLE_LOG_JSON, "a") as file: 17 | json.dump(time_record, file) 18 | file.write("\n") 19 | 20 | logger.info(f"⏰ {function_or_name}() took {duration:.4f} seconds") 21 | 22 | 23 | def log_ws_event(direction: str, event: dict): 24 | event_type = event.get("type", "Unknown") 25 | event_emojis = { 26 | "session.update": "🛠️", 27 | "session.created": "🔌", 28 | "session.updated": "🔄", 29 | "input_audio_buffer.append": "🎤", 30 | "input_audio_buffer.commit": "✅", 31 | "input_audio_buffer.speech_started": "🗣️", 32 | "input_audio_buffer.speech_stopped": "🤫", 33 | "input_audio_buffer.cleared": "🧹", 34 | "input_audio_buffer.committed": "📨", 35 | "conversation.item.create": "📥", 36 | "conversation.item.delete": "🗑️", 37 | "conversation.item.truncate": "✂️", 38 | "conversation.item.created": "📤", 39 | "conversation.item.deleted": 
"🗑️", 40 | "conversation.item.truncated": "✂️", 41 | "response.create": "➡️", 42 | "response.created": "📝", 43 | "response.output_item.added": "➕", 44 | "response.output_item.done": "✅", 45 | "response.text.delta": "✍️", 46 | "response.text.done": "📝", 47 | "response.audio.delta": "🔊", 48 | "response.audio.done": "🔇", 49 | "response.done": "✔️", 50 | "response.cancel": "⛔", 51 | "response.function_call_arguments.delta": "📥", 52 | "response.function_call_arguments.done": "📥", 53 | "rate_limits.updated": "⏳", 54 | "error": "❌", 55 | "conversation.item.input_audio_transcription.completed": "📝", 56 | "conversation.item.input_audio_transcription.failed": "⚠️", 57 | } 58 | emoji = event_emojis.get(event_type, "❓") 59 | icon = "⬆️ - Out" if direction.lower() == "outgoing" else "⬇️ - In" 60 | logger.info(f"{emoji} {icon} {event_type}") 61 | -------------------------------------------------------------------------------- /src/voice_assistant/visual_interface.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import os 3 | from collections import deque 4 | 5 | import numpy as np 6 | import pygame 7 | 8 | 9 | class VisualInterface: 10 | def __init__(self, width=400, height=400): 11 | pygame.init() 12 | self.width = width 13 | self.height = height 14 | self.screen = pygame.display.set_mode((width, height)) 15 | pygame.display.set_caption("Assistant Voice Activity") 16 | 17 | # Set the app icon 18 | icon_path = os.path.join(os.path.dirname(__file__), "icon.png") 19 | icon = pygame.image.load(icon_path) 20 | pygame.display.set_icon(icon) 21 | 22 | self.clock = pygame.time.Clock() 23 | self.is_active = False 24 | self.is_assistant_speaking = False 25 | self.active_color = (50, 139, 246) # Sky Blue 26 | self.inactive_color = (100, 100, 100) # Gray 27 | self.current_color = self.inactive_color 28 | self.base_radius = 100 29 | self.current_radius = self.base_radius 30 | self.energy_queue = deque(maxlen=50) # Store last 50 energy values 31 | self.update_interval = 0.05 # Update every 50ms 32 | self.max_energy = 1.0 # Initial max energy value 33 | 34 | async def update(self): 35 | for event in pygame.event.get(): 36 | if event.type == pygame.QUIT: 37 | pygame.quit() 38 | return False 39 | 40 | self.screen.fill((0, 0, 0)) # Black background 41 | 42 | # Smooth transition for radius 43 | target_radius = self.base_radius 44 | if self.energy_queue: 45 | normalized_energy = np.mean(self.energy_queue) / ( 46 | self.max_energy or 1.0 47 | ) # Avoid division by zero 48 | target_radius += int(normalized_energy * self.base_radius) 49 | 50 | self.current_radius += (target_radius - self.current_radius) * 0.2 51 | self.current_radius = min( 52 | max(self.current_radius, self.base_radius), self.width // 2 53 | ) 54 | 55 | # Smooth transition for color 56 | target_color = ( 57 | self.active_color 58 | if self.is_active or self.is_assistant_speaking 59 | else self.inactive_color 60 | ) 61 | self.current_color = tuple( 62 | int(self.current_color[i] + (target_color[i] - self.current_color[i]) * 0.1) 63 | for i in range(3) 64 | ) 65 | 66 | pygame.draw.circle( 67 | self.screen, 68 | self.current_color, 69 | (self.width // 2, self.height // 2), 70 | int(self.current_radius), 71 | ) 72 | 73 | pygame.display.flip() 74 | self.clock.tick(60) 75 | await asyncio.sleep(self.update_interval) 76 | return True 77 | 78 | def set_active(self, is_active): 79 | self.is_active = is_active 80 | 81 | def set_assistant_speaking(self, is_speaking): 82 | self.is_assistant_speaking = 
--------------------------------------------------------------------------------
/src/voice_assistant/visual_interface.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import os
3 | from collections import deque
4 | 
5 | import numpy as np
6 | import pygame
7 | 
8 | 
9 | class VisualInterface:
10 |     """Pygame window that visualizes voice activity as a pulsing circle."""
11 | 
12 |     def __init__(self, width=400, height=400):
13 |         pygame.init()
14 |         self.width = width
15 |         self.height = height
16 |         self.screen = pygame.display.set_mode((width, height))
17 |         pygame.display.set_caption("Assistant Voice Activity")
18 | 
19 |         # Set the app icon
20 |         icon_path = os.path.join(os.path.dirname(__file__), "icon.png")
21 |         icon = pygame.image.load(icon_path)
22 |         pygame.display.set_icon(icon)
23 | 
24 |         self.clock = pygame.time.Clock()
25 |         self.is_active = False
26 |         self.is_assistant_speaking = False
27 |         self.active_color = (50, 139, 246)  # Sky blue
28 |         self.inactive_color = (100, 100, 100)  # Gray
29 |         self.current_color = self.inactive_color
30 |         self.base_radius = 100
31 |         self.current_radius = self.base_radius
32 |         self.energy_queue = deque(maxlen=50)  # Store last 50 energy values
33 |         self.update_interval = 0.05  # Update every 50 ms
34 |         self.max_energy = 1.0  # Initial max energy value
35 | 
36 |     async def update(self):
37 |         """Draw one frame; return False once the window has been closed."""
38 |         for event in pygame.event.get():
39 |             if event.type == pygame.QUIT:
40 |                 pygame.quit()
41 |                 return False
42 | 
43 |         self.screen.fill((0, 0, 0))  # Black background
44 | 
45 |         # Smooth transition for radius
46 |         target_radius = self.base_radius
47 |         if self.energy_queue:
48 |             normalized_energy = np.mean(self.energy_queue) / (
49 |                 self.max_energy or 1.0
50 |             )  # Avoid division by zero
51 |             target_radius += int(normalized_energy * self.base_radius)
52 | 
53 |         self.current_radius += (target_radius - self.current_radius) * 0.2
54 |         self.current_radius = min(
55 |             max(self.current_radius, self.base_radius), self.width // 2
56 |         )
57 | 
58 |         # Smooth transition for color
59 |         target_color = (
60 |             self.active_color
61 |             if self.is_active or self.is_assistant_speaking
62 |             else self.inactive_color
63 |         )
64 |         self.current_color = tuple(
65 |             int(self.current_color[i] + (target_color[i] - self.current_color[i]) * 0.1)
66 |             for i in range(3)
67 |         )
68 | 
69 |         pygame.draw.circle(
70 |             self.screen,
71 |             self.current_color,
72 |             (self.width // 2, self.height // 2),
73 |             int(self.current_radius),
74 |         )
75 | 
76 |         pygame.display.flip()
77 |         self.clock.tick(60)
78 |         await asyncio.sleep(self.update_interval)
79 |         return True
80 | 
81 |     def set_active(self, is_active):
82 |         self.is_active = is_active
83 | 
84 |     def set_assistant_speaking(self, is_speaking):
85 |         self.is_assistant_speaking = is_speaking
86 | 
87 |     def update_energy(self, energy):
88 |         if isinstance(energy, np.ndarray):
89 |             energy = np.mean(np.abs(energy))
90 |         self.energy_queue.append(energy)
91 | 
92 |         # Update max_energy dynamically
93 |         current_max = max(self.energy_queue)
94 |         if current_max > self.max_energy:
95 |             self.max_energy = current_max
96 |         elif len(self.energy_queue) == self.energy_queue.maxlen:
97 |             self.max_energy = max(self.energy_queue)
98 | 
99 |     def process_audio_data(self, audio_data: bytes):
100 |         """Process and update audio energy for visualization."""
101 |         audio_frame = np.frombuffer(audio_data, dtype=np.int16)
102 |         energy = np.abs(audio_frame).mean()
103 |         self.update_energy(energy)
104 | 
105 | 
106 | async def run_visual_interface(interface):
107 |     while True:
108 |         if not await interface.update():
109 |             break
110 | 
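111 | 
112 | if __name__ == "__main__":
113 |     # Standalone preview: drives the circle with synthetic energy values so
114 |     # the widget can be checked without a microphone or a live assistant.
115 |     # The random energies are illustrative only; close the window to exit.
116 |     import random
117 | 
118 |     async def _demo() -> None:
119 |         interface = VisualInterface()
120 |         interface.set_active(True)
121 |         while await interface.update():
122 |             interface.update_energy(random.random())
123 | 
124 |     asyncio.run(_demo())
125 | 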
--------------------------------------------------------------------------------
/src/voice_assistant/websocket_handler.py:
--------------------------------------------------------------------------------
1 | import base64
2 | import json
3 | import logging
4 | import time
5 | 
6 | import websockets
7 | 
8 | from voice_assistant.audio import audio_player
9 | from voice_assistant.tools import TOOLS
10 | from voice_assistant.utils.log_utils import log_runtime, log_ws_event
11 | 
12 | logger = logging.getLogger(__name__)
13 | 
14 | 
15 | async def process_ws_messages(websocket, mic, visual_interface):
16 |     """Consume Realtime API events, dispatch tool calls, and drive audio/visual state."""
17 |     assistant_reply = ""
18 |     function_call = None
19 |     function_call_args = ""
20 |     response_start_time = None
21 | 
22 |     while True:
23 |         try:
24 |             message = await websocket.recv()
25 |             event = json.loads(message)
26 |             log_ws_event("incoming", event)
27 | 
28 |             event_type = event.get("type")
29 | 
30 |             if event_type == "response.created":
31 |                 mic.start_receiving()
32 |                 visual_interface.set_active(True)
33 |             elif event_type == "response.output_item.added":
34 |                 item = event.get("item", {})
35 |                 if item.get("type") == "function_call":
36 |                     function_call = item
37 |                     function_call_args = ""
38 |             elif event_type == "response.function_call_arguments.delta":
39 |                 function_call_args += event.get("delta", "")
40 |             elif event_type == "response.function_call_arguments.done":
41 |                 if function_call:
42 |                     function_name = function_call.get("name")
43 |                     call_id = function_call.get("call_id")
44 |                     try:
45 |                         args = (
46 |                             json.loads(function_call_args) if function_call_args else {}
47 |                         )
48 |                     except json.JSONDecodeError:
49 |                         logger.error(
50 |                             f"Failed to parse function arguments: {function_call_args}"
51 |                         )
52 |                         args = {}
53 | 
54 |                     tool = next(
55 |                         (
56 |                             t
57 |                             for t in TOOLS
58 |                             if t.__name__.lower() == function_name.lower()
59 |                         ),
60 |                         None,
61 |                     )
62 |                     if tool:
63 |                         logger.info(
64 |                             f"🛠️ Calling function: {function_name} with args: {args}"
65 |                         )
66 |                         try:
67 |                             tool_instance = tool(**args)
68 |                             result = await tool_instance.run()
69 |                             logger.info(
70 |                                 f"🛠️ Function {function_name} call result: {result}"
71 |                             )
72 |                         except Exception as e:
73 |                             logger.error(
74 |                                 f"Error calling function {function_name}: {str(e)}"
75 |                             )
76 |                             result = {
77 |                                 "error": f"Function '{function_name}' failed: {str(e)}"
78 |                             }
79 |                     else:
80 |                         logger.warning(f"Function '{function_name}' not found in TOOLS")
81 |                         result = {"error": f"Function '{function_name}' not found."}
82 | 
83 |                     function_call_output = {
84 |                         "type": "conversation.item.create",
85 |                         "item": {
86 |                             "type": "function_call_output",
87 |                             "call_id": call_id,
88 |                             "output": json.dumps(result),
89 |                         },
90 |                     }
91 |                     log_ws_event("outgoing", function_call_output)
92 |                     await websocket.send(json.dumps(function_call_output))
93 |                     await websocket.send(json.dumps({"type": "response.create"}))
94 |                     function_call = None
95 |                     function_call_args = ""
96 |             elif event_type == "response.text.delta":
97 |                 delta = event.get("delta", "")
98 |                 if not assistant_reply:
99 |                     # Print the "Assistant: " prefix once per reply, not on every delta
100 |                     print("Assistant: ", end="", flush=True)
101 |                 assistant_reply += delta
102 |                 print(delta, end="", flush=True)
103 |             elif event_type == "response.audio.delta":
104 |                 audio_chunk = base64.b64decode(event["delta"])
105 |                 await audio_player.play_audio_chunk(audio_chunk, visual_interface)
106 |             elif event_type == "response.done":
107 |                 if response_start_time is not None:
108 |                     response_duration = time.perf_counter() - response_start_time
109 |                     log_runtime("realtime_api_response", response_duration)
110 |                     response_start_time = None
111 | 
112 |                 logger.info("Assistant response complete.")
113 |                 await audio_player.stop_playback(visual_interface)
114 |                 assistant_reply = ""
115 |                 logger.info("Calling stop_receiving()")
116 |                 mic.stop_receiving()
117 |                 visual_interface.set_active(False)
118 |                 mic.start_recording()
119 |                 logger.info("Started recording for next user input")
120 |             elif event_type == "rate_limits.updated":
121 |                 mic.start_recording()
122 |                 logger.info("Resumed recording after rate_limits.updated")
123 |             elif event_type == "error":
124 |                 error_message = event.get("error", {}).get("message", "")
125 |                 if "buffer is empty" in error_message:
126 |                     logger.info("Received 'buffer is empty' error, no audio data sent.")
127 |                     continue
128 |                 elif "Conversation already has an active response" in error_message:
129 |                     logger.info(
130 |                         "Received 'active response' error, adjusting response flow."
131 |                     )
132 |                     continue
133 |                 else:
134 |                     logger.error(f"Unhandled error: {error_message}")
135 |                     break
136 |             elif event_type == "input_audio_buffer.speech_started":
137 |                 logger.info("Speech detected, listening...")
138 |                 visual_interface.set_active(True)
139 |             elif event_type == "input_audio_buffer.speech_stopped":
140 |                 mic.stop_recording()
141 |                 logger.info("Speech ended, processing...")
142 |                 visual_interface.set_active(False)
143 | 
144 |                 response_start_time = time.perf_counter()
145 |         except websockets.ConnectionClosed:
146 |             logger.warning("WebSocket connection closed")
147 |             break
148 | 
149 |     audio_player.close()
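150 | 
151 | 
152 | if __name__ == "__main__":
153 |     # Offline sketch of the tool-dispatch step above: given a function name
154 |     # and a JSON argument string from a function_call event, find the matching
155 |     # class in TOOLS, instantiate it, and await its run() method. The
156 |     # "GetCurrentDateTime" tool from this repo is assumed to take no arguments.
157 |     import asyncio
158 | 
159 |     async def _demo_dispatch(function_name: str, function_call_args: str):
160 |         args = json.loads(function_call_args) if function_call_args else {}
161 |         tool = next(
162 |             (t for t in TOOLS if t.__name__.lower() == function_name.lower()),
163 |             None,
164 |         )
165 |         if tool is None:
166 |             return {"error": f"Function '{function_name}' not found."}
167 |         return await tool(**args).run()
168 | 
169 |     print(asyncio.run(_demo_dispatch("GetCurrentDateTime", "")))
170 | 
--------------------------------------------------------------------------------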