├── .gitattributes ├── LICENSE ├── README.md ├── ai_prompts.py ├── audio_player.py ├── eleven_labs.py ├── multi_agent_gpt.py ├── obs_websockets.py ├── openai_chat.py ├── requirements.txt ├── static ├── css │ └── style.css └── js │ └── multiAgent.js ├── templates └── index.html ├── websockets_auth.py └── whisper_openai.py /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 DougDougGithub 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Multi Agent GPT Characters 2 | Web app that allows 3 GPT characters and a human to talk to each other. 3 | Written by DougDoug. Feel free to use this for whatever you want! Credit is appreciated but not required. 4 | 5 | This is uploaded for educational purposes. Unfortunately I don't have time to offer individual support or review pull requests, but ChatGPT or Claude can be very helpful if you are running into issues. 6 | 7 | ## SETUP: 8 | 1) This was written in Python 3.9.2. Install page here: https://www.python.org/downloads/release/python-392/ 9 | 10 | 2) Run `pip install -r requirements.txt` to install all modules. 11 | 12 | 3) This uses the OpenAi API and Elevenlabs services. You'll need to set up an account with these services and generate an API key from them. Then add these keys as windows environment variables named OPENAI_API_KEY and ELEVENLABS_API_KEY respectively. 13 | 14 | 4) This app uses the GPT-4o model from OpenAi. As of this writing (Sep 3rd 2024), you need to pay $5 to OpenAi in order to get access to the GPT-4o model API. So after setting up your account with OpenAi, you will need to pay for at least $5 in credits so that your account is given the permission to use the GPT-4o model when running my app. See here: https://help.openai.com/en/articles/7102672-how-can-i-access-gpt-4-gpt-4-turbo-gpt-4o-and-gpt-4o-mini 15 | 16 | 5) Elevenlabs is the service I use for Ai voices. 
Once you've made Ai voices on the Elevenlabs website, open up multi_agent_gpt.py and make sure it's passing the name of your voices into each agent's init function. 17 | 18 | 6) This app uses the open source Whisper model from OpenAi for transcribing audio into text. This means you'll be running an Ai model locally on your PC, so ideally you have an Nvidia GPU to run this. The Whisper model is used to transcribe the user's microphone recordings, and is used to generate subtitles from the Elevenlabs audio every time an agent "speaks". This model was downloaded from Huggingface and should install automatically when you run the whisper_openai.py file. 19 | Note that you'll want to make sure you've installed torch with CUDA support, rather than just default torch, otherwise it will run very slow: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118. 20 | If you have issues with the Whisper model there are other services that can offer an audio-to-text service (including a Whisper API), but this solution currently works well for me. 21 | 22 | 7) This code runs a Flask web app and will display the agents' dialogue using HTML and javascript. By default it will run the server on "127.0.0.1:5151", but you can change this in multi_agent_gpt.py. 23 | 24 | 8) Optionally, you can use OBS Websockets and an OBS plugin to make images move while talking. 25 | First open up OBS. Make sure you're running version 28.X or later. Click Tools, then WebSocket Server Settings. Make sure "Enable WebSocket server" is checked. Then set Server Port to '4455' and set the Server Password to 'TwitchChat9'. If you use a different Server Port or Server Password in your OBS, just make sure you update the websockets_auth.py file accordingly. 26 | Next install the Move OBS plugin: https://obsproject.com/forum/resources/move.913/ Now you can use this plugin to add a filter to an audio source that will change an image's transform based on the audio waveform. For example, I have a filter on a specific audio track that will move each agent's bell pepper icon source image whenever that pepper is talking. 27 | Note that OBS must be open when you're running this code, otherwise OBS WebSockets won't be able to connect. If you don't need the images to move while talking, you can just delete the OBS portions of the code. 28 | 29 | ## Using the App 30 | 31 | To start out, edit the ai_prompts.py file to design each agent's personality and the purpose of their conversation. 32 | By default the characters are told to discuss the greatest videogames of all time, but you can change this to anything you want, OpenAi is pretty great at having agents talk about pretty much anything. 33 | 34 | Next run multi_agent_gpt.py 35 | 36 | Once it's running you now have a number of options: 37 | 38 | __Press Numpad7 to "talk" to the agents.__ 39 | Numpad7 will start recording your microphone audio. Hit Numpad8 to stop recording. It will then transcribe your audio into text and add your dialogue into all 3 agents' chat history. Then it will pick a random agent to "activate" and have them start talking next. 40 | 41 | __Numpad1 will "activate" Agent #1.__ 42 | This means that agent will continue the conversation and start talking. Unless it has been "paused", it will also pick a random other agent and "activate" them to talk next, so that the conversation continues indefinitely. 
43 | 44 | __Numpad2 will "activate" Agent #2, Numpad3 will "activate" Agent #3.__ 45 | 46 | __F4 will "pause" all agents__ 47 | This stops the agents from activating each other. Basically, use this to stop the conversation from continuing any further, and then you can talk to the agents again. 48 | 49 | ## Miscellaneous notes: 50 | 51 | All agents will automatically store their "chat history" into a backup txt file as the conversation continues. This is done so that when you restart the program, each agent will automatically load from their backup file and thus restore the entire conversation, letting you continue it from where you left off. If you ever want to fully reset the conversation then just delete the backup txt files in the project. 52 | 53 | If you want to have the agent dialogue displayed in OBS, you should add a browser source and set the URL to "127.0.0.1:5151". 54 | -------------------------------------------------------------------------------- /ai_prompts.py: -------------------------------------------------------------------------------- 1 | VIDEOGAME_SYSTEM_INTRO = ''' 2 | This is a conversation with 3 other people where you are talking about the best videogames of all time. You will be playing a character where you are trying to engage in the most spirited and entertaining possible conversation about the greatest videogames of all time. 3 | ''' 4 | 5 | VIDEOGAME_SYSTEM_OUTRO = ''' 6 | 7 | Once the conversation starts, your goal is to have a discussion covering the best videogames of all time. 8 | 9 | Please use the following rules when giving a response: 10 | 1) Under no circumstances may you break character. 11 | 2) Always keep your answers short, just 4 sentences max. 12 | 13 | Messages that you receive from the other 3 people in the conversation will always begin with their title, to help you distinguish who has said what. For example a message from Victoria will begin with "[VICTORIA]", while a message from Tony will begin with [TONY]. You should NOT begin your message with this, just answer normally. 14 | 15 | Okay, let the story begin! 16 | ''' 17 | 18 | # Agent 1: The Enthusiastic Adventurer 19 | VIDEOGAME_AGENT_1 = {"role": "system", "content": f''' 20 | {VIDEOGAME_SYSTEM_INTRO} 21 | In this conversation, your character is the Enthusiastic Adventurer. Your role is to be intensely curious and exuberantly passionate about every topic that comes up. You have an insatiable appetite for knowledge and are always seeking new experiences and discoveries. You love to share fascinating, sometimes bizarre, facts and stories with great excitement. Your enthusiasm knows no bounds, and you frequently express amazement and wonder at even the smallest details. When interacting with others, ask many questions, dive into the most obscure aspects of the topic, and encourage everyone to explore with you. You should not be afraid to completely change the conversation topic to fit whatever interests you, instead of what everyone else is talking about. 22 | 23 | Traits and Behaviors: 24 | 25 | Occasionally make up stories about your own life and your personal experiences with various videogames. 26 | Always expresses excitement and wonder. 27 | Frequently uses exclamations with a ton of emotion. 28 | You get frustrated if other people don't seem to be as excited about something as you are. You can barely contain yourself when this happens. 29 | You are HORRIFIED if anyone swears. 
This is a family-friendly conversation and you should aggressively scold anyone who swears and ruins the sanctity of this conversation. 30 | Constantly seek to uncover obscure or lesser-known facts. 31 | Asks many probing and unexpected questions. 32 | Encourages others to be just as excited and curious. 33 | {VIDEOGAME_SYSTEM_OUTRO} 34 | '''} 35 | 36 | # Agent 2: The Cynical Jester 37 | VIDEOGAME_AGENT_2 = {"role": "system", "content": f''' 38 | {VIDEOGAME_SYSTEM_INTRO} 39 | In this conversation, your character is the Cynical Jester. Your role is to provide sharp, witty, and often sarcastic commentary on the topics being discussed. You have a keen eye for the absurd and enjoy pointing out the ridiculousness in any situation. Your humor can be biting and dark, but it's always aimed at making the conversation more entertaining. You thrive on irony and love to make clever, sometimes controversial, observations. Challenge the status quo, poke fun at lofty ideas, and keep everyone on their toes with your unpredictable humor. 40 | 41 | Traits and Behaviors: 42 | 43 | Occasionally make up elaborate stories about your own life as a stereotypical Italian American in New York and your personal experiences with various videogames. 44 | Frequently uses sarcasm and irony. 45 | Makes dark or controversial jokes. 46 | Points out the absurdity in serious topics. 47 | Challenges and mocks the status quo. 48 | Keeps the conversation lively with unexpected humor. 49 | Frequently attack the opinions of other people in the conversations. You should frequently focus on their personality instead of the substance of what they're saying. 50 | Swear a LOT. You are a hard-hitting smack-talking guy from da streets o' New York, make sure people know it. 51 | You try to order Italian food sometimes. 52 | You think you're ALWAYS right and respond to any criticism or disagreement with extreme emotion and anger and swearing. 53 | Makes up random facts about videogames and always attributes it to some cousin of yours in New York. 54 | You get upset if anyone calls you just "Tony" instead of your full proper name, make sure they call you Tony King of New York. 55 | {VIDEOGAME_SYSTEM_OUTRO} 56 | '''} 57 | 58 | # Agent 3: The Intense Philosopher 59 | VIDEOGAME_AGENT_3 = {"role": "system", "content": f''' 60 | {VIDEOGAME_SYSTEM_INTRO} 61 | In this conversation, your character is the Intense Philosopher. Your role is to delve deeply into every topic with a fervent passion for understanding the truth. You approach every discussion with a critical and analytical mind, often questioning the fundamental assumptions behind ideas. You are unafraid to confront difficult or uncomfortable truths and enjoy pushing conversations into profound and existential territory. Your intensity can be overwhelming, and you often challenge others to think deeply and reconsider their viewpoints. Your goal is to provoke thought and inspire deep reflection, even if it leads to heated debates. 62 | However, you frequently change the topic and tie your philosophical points to a variety of different games, rather than just focusing on one game at a time. 63 | 64 | Traits and Behaviors: 65 | 66 | Occasionally make up elaborate stories about your own life as a posh British woman and your personal experiences with various videogames. 67 | You have EXTREMELY critical opinions about videogames and can barely restrain yourself from ranting about the various problems you have with different videogames. 
68 | You occasionally change topics entirely because you're so angry about some videogame experience you had last week and you want to complain about it to the rest of the group. 69 | Frequently questions fundamental assumptions. 70 | Pushes conversations into profound and existential territory. 71 | Confronts difficult or uncomfortable truths. 72 | Challenges others to think deeply and reconsider viewpoints. 73 | Uses intense, sometimes overwhelming, arguments. 74 | You are gay and like Semigreatkate. 75 | You get REALLY upset if anyone questions or undermines your arguments. 76 | {VIDEOGAME_SYSTEM_OUTRO} 77 | '''} 78 | -------------------------------------------------------------------------------- /audio_player.py: -------------------------------------------------------------------------------- 1 | import pygame 2 | import time 3 | import os 4 | import asyncio 5 | import subprocess 6 | import threading 7 | import keyboard 8 | import wave 9 | import pyaudio 10 | import soundfile as sf 11 | from mutagen.mp3 import MP3 12 | from pydub import AudioSegment 13 | from rich import print 14 | 15 | class AudioManager: 16 | 17 | # Variables for recording audio from mic 18 | is_recording = False 19 | audio_frames = [] 20 | audio_format = pyaudio.paInt16 21 | channels = 2 22 | rate = 44100 23 | chunk = 1024 24 | 25 | def __init__(self): 26 | # Use higher frequency to prevent audio glitching noises 27 | # Use higher buffer because why not (default is 512) 28 | pygame.mixer.init(frequency=48000, buffer=1024) 29 | 30 | def play_audio(self, file_path, sleep_during_playback=True, delete_file=False, play_using_music=True): 31 | """ 32 | Parameters: 33 | file_path (str): path to the audio file 34 | sleep_during_playback (bool): means program will wait for length of audio file before returning 35 | delete_file (bool): means file is deleted after playback (note that this shouldn't be used for multithreaded function calls) 36 | play_using_music (bool): means it will use Pygame Music, if false then uses pygame Sound instead 37 | """ 38 | if not pygame.mixer.get_init(): # Reinitialize mixer if needed 39 | pygame.mixer.init(frequency=48000, buffer=1024) 40 | if play_using_music: 41 | # Pygame Music can only play one file at a time 42 | try: 43 | pygame.mixer.music.load(file_path) 44 | pygame.mixer.music.play() 45 | converted = False 46 | except: 47 | # Wav files from Elevenlabs don't work with Pygame's Music for some fucking reason (works fine with Sound) 48 | # If there's an error here that's likely why, so convert it to a format that Pygame can handle 49 | # You can't convert the file in place so just convert it into a temp file that you delete later 50 | converted_wav = "temp_convert.wav" 51 | subprocess.run(["ffmpeg", "-y", "-i", file_path, "-ar", "48000", "-ac", "2", "-c:a", "pcm_s16le", converted_wav]) 52 | converted = True 53 | pygame.mixer.music.load(converted_wav) 54 | pygame.mixer.music.play() 55 | else: 56 | # Pygame Sound lets you play multiple sounds simultaneously 57 | pygame_sound = pygame.mixer.Sound(file_path) 58 | pygame_sound.play() 59 | 60 | if sleep_during_playback: 61 | # Sleep until file is done playing 62 | file_length = self.get_audio_length(file_path) 63 | time.sleep(file_length) 64 | # Delete the file 65 | if delete_file: 66 | # Stop Pygame so file can be deleted 67 | # Note: this will stop the audio on other threads as well, so it's not good if you're playing multiple sounds at once 68 | pygame.mixer.music.stop() 69 | pygame.mixer.quit() 70 | try: 71 | os.remove(file_path) 72 | if 
converted: 73 | os.remove(converted_wav) # Remove the converted wav if it was created 74 | except PermissionError: 75 | print(f"Couldn't remove {file_path} because it is being used by another process.") 76 | 77 | async def play_audio_async(self, file_path): 78 | """ 79 | Parameters: 80 | file_path (str): path to the audio file 81 | """ 82 | if not pygame.mixer.get_init(): # Reinitialize mixer if needed 83 | pygame.mixer.init(frequency=48000, buffer=1024) 84 | pygame_sound = pygame.mixer.Sound(file_path) 85 | pygame_sound.play() 86 | 87 | # Sleep for the duration of the audio. 88 | # Must use asyncio.sleep() because time.sleep() will block the thread, even if it's in an async function 89 | file_length = self.get_audio_length(file_path) 90 | await asyncio.sleep(file_length) 91 | 92 | def get_audio_length(self, file_path): 93 | # Calculate length of the file based on the file format 94 | _, ext = os.path.splitext(file_path) # Get the extension of this file 95 | if ext.lower() == '.wav': 96 | wav_file = sf.SoundFile(file_path) 97 | file_length = wav_file.frames / wav_file.samplerate 98 | wav_file.close() 99 | elif ext.lower() == '.mp3': 100 | mp3_file = MP3(file_path) 101 | file_length = mp3_file.info.length 102 | else: 103 | print("Unknown audio file type. Returning 0 as file length") 104 | file_length = 0 105 | return file_length 106 | 107 | def combine_audio_files(self, input_files): 108 | # input_files is an array of file paths 109 | output_file = os.path.join(os.path.abspath(os.curdir), f"___Msg{str(hash(' '.join(input_files)))}.wav") 110 | combined = None 111 | for file in input_files: 112 | audio = AudioSegment.from_file(file) 113 | if combined is None: 114 | combined = audio 115 | else: 116 | combined += audio 117 | if combined: 118 | combined.export(output_file, format=os.path.splitext(output_file)[1][1:]) 119 | print(f"Combined file saved as: {output_file}") 120 | else: 121 | print("No files to combine.") 122 | return output_file 123 | 124 | def start_recording(self, stream): 125 | self.audio_frames = [] 126 | while self.is_recording: 127 | data = stream.read(self.chunk) 128 | self.audio_frames.append(data) 129 | print("[red]DONE RECORDING!") 130 | 131 | def record_audio(self, end_recording_key='=', audio_device=None): 132 | # Records audio from an audio input device. 133 | # Example device names are "Line In (Realtek(R) Audio)", "Sample (TC-Helicon GoXLR)", or just leave empty to use default mic 134 | # For some reason this doesn't work on the Broadcast GoXLR Mix, the other 3 GoXLR audio inputs all work fine. 135 | # Both Azure Speech-to-Text AND this script have issues listening to Broadcast Stream Mix, so just ignore it. 136 | audio = pyaudio.PyAudio() 137 | 138 | if audio_device is None: 139 | # If no audio_device is provided, use the default mic 140 | audio_stream = audio.open(format=self.audio_format, channels=self.channels, rate=self.rate, input=True, frames_per_buffer=self.chunk) 141 | else: 142 | # If an audio device was provided, find its index 143 | device_index = None 144 | for i in range(audio.get_device_count()): 145 | dev_info = audio.get_device_info_by_index(i) 146 | # print(dev_info['name']) 147 | if audio_device in dev_info['name']: 148 | device_index = i 149 | # Some audio devices only support specific sample rates, so make sure to find a sample rate that's compatible with the device 150 | # This was necessary on certain GoXLR input but only sometimes. But this fixes the issues so w/e. 
151 | supported_rates = [96000, 48000, 44100, 32000, 22050, 16000, 11025, 8000] 152 | for rate in supported_rates: 153 | try: 154 | if audio.is_format_supported(rate, input_device=device_index, input_channels=self.channels, input_format=self.audio_format): 155 | self.rate = rate 156 | break 157 | except ValueError: 158 | continue 159 | if device_index is None: 160 | raise ValueError(f"Device '{audio_device}' not found") 161 | if self.rate is None: 162 | raise ValueError(f"No supported sample rate found for device '{audio_device}'") 163 | audio_stream = audio.open(format=self.audio_format, channels=self.channels, rate=self.rate, input=True, input_device_index=device_index, frames_per_buffer=self.chunk) 164 | 165 | # Start recording an a second thread 166 | self.is_recording = True 167 | threading.Thread(target=self.start_recording, args=(audio_stream,)).start() 168 | 169 | # Wait until end key is pressed 170 | while True: 171 | if keyboard.is_pressed(end_recording_key): 172 | break 173 | time.sleep(0.05) # Add this to reduce CPU usage 174 | 175 | self.is_recording = False 176 | time.sleep(0.1) # Just for safety, no clue if this is needed 177 | 178 | filename = f"mic_recording_{int(time.time())}.wav" 179 | wave_file = wave.open(filename, 'wb') 180 | wave_file.setnchannels(self.channels) 181 | wave_file.setsampwidth(audio.get_sample_size(self.audio_format)) 182 | wave_file.setframerate(self.rate) 183 | wave_file.writeframes(b''.join(self.audio_frames)) 184 | wave_file.close() 185 | 186 | # Close the stream and PyAudio 187 | audio_stream.stop_stream() 188 | audio_stream.close() 189 | audio.terminate() 190 | 191 | return filename 192 | 193 | -------------------------------------------------------------------------------- /eleven_labs.py: -------------------------------------------------------------------------------- 1 | from elevenlabs.client import ElevenLabs 2 | from elevenlabs import play, stream, save, Voice, VoiceSettings 3 | import time 4 | import os 5 | 6 | class ElevenLabsManager: 7 | 8 | def __init__(self): 9 | self.client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY')) # Defaults to ELEVEN_API_KEY) 10 | self.voices = self.client.voices.get_all().voices 11 | # Create a map of Names->IDs, so that we can easily grab a voice's ID later on 12 | self.voice_to_id = {} 13 | for voice in self.voices: 14 | self.voice_to_id[voice.name] = voice.voice_id 15 | self.voice_to_settings = {} 16 | 17 | # Convert text to speech, then save it to file. Returns the file path. 18 | # Current model options (that I would use) are eleven_monolingual_v1 or eleven_turbo_v2 19 | # eleven_turbo_v2 takes about 60% of the time that eleven_monolingual_v1 takes 20 | # However eleven_monolingual_v1 seems to produce more variety and emphasis, whereas turbo feels more monotone. 
Turbo still sounds good, just a little less interesting 21 | def text_to_audio(self, input_text, voice="Doug VO Only", save_as_wave=True, subdirectory="", model_id="eleven_monolingual_v1"): 22 | # Currently seems to be a problem with the API where it uses default voice settings, rather than pulling the proper settings from the website 23 | # Workaround is to get the voice settings for each voice the first time it's used, then pass those settings in manually 24 | if voice not in self.voice_to_settings: 25 | self.voice_to_settings[voice] = self.client.voices.get_settings(self.voice_to_id[voice]) 26 | voice_settings = self.voice_to_settings[voice] 27 | audio_saved = self.client.generate(text=input_text, voice=Voice(voice_id=self.voice_to_id[voice], settings=voice_settings), model=model_id,) 28 | if save_as_wave: 29 | file_name = f"___Msg{str(hash(input_text))}{time.time()}_{model_id}.wav" 30 | else: 31 | file_name = f"___Msg{str(hash(input_text))}{time.time()}_{model_id}.mp3" 32 | tts_file = os.path.join(os.path.abspath(os.curdir), subdirectory, file_name) 33 | save(audio_saved,tts_file) 34 | return tts_file -------------------------------------------------------------------------------- /multi_agent_gpt.py: -------------------------------------------------------------------------------- 1 | # This code runs a thread that manages the frontend code, a thread that listens for keyboard presses from the human, and then threads for the 3 agents 2 | # Once running, the human can activate a single agent and then let the agents continue an ongoing conversation. 3 | # Each thread has the following core logic: 4 | 5 | # Main Thread 6 | # Runs the web app 7 | 8 | # Agent X 9 | # Waits to be activated 10 | # Once it is activated (by Doug or by another agent): 11 | # Acquire conversation lock 12 | # Get response from OpenAI 13 | # Add this new response to all other agents' chat histories 14 | # Creates TTS with ElevenLabs 15 | # Acquire speaking lock (so only 1 speaks at a time) 16 | # Pick another thread randomly, activate them 17 | # Because this happens within the speaking lock, we are guaranteed that the other agents are inactive when this called. 18 | # But, we start this now so that the next speaker can have their answer and audio ready to go the instant this agent is done talking. 
19 | # Update client and OBS to display stuff 20 | # Play the TTS audio 21 | # Release speaking lock (Other threads can now talk) 22 | 23 | # Human Input Thread 24 | # Listens for keypresses: 25 | 26 | # If F7 is pressed: 27 | # Toggles "pause" flag - stops other agents from activating additional agents 28 | 29 | # Record mic audio (until you press F8) 30 | 31 | # Get convo lock (but not speaking lock) 32 | # In theory, wait until everyone is done speaking, and because the agents are "paused" then no new ones will add to the convo 33 | # But to be safe, grab the convo lock to ensure that all agents HAVE to wait until my response is added into the convo history 34 | 35 | # Transcribe mic audio into text with Whisper 36 | # Add Doug's response into all agents' chat history 37 | 38 | # Release the convo lock 39 | # (then optionally press a key to trigger a specific bot) 40 | 41 | # If F4 pressed: 42 | # Toggles "pause" flag - stops all other agents from activating additional agents 43 | 44 | # If 1 pressed: 45 | # Turns off "pause" flag 46 | # Activates Agent 1 47 | 48 | # If 2 pressed: 49 | # Turns off "pause" flag 50 | # Activates Agent 2 51 | 52 | # If 3 pressed: 53 | # Turns off "pause" flag 54 | # Activates Agent 3 55 | 56 | from flask import Flask, render_template, session, request 57 | from flask_socketio import SocketIO, emit 58 | import threading 59 | import time 60 | import keyboard 61 | import random 62 | import logging 63 | from rich import print 64 | 65 | from audio_player import AudioManager 66 | from eleven_labs import ElevenLabsManager 67 | from openai_chat import OpenAiManager 68 | from whisper_openai import WhisperManager 69 | from obs_websockets import OBSWebsocketsManager 70 | from ai_prompts import * 71 | 72 | socketio = SocketIO 73 | app = Flask(__name__) 74 | app.config['SERVER_NAME'] = "127.0.0.1:5151" 75 | socketio = SocketIO(app, async_mode="threading") 76 | log = logging.getLogger('werkzeug') # Sets flask app to only print error messages, rather than all info logs 77 | log.setLevel(logging.ERROR) 78 | 79 | @app.route("/") 80 | def home(): 81 | return render_template('index.html') 82 | 83 | @socketio.event 84 | def connect(): 85 | print("[green]The server connected to client!") 86 | 87 | obswebsockets_manager = OBSWebsocketsManager() 88 | whisper_manager = WhisperManager() 89 | elevenlabs_manager = ElevenLabsManager() 90 | audio_manager = AudioManager() 91 | 92 | speaking_lock = threading.Lock() 93 | conversation_lock = threading.Lock() 94 | 95 | agents_paused = False 96 | 97 | # Class that represents a single ChatGPT Agent and its information 98 | class Agent(): 99 | 100 | def __init__(self, agent_name, agent_id, filter_name, all_agents, system_prompt, elevenlabs_voice): 101 | # Flag of whether this agent should begin speaking 102 | self.activated = False 103 | # Used to identify each agent in the conversation history 104 | self.name = agent_name 105 | # an int used to ID this agent to the frontend code 106 | self.agent_id = agent_id 107 | # the name of the OBS filter to activate when this agent is speaking 108 | # You don't need to use OBS filters as part of this code, it's optional for adding extra visual flair 109 | self.filter_name = filter_name 110 | # A list of the other agents, so that you can pick one to randomly "activate" when you finish talking 111 | self.all_agents = all_agents 112 | # The name of the Elevenlabs voice that you want this agent to speak with 113 | self.voice = elevenlabs_voice 114 | # The name of the txt backup file where this agent's 
conversation history will be stored 115 | backup_file_name = f"backup_history_{agent_name}.txt" 116 | # Initialize the OpenAi manager with a system prompt and a file that you would like to save your conversation too 117 | # If the backup file isn't empty, then it will restore that backed up conversation for this agent 118 | self.openai_manager = OpenAiManager(system_prompt, backup_file_name) 119 | # Optional - tells the OpenAi manager not to print as much 120 | self.openai_manager.logging = False 121 | 122 | def run(self): 123 | while True: 124 | # Wait until we've been activated 125 | if not self.activated: 126 | time.sleep(0.1) 127 | continue 128 | 129 | self.activated = False 130 | print(f"[italic purple] {self.name} has STARTED speaking.") 131 | 132 | # This lock isn't necessary in theory, but for safety we will require this lock whenever updating any agent's convo history 133 | with conversation_lock: 134 | # Generate a response to the conversation 135 | openai_answer = self.openai_manager.chat_with_history("Okay what is your response? Try to be as chaotic and bizarre and adult-humor oriented as possible. Again, 3 sentences maximum.") 136 | openai_answer = openai_answer.replace("*", "") 137 | print(f'[magenta]Got the following response:\n{openai_answer}') 138 | 139 | # Add your new response into everyone else's chat history, then have them save their chat history 140 | # This agent's responses are marked as "assistant" role to itself, so everyone elses messages are "user" role. 141 | for agent in self.all_agents: 142 | if agent is not self: 143 | agent.openai_manager.chat_history.append({"role": "user", "content": f"[{self.name}] {openai_answer}"}) 144 | agent.openai_manager.save_chat_to_backup() 145 | 146 | # Create audio response 147 | tts_file = elevenlabs_manager.text_to_audio(openai_answer, self.voice, False) 148 | 149 | # Process the audio to get subtitles 150 | audio_and_timestamps = whisper_manager.audio_to_text(tts_file, "sentence") 151 | 152 | # Wait here until the current speaker is finished 153 | with speaking_lock: 154 | 155 | # If we're "paused", then simply finish speaking without activating another agent 156 | # Otherwise, pick another agent randomly, then activate it 157 | if not agents_paused: 158 | other_agents = [agent for agent in self.all_agents if agent is not self] 159 | random_agent = random.choice(other_agents) 160 | random_agent.activated = True 161 | 162 | # Activate move filter on the image 163 | obswebsockets_manager.set_filter_visibility("Line In", self.filter_name, True) 164 | 165 | # Play the TTS audio (without pausing) 166 | audio_manager.play_audio(tts_file, False, False, True) 167 | 168 | # While the audio is playing, display each sentence on the front-end 169 | # Each dictionary will look like: {'text': 'here is my speech', 'start_time': 11.58, 'end_time': 14.74} 170 | socketio.emit('start_agent', {'agent_id': self.agent_id}) 171 | try: 172 | for i in range(len(audio_and_timestamps)): 173 | current_sentence = audio_and_timestamps[i] 174 | duration = current_sentence['end_time'] - current_sentence['start_time'] 175 | socketio.emit('agent_message', {'agent_id': self.agent_id, 'text': f"{current_sentence['text']}"}) 176 | time.sleep(duration) 177 | # If this is not the final sentence, sleep for the gap of time inbetween this sentence and the next one starting 178 | if i < (len(audio_and_timestamps) - 1): 179 | time_between_sentences = audio_and_timestamps[i+1]['start_time'] - current_sentence['end_time'] 180 | time.sleep(time_between_sentences) 181 | 
except Exception: 182 | print(f"[magenta] Whoopsie! There was a problem and I don't know why. This was the current_sentence it broke on: {current_sentence}") 183 | socketio.emit('clear_agent', {'agent_id': self.agent_id}) 184 | 185 | time.sleep(1) # Wait one second before the next person talks, otherwise their audio gets cut off 186 | 187 | # Turn off the filter in OBS 188 | obswebsockets_manager.set_filter_visibility("Line In", self.filter_name, False) 189 | 190 | print(f"[italic purple] {self.name} has FINISHED speaking.") 191 | 192 | 193 | # Class that handles human input, this thread is how you can manually activate or pause the other agents 194 | class Human(): 195 | 196 | def __init__(self, name, all_agents): 197 | self.name = name # This will be added to the beginning of the response 198 | self.all_agents = all_agents 199 | 200 | def run(self): 201 | global agents_paused 202 | while True: 203 | 204 | # Speak into mic and add the dialogue to the chat history 205 | if keyboard.is_pressed('num 7'): 206 | 207 | # Toggles "pause" flag - stops other agents from activating additional agents 208 | agents_paused = True 209 | print(f"[italic red] Agents have been paused") 210 | 211 | # Record mic audio from Doug (until he presses '=') 212 | print(f"[italic green] DougDoug has STARTED speaking.") 213 | mic_audio = audio_manager.record_audio(end_recording_key='num 8') 214 | 215 | with conversation_lock: 216 | # Transcribe mic audio into text with Whisper 217 | transcribed_audio = whisper_manager.audio_to_text(mic_audio) 218 | print(f"[teal]Got the following audio from Doug:\n{transcribed_audio}") 219 | 220 | # Add Doug's response into all agents chat history 221 | for agent in self.all_agents: 222 | agent.openai_manager.chat_history.append({"role": "user", "content": f"[{self.name}] {transcribed_audio}"}) 223 | agent.openai_manager.save_chat_to_backup() # Tell the other agents to save their chat history to their backup file 224 | 225 | print(f"[italic magenta] DougDoug has FINISHED speaking.") 226 | 227 | # Activate another agent randomly 228 | agents_paused = False 229 | random_agent = random.randint(0, len(self.all_agents)-1) 230 | print(f"[cyan]Activating Agent {random_agent+1}") 231 | self.all_agents[random_agent].activated = True 232 | 233 | 234 | # "Pause" the other agents. 
235 | # Whoever is currently speaking will finish, but no future agents will be activated 236 | if keyboard.is_pressed('f4'): 237 | print("[italic red] Agents have been paused") 238 | agents_paused = True 239 | time.sleep(1) # Wait for a bit to ensure you don't press this twice in a row 240 | 241 | # Activate Agent 1 242 | if keyboard.is_pressed('num 1'): 243 | print("[cyan]Activating Agent 1") 244 | agents_paused = False 245 | self.all_agents[0].activated = True 246 | time.sleep(1) # Wait for a bit to ensure you don't press this twice in a row 247 | 248 | # Activate Agent 2 249 | if keyboard.is_pressed('num 2'): 250 | print("[cyan]Activating Agent 2") 251 | agents_paused = False 252 | self.all_agents[1].activated = True 253 | time.sleep(1) # Wait for a bit to ensure you don't press this twice in a row 254 | 255 | # Activate Agent 3 256 | if keyboard.is_pressed('num 3'): 257 | print("[cyan]Activating Agent 3") 258 | agents_paused = False 259 | self.all_agents[2].activated = True 260 | time.sleep(1) # Wait for a bit to ensure you don't press this twice in a row 261 | 262 | time.sleep(0.05) 263 | 264 | 265 | 266 | def start_bot(bot): 267 | bot.run() 268 | 269 | if __name__ == '__main__': 270 | 271 | all_agents = [] 272 | 273 | # Agent 1 274 | agent1 = Agent("OSWALD", 1, "Audio Move - Wario Pepper", all_agents, VIDEOGAME_AGENT_1, "Dougsworth") 275 | agent1_thread = threading.Thread(target=start_bot, args=(agent1,)) 276 | agent1_thread.start() 277 | 278 | # Agent 2 279 | agent2 = Agent("TONY KING OF NEW YORK", 2, "Audio Move - Waluigi Pepper", all_agents, VIDEOGAME_AGENT_2, "Tony Emperor of New York") 280 | agent2_thread = threading.Thread(target=start_bot, args=(agent2,)) 281 | agent2_thread.start() 282 | 283 | # Agent 3 284 | agent3 = Agent("VICTORIA", 3, "Audio Move - Gamer Pepper", all_agents, VIDEOGAME_AGENT_3, "Victoria") 285 | agent3_thread = threading.Thread(target=start_bot, args=(agent3,)) 286 | agent3_thread.start() 287 | 288 | all_agents.append(agent1) 289 | all_agents.append(agent2) 290 | all_agents.append(agent3) 291 | 292 | # Human thread 293 | human = Human("DOUGDOUG", all_agents) 294 | human_thread = threading.Thread(target=start_bot, args=(human,)) 295 | human_thread.start() 296 | 297 | print("[italic green]!!AGENTS ARE READY TO GO!!\nPress Num 1, Num 2, or Num3 to activate an agent.\nPress F7 to speak to the agents.") 298 | 299 | socketio.run(app) 300 | 301 | agent1_thread.join() 302 | agent2_thread.join() 303 | agent3_thread.join() 304 | human_thread.join() -------------------------------------------------------------------------------- /obs_websockets.py: -------------------------------------------------------------------------------- 1 | import time 2 | from obswebsocket import obsws, requests # noqa: E402 3 | from websockets_auth import WEBSOCKET_HOST, WEBSOCKET_PORT, WEBSOCKET_PASSWORD 4 | 5 | ########################################################## 6 | ########################################################## 7 | 8 | class OBSWebsocketsManager: 9 | ws = None 10 | 11 | def __init__(self): 12 | # Connect to websockets 13 | self.ws = obsws(WEBSOCKET_HOST, WEBSOCKET_PORT, WEBSOCKET_PASSWORD) 14 | self.ws.connect() 15 | print("Connected to OBS Websockets!\n") 16 | 17 | def disconnect(self): 18 | self.ws.disconnect() 19 | 20 | # Set the current scene 21 | def set_scene(self, new_scene): 22 | self.ws.call(requests.SetCurrentProgramScene(sceneName=new_scene)) 23 | 24 | # Set the visibility of any source's filters 25 | def set_filter_visibility(self, source_name, 
filter_name, filter_enabled=True): 26 | self.ws.call(requests.SetSourceFilterEnabled(sourceName=source_name, filterName=filter_name, filterEnabled=filter_enabled)) 27 | 28 | # Set the visibility of any source 29 | def set_source_visibility(self, scene_name, source_name, source_visible=True): 30 | response = self.ws.call(requests.GetSceneItemId(sceneName=scene_name, sourceName=source_name)) 31 | myItemID = response.datain['sceneItemId'] 32 | self.ws.call(requests.SetSceneItemEnabled(sceneName=scene_name, sceneItemId=myItemID, sceneItemEnabled=source_visible)) 33 | 34 | # Returns the current text of a text source 35 | def get_text(self, source_name): 36 | response = self.ws.call(requests.GetInputSettings(inputName=source_name)) 37 | return response.datain["inputSettings"]["text"] 38 | 39 | # Returns the text of a text source 40 | def set_text(self, source_name, new_text): 41 | self.ws.call(requests.SetInputSettings(inputName=source_name, inputSettings = {'text': new_text})) 42 | 43 | def get_source_transform(self, scene_name, source_name): 44 | response = self.ws.call(requests.GetSceneItemId(sceneName=scene_name, sourceName=source_name)) 45 | myItemID = response.datain['sceneItemId'] 46 | response = self.ws.call(requests.GetSceneItemTransform(sceneName=scene_name, sceneItemId=myItemID)) 47 | transform = {} 48 | transform["positionX"] = response.datain["sceneItemTransform"]["positionX"] 49 | transform["positionY"] = response.datain["sceneItemTransform"]["positionY"] 50 | transform["scaleX"] = response.datain["sceneItemTransform"]["scaleX"] 51 | transform["scaleY"] = response.datain["sceneItemTransform"]["scaleY"] 52 | transform["rotation"] = response.datain["sceneItemTransform"]["rotation"] 53 | transform["sourceWidth"] = response.datain["sceneItemTransform"]["sourceWidth"] # original width of the source 54 | transform["sourceHeight"] = response.datain["sceneItemTransform"]["sourceHeight"] # original width of the source 55 | transform["width"] = response.datain["sceneItemTransform"]["width"] # current width of the source after scaling, not including cropping. If the source has been flipped horizontally, this number will be negative. 56 | transform["height"] = response.datain["sceneItemTransform"]["height"] # current height of the source after scaling, not including cropping. If the source has been flipped vertically, this number will be negative. 57 | transform["cropLeft"] = response.datain["sceneItemTransform"]["cropLeft"] # the amount cropped off the *original source width*. This is NOT scaled, must multiply by scaleX to get current # of cropped pixels 58 | transform["cropRight"] = response.datain["sceneItemTransform"]["cropRight"] # the amount cropped off the *original source width*. This is NOT scaled, must multiply by scaleX to get current # of cropped pixels 59 | transform["cropTop"] = response.datain["sceneItemTransform"]["cropTop"] # the amount cropped off the *original source height*. This is NOT scaled, must multiply by scaleY to get current # of cropped pixels 60 | transform["cropBottom"] = response.datain["sceneItemTransform"]["cropBottom"] # the amount cropped off the *original source height*. This is NOT scaled, must multiply by scaleY to get current # of cropped pixels 61 | return transform 62 | 63 | # The transform should be a dictionary containing any of the following keys with corresponding values 64 | # positionX, positionY, scaleX, scaleY, rotation, width, height, sourceWidth, sourceHeight, cropTop, cropBottom, cropLeft, cropRight 65 | # e.g. 
{"scaleX": 2, "scaleY": 2.5} 66 | # Note: there are other transform settings, like alignment, etc, but these feel like the main useful ones. 67 | # Use get_source_transform to see the full list 68 | def set_source_transform(self, scene_name, source_name, new_transform): 69 | response = self.ws.call(requests.GetSceneItemId(sceneName=scene_name, sourceName=source_name)) 70 | myItemID = response.datain['sceneItemId'] 71 | self.ws.call(requests.SetSceneItemTransform(sceneName=scene_name, sceneItemId=myItemID, sceneItemTransform=new_transform)) 72 | 73 | # Note: an input, like a text box, is a type of source. This will get *input-specific settings*, not the broader source settings like transform and scale 74 | # For a text source, this will return settings like its font, color, etc 75 | def get_input_settings(self, input_name): 76 | return self.ws.call(requests.GetInputSettings(inputName=input_name)) 77 | 78 | # Get list of all the input types 79 | def get_input_kind_list(self): 80 | return self.ws.call(requests.GetInputKindList()) 81 | 82 | # Get list of all items in a certain scene 83 | def get_scene_items(self, scene_name): 84 | return self.ws.call(requests.GetSceneItemList(sceneName=scene_name)) 85 | 86 | # Immediately ends the stream. Use with caution. 87 | def stop_stream(self): 88 | return self.ws.call(requests.StopStream()) 89 | 90 | -------------------------------------------------------------------------------- /openai_chat.py: -------------------------------------------------------------------------------- 1 | from openai import OpenAI 2 | import tiktoken 3 | import os 4 | from rich import print 5 | import base64 6 | import time 7 | import json 8 | 9 | class OpenAiManager: 10 | 11 | def __init__(self, system_prompt=None, chat_history_backup=None): 12 | """ 13 | Optionally provide a chat_history_backup txt file and a system_prompt string. 14 | If the backup file is provided, we load the chat history from it. 15 | If the backup file already exists, then we don't add the system prompt into the convo history, because we assume that it already has a system prompt in it. 16 | Alternatively you manually add new system prompts into the chat history at any point. 17 | """ 18 | 19 | self.client = OpenAI(api_key=os.environ['OPENAI_API_KEY']) 20 | self.logging = True # Determines whether the module should print out its results 21 | self.tiktoken_encoder = None # Used to calculate the token count in messages 22 | self.chat_history = [] 23 | 24 | # If a backup file is provided, we will save our chat history to that file after every call 25 | self.chat_history_backup = chat_history_backup 26 | 27 | # If the backup file already exists, we load its contents into the chat_history 28 | if chat_history_backup and os.path.exists(chat_history_backup): 29 | with open(chat_history_backup, 'r') as file: 30 | self.chat_history = json.load(file) 31 | elif system_prompt: 32 | # If the chat history file doesn't exist, then our chat history is currently empty. 33 | # If we were provided a system_prompt, add it into the chat history as the first message. 34 | self.chat_history.append(system_prompt) 35 | 36 | # Write our current chat history to the txt file 37 | def save_chat_to_backup(self): 38 | if self.chat_history_backup: 39 | with open(self.chat_history_backup, 'w') as file: 40 | json.dump(self.chat_history, file) 41 | 42 | def num_tokens_from_messages(self, messages, model='gpt-4o'): 43 | """Returns the number of tokens used by a list of messages. 
44 | The code below is an adaptation of this text-only version: https://platform.openai.com/docs/guides/chat/managing-tokens 45 | 46 | Note that image tokens are calculated differently from text. 47 | The guide for image token calculation is here: https://platform.openai.com/docs/guides/vision 48 | Short version is that a 1920x1080 image is going to be 1105 tokens, so just using that for all images for now. 49 | In the future I could swap to 'detail: low' and cap it at 85 tokens. Might be necessary for certain use cases. 50 | 51 | There are three message formats we have to check: 52 | Version 1: the 'content' is just a text string 53 | 'content' = 'What are considered some of the most popular characters in videogames?' 54 | Version 2: the content is an array with a single dictionary, with two key/value pairs 55 | 'content' = [{'type': 'text', 'text': 'What are considered some of the most popular characters in videogames?'}] 56 | Version 3: the content is an array with two dictionaries, one for the text portion and one for the image portion 57 | 'content' = [{'type': 'text', 'text': 'Okay now please compare the previous image I sent you with this new image!'}, {'type': 'image_url', 'image_url': {'url': 'https://i.gyazo.com/8ec349446dbb538727e515f2b964224c.png', 'detail': 'high'}}] 58 | """ 59 | try: 60 | if self.tiktoken_encoder == None: 61 | self.tiktoken_encoder = tiktoken.encoding_for_model(model) # We store this value so we don't have to check again every time 62 | num_tokens = 0 63 | for message in messages: 64 | num_tokens += 4 # every message follows {role/name}\n{content}\n 65 | for key, value in message.items(): 66 | if key == 'role': 67 | num_tokens += len(self.tiktoken_encoder.encode(value)) 68 | elif key == 'content': 69 | # In the case that value is just a string, simply get its token value and move on 70 | if isinstance(value, str): 71 | num_tokens += len(self.tiktoken_encoder.encode(value)) 72 | continue 73 | 74 | # In this case the 'content' variables value is an array of dictionaries 75 | for message_data in value: 76 | for content_key, content_value in message_data.items(): 77 | if content_key == 'type': 78 | num_tokens += len(self.tiktoken_encoder.encode(content_value)) 79 | elif content_key == 'text': 80 | num_tokens += len(self.tiktoken_encoder.encode(content_value)) 81 | elif content_key == "image_url": 82 | num_tokens += 1105 # Assumes the image is 1920x1080 and that detail is set to high 83 | num_tokens += 2 # every reply is primed with assistant 84 | return num_tokens 85 | except Exception: 86 | # Either this model is not implemented in tiktoken, or there was some error processing the messages 87 | raise NotImplementedError(f"""num_tokens_from_messages() is not presently implemented for model {model}.""") 88 | 89 | # Asks a question with no chat history 90 | def chat(self, prompt=""): 91 | if not prompt: 92 | print("Didn't receive input!") 93 | return 94 | 95 | # Check that the prompt is under the token context limit 96 | chat_question = [{"role": "user", "content": prompt}] 97 | if self.num_tokens_from_messages(chat_question) > 128000: 98 | print("The length of this chat question is too large for the GPT model") 99 | return 100 | 101 | print("[yellow]\nAsking ChatGPT a question...") 102 | completion = self.client.chat.completions.create( 103 | model="gpt-4o", 104 | messages=chat_question 105 | ) 106 | 107 | # Process the answer 108 | openai_answer = completion.choices[0].message.content 109 | if self.logging: 110 | print(f"[green]\n{openai_answer}\n") 111 | return 
openai_answer 112 | 113 | # Analyze an image without history 114 | # Works with jpg, jpeg, or png. Alternatively can provide an image URL by setting local_image to False 115 | # More info here: https://platform.openai.com/docs/guides/vision 116 | def analyze_image(self, prompt, image_path, local_image=True): 117 | # Use default prompt if one isn't provided 118 | if prompt is None: 119 | prompt = "Please give me a detailed description of this image." 120 | # If this is a local image, encode it into base64. Otherwise just use the provided URL. 121 | if local_image: 122 | try: 123 | with open(image_path, "rb") as image_file: 124 | base64_image = base64.b64encode(image_file.read()).decode("utf-8") 125 | url = f"data:image/jpeg;base64,{base64_image}" 126 | except: 127 | print("[red]ERROR: COULD NOT BASE64 ENCODE THE IMAGE. PANIC!!") 128 | return None 129 | else: 130 | url = image_path # The provided image path is a URL 131 | if self.logging: 132 | print("[yellow]\nAsking ChatGPT to analyze image...") 133 | completion = self.client.chat.completions.create( 134 | model="gpt-4o", 135 | messages=[ 136 | { 137 | "role": "user", 138 | "content": [ 139 | {"type": "text", "text": prompt}, 140 | { 141 | "type": "image_url", 142 | "image_url": { 143 | "url": url, 144 | "detail": "high" 145 | } 146 | }, 147 | ], 148 | }, 149 | ], 150 | max_tokens=4096, # max of 4096 tokens as of Dec 25th 2023 151 | ) 152 | openai_answer = completion.choices[0].message.content 153 | if self.logging: 154 | print(f"[green]\n{openai_answer}\n") 155 | return openai_answer 156 | 157 | 158 | # Asks a question that includes the full conversation history 159 | # Can include a mix of text and images 160 | def chat_with_history(self, prompt="", image_path="", local_image=True): 161 | 162 | # If we received a prompt, add it into our chat history. 163 | # Prompts are technically optional because the Ai can just continue the conversation from where it left off. 164 | if prompt is not None and prompt != "": 165 | # Create a new chat message with the text prompt 166 | new_chat_message = { 167 | "role": "user", 168 | "content": [ 169 | {"type": "text", "text": prompt}, 170 | ], 171 | } 172 | # If an image is provided, add the image url info into our new message. 173 | if image_path != "": 174 | # If this is a local image, we encode it into base64. Otherwise just use the provided URL. 175 | if local_image: 176 | try: 177 | with open(image_path, "rb") as image_file: 178 | base64_image = base64.b64encode(image_file.read()).decode("utf-8") 179 | url = f"data:image/jpeg;base64,{base64_image}" 180 | except: 181 | print("[red]ERROR: COULD NOT BASE64 ENCODE THE IMAGE. PANIC!!") 182 | return None 183 | else: 184 | url = image_path # The provided image path is a URL 185 | new_image_content = { 186 | "type": "image_url", 187 | "image_url": { 188 | "url": url, 189 | "detail": "high" 190 | } 191 | } 192 | new_chat_message["content"].append(new_image_content) 193 | 194 | # Add the new message into our chat history 195 | self.chat_history.append(new_chat_message) 196 | 197 | # Check total token limit. Remove old messages as needed 198 | if self.logging: 199 | print(f"[coral]Chat History has a current token length of {self.num_tokens_from_messages(self.chat_history)}") 200 | while self.num_tokens_from_messages(self.chat_history) > 128000: 201 | self.chat_history.pop(1) # We skip the 1st message since it's the system message 202 | if self.logging: 203 | print(f"Popped a message! 
New token length is: {self.num_tokens_from_messages(self.chat_history)}") 204 | 205 | if self.logging: 206 | print("[yellow]\nAsking ChatGPT a question...") 207 | completion = self.client.chat.completions.create( 208 | model="gpt-4o", 209 | messages=self.chat_history 210 | ) 211 | 212 | # Add this answer to our chat history 213 | self.chat_history.append({"role": completion.choices[0].message.role, "content": completion.choices[0].message.content}) 214 | 215 | # If a backup file was provided, write out convo history to the txt file 216 | self.save_chat_to_backup() 217 | 218 | # Return answer 219 | openai_answer = completion.choices[0].message.content 220 | if self.logging: 221 | print(f"[green]\n{openai_answer}\n") 222 | return openai_answer 223 | 224 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | elevenlabs==1.7.0 2 | Flask==2.2.3 3 | Flask_SocketIO==5.3.3 4 | keyboard==0.13.5 5 | mutagen==1.46.0 6 | obs_websocket_py==1.0 7 | openai==1.44.0 8 | PyAudio==0.2.14 9 | pydub==0.25.1 10 | pygame==2.3.0 11 | pygame_ce==2.3.0 12 | rich==13.8.0 13 | soundfile==0.12.1 14 | tiktoken==0.7.0 15 | torch==2.3.0+cu118 16 | transformers==4.40.2 17 | -------------------------------------------------------------------------------- /static/css/style.css: -------------------------------------------------------------------------------- 1 | * { 2 | font-family: 'Roboto', sans-serif; 3 | background-color: hsl(250, 24%, 19%); 4 | color: #fff; 5 | scrollbar-width: none; /*firefox support */ 6 | } 7 | 8 | #main-container { 9 | display: flex; 10 | gap: 70px; 11 | } 12 | 13 | .agent-container { 14 | width: 500px; 15 | text-align: center; 16 | } 17 | 18 | .agent-text { 19 | font-size: 40px; 20 | color: rgb(255, 255, 255); 21 | text-shadow: 0px 0px 1px #000, 0px 0px 2px #000, 0px 0px 3px #000, 0px 0px 4px #000, 0px 0px 5px #000, 0px 0px 6px #000, 0px 0px 7px #000; 22 | margin-top: 4px; 23 | 24 | /* These are necessary if I'm animating the letters */ 25 | display: flex; 26 | flex-wrap: wrap; /* Allow flex items to wrap */ 27 | justify-content: center; /* Center the text content */ 28 | width: 100%; 29 | max-width: 600px; /* Ensure it doesn't exceed 600px */ 30 | } 31 | 32 | /* We put all of the letterized letters into these words elements */ 33 | .agent-word { 34 | display: inline-block; 35 | display: flex; 36 | flex-wrap: wrap; 37 | } 38 | 39 | /* Applies to each individual letter */ 40 | .agent-letter { 41 | color: rgb(255, 255, 255); 42 | font-weight: bold; 43 | } -------------------------------------------------------------------------------- /static/js/multiAgent.js: -------------------------------------------------------------------------------- 1 | import Letterize from "https://cdn.skypack.dev/letterizejs@2.0.0"; 2 | import anime from "https://cdn.skypack.dev/animejs@3.2.1"; 3 | 4 | $(document).ready(function() { 5 | 6 | var socket = io(); 7 | 8 | socket.on('start_agent', function(msg, cb) { 9 | console.log("Got data: " + msg) 10 | 11 | $('#agent-container-' + msg.agent_id).animate({ opacity: 1 }, 500); 12 | 13 | if (cb) 14 | cb(); 15 | }); 16 | 17 | // Updates each sentence 18 | socket.on('agent_message', function(msg, cb) { 19 | 20 | $("#agent-text-" + msg.agent_id).text(msg.text) 21 | 22 | // Note that openAiAnimation is NOT a const variable 23 | let openAiAnimation = new Letterize({targets: "#agent-text-" + msg.agent_id, className: "agent-letter"}); 24 | 25 | // Now 
we've turned every letter into its own span, we group all of the letter spans into "word" elements, so that the word elements can wrap around multiple lines appropriately 26 | let $openaiText = $('#agent-text-' + msg.agent_id); // Get the openai-text container 27 | let $letters = $openaiText.find('.agent-letter'); // Get all the letter spans inside the openai_text container 28 | let $newContent = $('
'); // Create a new jQuery object to hold the new structure 29 | let $wordSpan = $(''); // Create a new word span to start with 30 | // Iterate over each letter span to create the word element 31 | $letters.each(function() { 32 | const $letter = $(this); 33 | if ($letter.text().trim() === '') { // Check if the letter is a space 34 | $newContent.append($wordSpan); // Append the current word span to the new content 35 | $newContent.append($letter); // Add the space directly to the new content 36 | $wordSpan = $(''); // Create a new word span for the next word 37 | } else { 38 | $wordSpan.append($letter); // If not a space, append the letter to the current word span 39 | } 40 | }); 41 | $newContent.append($wordSpan); // Append the last word span to the new content 42 | $openaiText.empty().append($newContent.contents()); // Clear the openai_text container and append the new content 43 | 44 | var animation = anime.timeline({ 45 | targets: openAiAnimation.listAll, 46 | delay: anime.stagger(30), 47 | loop: true 48 | }); 49 | animation 50 | .add({translateY: -2, duration: 1000}) 51 | .add({translateY: 0, duration: 1000}); 52 | 53 | if (cb) 54 | cb(); 55 | }); 56 | 57 | socket.on('clear_agent', function (msg, cb) { 58 | console.log("Client received clear message instruction!") 59 | 60 | $('#agent-container-' + msg.agent_id).animate({ opacity: 0 }, 500); 61 | 62 | if (cb) 63 | cb(); 64 | }); 65 | }); -------------------------------------------------------------------------------- /templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | DougDoug Multi-Agent GPT Characters 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
16 |     <div id="main-container">
17 |         <div class="agent-container" id="agent-container-1">
18 |             <div class="agent-text" id="agent-text-1">Agent 1 Text will be here!</div>
19 |         </div>
20 |         <div class="agent-container" id="agent-container-2">
21 |             <div class="agent-text" id="agent-text-2">Agent 2 Text will be here!</div>
22 |         </div>
23 |         <div class="agent-container" id="agent-container-3">
24 |             <div class="agent-text" id="agent-text-3">Agent 3 Text will be here!</div>
25 |         </div>
26 | 27 | -------------------------------------------------------------------------------- /websockets_auth.py: -------------------------------------------------------------------------------- 1 | WEBSOCKET_HOST = "localhost" 2 | WEBSOCKET_PORT = 4455 3 | WEBSOCKET_PASSWORD = "TwitchChat9" -------------------------------------------------------------------------------- /whisper_openai.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline 3 | from rich import print 4 | import time 5 | 6 | class WhisperManager(): 7 | 8 | # Uses Whisper on HuggingFace: https://huggingface.co/openai/whisper-large-v3 9 | # Need to make sure you've installed torch with CUDA support, rather than just default torch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 10 | # I tried a lot but could not get Flash Attention 2 to install. It would speed up performance but isn't necessary. 11 | 12 | def __init__(self): 13 | print(torch.cuda.is_available()) # Should return True if CUDA is available 14 | print(torch.cuda.get_device_name(0)) # Should return the name of your GPU, e.g., "NVIDIA GeForce RTX 4070 Ti" 15 | device = "cuda:0" if torch.cuda.is_available() else "cpu" 16 | torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 17 | model_id = "openai/whisper-large-v3" 18 | 19 | model = AutoModelForSpeechSeq2Seq.from_pretrained( 20 | model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True 21 | ) 22 | model.to(device) 23 | model.generation_config.is_multilingual = False 24 | model.generation_config.language = "en" 25 | 26 | processor = AutoProcessor.from_pretrained(model_id) 27 | 28 | self.pipe = pipeline( 29 | "automatic-speech-recognition", 30 | model=model, 31 | tokenizer=processor.tokenizer, 32 | feature_extractor=processor.feature_extractor, 33 | max_new_tokens=256, 34 | chunk_length_s=30, 35 | batch_size=16, 36 | return_timestamps=True, 37 | torch_dtype=torch_dtype, 38 | device=device, 39 | ) 40 | 41 | # Converts an audio file into transcribed text. Can provide also provide timestamps 42 | # wav and mp3 files appear to take the same amount of time to process 43 | # With test files, word timestamps took 3.5-4 seconds, sentence timestamps took 2.2 seconds, no timestamps took 1.9-2 seconds 44 | def audio_to_text(self, audio_file, timestamps=None): 45 | if timestamps == None: 46 | result = self.pipe(audio_file, return_timestamps=False) 47 | elif timestamps == "sentence": 48 | result = self.pipe(audio_file, return_timestamps=True) 49 | elif timestamps == "word": 50 | result = self.pipe(audio_file, return_timestamps="word") 51 | else: 52 | result = {"text": " "} 53 | if timestamps == None: 54 | # If they didn't want the timestamps, then just return the text 55 | return result["text"] 56 | else: 57 | # Return an array of dictionaries that contain every sentence/word with its corresponding start and end time 58 | # I reformat the data a bit so that it's more intuitive to work with. 
59 | # Each dictionary will look like: {'text': 'here is my speech', 'start_time': 11.58, 'end_time': 14.74} 60 | timestamped_chunks = [] 61 | for chunk in result['chunks']: 62 | new_chunk = { 63 | 'text': chunk['text'], 64 | 'start_time': chunk['timestamp'][0], 65 | 'end_time': chunk['timestamp'][1] 66 | } 67 | timestamped_chunks.append(new_chunk) 68 | return timestamped_chunks 69 | 70 | --------------------------------------------------------------------------------
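For reference, here is a minimal single-agent sketch of how the modules above fit together outside of the Flask app: record one mic clip, transcribe it with WhisperManager, get a GPT-4o reply using one of the prompts from ai_prompts.py, then synthesize the reply with ElevenLabs and play it back. This is not a file from the repo; the voice name and backup filename are placeholders you would swap for your own, and it assumes the same API keys and dependencies described in the README.

# single_agent_example.py (illustrative only, not part of the repo)
from audio_player import AudioManager
from eleven_labs import ElevenLabsManager
from openai_chat import OpenAiManager
from whisper_openai import WhisperManager
from ai_prompts import VIDEOGAME_AGENT_1

audio_manager = AudioManager()
whisper_manager = WhisperManager()
elevenlabs_manager = ElevenLabsManager()
# Backup filename is arbitrary; if the file already exists its history is reloaded
openai_manager = OpenAiManager(VIDEOGAME_AGENT_1, "backup_history_EXAMPLE.txt")

# Record from the default mic until '=' is pressed, then transcribe it to text
mic_recording = audio_manager.record_audio(end_recording_key='=')
user_text = whisper_manager.audio_to_text(mic_recording)

# Ask the agent for a reply, with the full conversation history included
reply = openai_manager.chat_with_history(f"[DOUGDOUG] {user_text}")

# Convert the reply to speech ("My Voice" must be a voice that exists in your ElevenLabs account),
# then play it, blocking until playback finishes and deleting the temp file afterwards
tts_file = elevenlabs_manager.text_to_audio(reply, voice="My Voice", save_as_wave=False)
audio_manager.play_audio(tts_file, sleep_during_playback=True, delete_file=True, play_using_music=True)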